Managing Collections in Cloudera Search

A collection in Cloudera Search is a repository for indexing and querying documents. A collection typically holds documents of the same type that share a similar schema. For example, you might create separate collections for email, Twitter data, logs, forum posts, customer interactions, and so on.

Collections are managed using the solrctl command. For details on solrctl commands and options, see solrctl Reference.

Creating a Solr Collection

If you have enabled Apache Sentry for authorization, you must have update permission for the admin collection as well as the collection you are creating. For example, if you want to create a collection named logs, you need the following Sentry permissions:

solr_admin = collection=admin->action=update, collection=logs->action=update

If you want to be able to create any collection, you can use the wildcard permission:

solr_admin = collection=admin->action=update, collection=*->action=update

For more information on configuring Sentry and granting permissions, see Configuring Sentry Authorization for Cloudera Search.
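
The two policy entries above differ only in the collection name. As a minimal sketch, assuming you maintain policy entries by hand and want them to be consistent, a small helper can print the entry for any collection. The function name sentry_update_policy is hypothetical; the role name solr_admin is taken from the examples above.

```shell
#!/bin/sh
# Print the Sentry policy entry that grants update on the admin collection
# and on one target collection. The role defaults to solr_admin (the name
# used in the examples above); pass a second argument to use another role.
sentry_update_policy() {
  collection="$1"
  role="${2:-solr_admin}"
  printf '%s = collection=admin->action=update, collection=%s->action=update\n' \
    "$role" "$collection"
}

sentry_update_policy logs   # entry for a single collection
sentry_update_policy '*'    # wildcard entry; quoted so the shell does not expand it
```

The helper only prints text. Adding the entry to your Sentry configuration is still done as described in Configuring Sentry Authorization for Cloudera Search.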

To create a collection:

  1. If you are using Kerberos, kinit as a user with permission to create the collection:
    kinit solradmin@EXAMPLE.COM

    Replace EXAMPLE.COM with your Kerberos realm name.

  2. On a host running Solr Server, make sure that the SOLR_ZK_ENSEMBLE environment variable is set in /etc/solr/conf/solr-env.sh. For example:
    $ cat /etc/solr/conf/solr-env.sh
    export SOLR_ZK_ENSEMBLE=zk01.example.com:2181,zk02.example.com:2181,zk03.example.com:2181/solr

    If you are using Cloudera Manager, this is automatically set on hosts with a Solr Server or Gateway role.

  3. Generate configuration files for the collection:
    • Using Configs with Sentry:
      solrctl config --create logs_config predefinedTemplateSecure -p immutable=false
    • Using Configs without Sentry:
      solrctl config --create logs_config predefinedTemplate -p immutable=false
    • Using Instance Directories with Sentry:
      solrctl instancedir --generate $HOME/logs_config
      cp $HOME/logs_config/conf/solrconfig.xml.secure $HOME/logs_config/conf/solrconfig.xml
      # Edit the configuration files as needed
      solrctl instancedir --create logs_config $HOME/logs_config
    • Using Instance Directories without Sentry:
      solrctl instancedir --generate $HOME/logs_config
      # Edit the configuration files as needed
      solrctl instancedir --create logs_config $HOME/logs_config

    For more information on configs and instance directories, see Managing Configuration Using Configs or Instance Directories.

  4. Create a new collection using the specified configuration:
    solrctl collection --create logs -s <numShards> -c logs_config

    Replace <numShards> with the number of shards to create for the collection.
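
Steps 2 through 4 can be combined into a small wrapper. The following is a sketch only, assuming the config-based path without Sentry. The names run, create_collection, and DRY_RUN are hypothetical and introduced for illustration. With DRY_RUN=1 (the default in this sketch), the solrctl commands are printed rather than executed, so you can review them first.

```shell
#!/bin/sh
# DRY_RUN=1 prints each command; any other value executes it.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Usage: create_collection <name> <numShards> <configName>
create_collection() {
  name="$1"; shards="$2"; config="$3"
  # Step 2: stop early if the ZooKeeper ensemble is not configured.
  if [ -z "${SOLR_ZK_ENSEMBLE:-}" ]; then
    echo "SOLR_ZK_ENSEMBLE is not set; check /etc/solr/conf/solr-env.sh" >&2
    return 1
  fi
  # Step 3: generate a mutable config from the non-Sentry template.
  run solrctl config --create "$config" predefinedTemplate -p immutable=false || return 1
  # Step 4: create the collection using that config.
  run solrctl collection --create "$name" -s "$shards" -c "$config"
}
```

For example, after sourcing /etc/solr/conf/solr-env.sh, calling create_collection logs 2 logs_config prints the two solrctl commands for a collection with two shards (the shard count here is an arbitrary illustration). Setting DRY_RUN=0 runs them instead.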

Viewing Existing Solr Collections

You can view existing collections using the solrctl collection --list command. This command does not require Sentry authorization, because it reads the information from ZooKeeper.
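
In scripts, it is often useful to test whether a particular collection already exists before creating or deleting it. The sketch below filters the output of the list command. It assumes the output contains one collection per line with the collection name as the first field; verify this against your version before relying on it. The function name collection_listed is hypothetical.

```shell
#!/bin/sh
# Succeeds (exit status 0) if the given name is the first field of any
# line read from standard input; fails (exit status 1) otherwise.
collection_listed() {
  awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

# Typical use on a host with solrctl configured:
#   solrctl collection --list | collection_listed logs && echo "logs exists"
```

Matching on the whole first field, rather than using a substring search, avoids treating a collection such as logs_old as a match for logs.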

Deleting All Documents in a Solr Collection

Deleting all documents in a Solr collection does not delete the collection or its configuration files; it removes only the indexed documents. This can be useful for rapid prototyping of configuration changes in test environments.

If you have enabled Sentry for authorization, you must have update permission for the admin collection as well as the collection in which you are deleting documents. For example, if you want to delete documents in a collection named logs, you need the following Sentry permissions:

solr_admin = collection=admin->action=update, collection=logs->action=update

If you want to be able to delete documents in any collection, you can use the wildcard permission:

solr_admin = collection=admin->action=update, collection=*->action=update

For more information on configuring Sentry and granting permissions, see Configuring Sentry Authorization for Cloudera Search.

To delete all documents in a collection:

  1. If you are using Kerberos, kinit as a user with permission to delete documents in the collection:
    kinit solradmin@EXAMPLE.COM

    Replace EXAMPLE.COM with your Kerberos realm name.

  2. On a host running Solr Server, make sure that the SOLR_ZK_ENSEMBLE environment variable is set in /etc/solr/conf/solr-env.sh. For example:
    $ cat /etc/solr/conf/solr-env.sh
    export SOLR_ZK_ENSEMBLE=zk01.example.com:2181,zk02.example.com:2181,zk03.example.com:2181/solr

    If you are using Cloudera Manager, this is automatically set on hosts with a Solr Server or Gateway role.

  3. Delete the documents:
    solrctl collection --deletedocs logs
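
Because this command removes every document in the collection, you might want a safeguard in test scripts. The following sketch prints the command only for collections named in an allowlist. The TEST_COLLECTIONS variable, its default value logs_test, and the function deletedocs_cmd are hypothetical and exist only in this example.

```shell
#!/bin/sh
# Space-separated allowlist of collections that are safe to empty.
TEST_COLLECTIONS="${TEST_COLLECTIONS:-logs_test}"

# Print the --deletedocs command for an allowlisted collection,
# or refuse with a message on standard error.
deletedocs_cmd() {
  target="$1"
  for allowed in $TEST_COLLECTIONS; do
    if [ "$allowed" = "$target" ]; then
      echo "solrctl collection --deletedocs $target"
      return 0
    fi
  done
  echo "refusing: $target is not listed in TEST_COLLECTIONS" >&2
  return 1
}
```

The function prints the command rather than running it, so you can review the output before executing it yourself.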

Backing Up and Restoring Solr Collections

CDH 5.9 and higher include a backup/restore mechanism primarily designed to provide disaster recovery capability for Apache Solr. You can create a backup of a Solr collection and restore from this backup if the index is corrupted due to a software bug, or if an administrator accidentally or maliciously deletes a collection or a subset of documents. This procedure can also be used as part of a cluster migration (for example, if you are migrating to a cloud environment), or to recover from a failed upgrade.

For more information, see Backing Up and Restoring Cloudera Search.

Deleting a Solr Collection

Deleting a Solr collection deletes the collection and its index, but does not delete its configuration files.

If you have enabled Sentry for authorization, you must have update permission for the admin collection as well as the collection you are deleting. For example, if you want to delete a collection named logs, you need the following Sentry permissions:

solr_admin = collection=admin->action=update, collection=logs->action=update

If you want to be able to delete any collection, you can use the wildcard permission:

solr_admin = collection=admin->action=update, collection=*->action=update

For more information on configuring Sentry and granting permissions, see Configuring Sentry Authorization for Cloudera Search.

To delete a collection:

  1. If you are using Kerberos, kinit as a user with permission to delete the collection:
    kinit solradmin@EXAMPLE.COM

    Replace EXAMPLE.COM with your Kerberos realm name.

  2. On a host running Solr Server, make sure that the SOLR_ZK_ENSEMBLE environment variable is set in /etc/solr/conf/solr-env.sh. For example:
    $ cat /etc/solr/conf/solr-env.sh
    export SOLR_ZK_ENSEMBLE=zk01.example.com:2181,zk02.example.com:2181,zk03.example.com:2181/solr

    If you are using Cloudera Manager, this is automatically set on hosts with a Solr Server or Gateway role.

  3. Delete the collection:
    solrctl collection --delete logs
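
Because deleting a collection leaves its configuration files in place, a full cleanup may need a second command. The sketch below prints both commands, using the --delete option of solrctl config or solrctl instancedir depending on how the configuration was created; confirm these options in solrctl Reference for your version. The function name delete_collection_cmds is hypothetical.

```shell
#!/bin/sh
# Usage: delete_collection_cmds <collection> <configName> [config|instancedir]
# Prints the command that deletes the collection, followed by the command
# that deletes its configuration. The third argument defaults to instancedir.
delete_collection_cmds() {
  name="$1"; config="$2"; kind="${3:-instancedir}"
  case "$kind" in
    config|instancedir) ;;
    *) echo "kind must be config or instancedir" >&2; return 1 ;;
  esac
  echo "solrctl collection --delete $name"
  echo "solrctl $kind --delete $config"
}

delete_collection_cmds logs logs_config config
```

Delete the configuration only if no other collection uses it, since a single config or instance directory can back more than one collection.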