This is the documentation for Cloudera Search CDH 5 Beta 2 and 1.2.0 for CDH 4.
Documentation for other versions is available at Cloudera Documentation.

Deploying Cloudera Search

When you deploy Cloudera Search, SolrCloud partitions your data set into multiple indexes and processes and uses ZooKeeper to simplify management. The result is a cluster of coordinating Solr servers.

  Note: Before you start

This section assumes that you have already completed the process of installing Search using either Installing Cloudera Search with Cloudera Manager or Installing Cloudera Search without Cloudera Manager. You are now about to distribute the processes across multiple hosts. Before completing this process, you may want to review Choosing where to Deploy the Cloudera Search Processes.

Installing and Starting ZooKeeper Server

SolrCloud mode uses a ZooKeeper Service as a highly available, central location for cluster management. For a small cluster, running a ZooKeeper node collocated with the NameNode is recommended. For larger clusters, you may wish to run multiple ZooKeeper servers. If you decide to run multiple ZooKeeper servers, contact Cloudera Support for configuration help or see the section on Clustered (Multi-Server) Setup in the ZooKeeper Administrator's Guide.

If you are using a single ZooKeeper server, install and start the ZooKeeper service by running the commands shown in the "Installing the ZooKeeper Server Package and Starting ZooKeeper on a Single Server" section of Installing the ZooKeeper Packages.

Initializing Solr

Once your ZooKeeper Service is running, configure each Solr node with the ZooKeeper Quorum address or addresses. Provide the ZooKeeper Quorum address for each ZooKeeper server you have deployed. This could be a single address in smaller deployments, or multiple addresses if you chose to deploy additional servers.

Configure the ZooKeeper Quorum address in /etc/default/solr by editing the SOLR_ZK_ENSEMBLE property so that it holds the address of the ZooKeeper service. You must make this configuration change on every Solr Server host. An example configuration with three ZooKeeper hosts might appear as follows:

SOLR_ZK_ENSEMBLE=<zkhost1>:2181,<zkhost2>:2181,<zkhost3>:2181/solr
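
Because the same value has to be written on every Solr Server host, it can help to generate it from a host list rather than type it by hand. The following is a minimal sketch, not part of Cloudera Search: build_zk_ensemble is a hypothetical helper, and it assumes every ZooKeeper server listens on port 2181 and that the /solr suffix from the example above is used. The hostnames are placeholders.

```shell
# build_zk_ensemble is a hypothetical helper, not part of Cloudera Search.
# It joins the given ZooKeeper hostnames into a SOLR_ZK_ENSEMBLE value,
# assuming port 2181 on every host and the /solr suffix shown above.
build_zk_ensemble() {
  port=2181
  znode=/solr
  ensemble=""
  for host in "$@"; do
    # Separate entries with a comma, but do not emit a leading comma.
    if [ -n "$ensemble" ]; then
      ensemble="$ensemble,"
    fi
    ensemble="$ensemble$host:$port"
  done
  printf '%s%s\n' "$ensemble" "$znode"
}

# Example: print the line to place in /etc/default/solr (placeholder hosts).
echo "SOLR_ZK_ENSEMBLE=$(build_zk_ensemble zk1.example.com zk2.example.com zk3.example.com)"
```

With a single ZooKeeper server, pass one hostname and the value contains one address followed by /solr.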

Configuring Solr for Use with HDFS

To set up Solr for use with your established HDFS service, perform the following configurations:

  1. Configure the HDFS URI for Solr to use as a backing store in /etc/default/solr. Edit the following property to configure the location of Solr index data in HDFS. Do this on every Solr Server host:
    SOLR_HDFS_HOME=hdfs://namenodehost:8020/solr

    Be sure to replace namenodehost with the hostname of your HDFS NameNode (as specified by fs.default.name or fs.defaultFS in your conf/core-site.xml file); you may also need to change the port number from the default (8020). On an HA-enabled cluster, make sure that the HDFS URI you use reflects the nameservice used by your cluster. This value should match the one in fs.default.name; instead of a hostname, you would see hdfs://nameservice1 or something similar.

  2. In some cases, such as for configuring Solr to work with HDFS High Availability (HA), you may want to configure Solr's HDFS client. You can do this by setting the HDFS configuration directory in /etc/default/solr. Locate the appropriate HDFS configuration directory on each node, and edit the following property with the absolute path to this directory. Do this on every Solr Server host:
    SOLR_HDFS_CONFIG=/etc/hadoop/conf

    Be sure to replace the path with the correct directory containing the proper HDFS configuration files, core-site.xml and hdfs-site.xml.
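
Before setting SOLR_HDFS_CONFIG, it may be worth confirming that the chosen directory really contains both files. The following is a minimal sketch, not part of Cloudera Search: check_hdfs_conf_dir is a hypothetical helper, and /etc/hadoop/conf is only the example path from step 2.

```shell
# check_hdfs_conf_dir is a hypothetical helper, not part of Cloudera Search.
# It succeeds only if the directory contains core-site.xml and hdfs-site.xml.
check_hdfs_conf_dir() {
  dir=$1
  for f in core-site.xml hdfs-site.xml; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      return 1
    fi
  done
  return 0
}

# Example: print the setting only when the directory looks complete.
if check_hdfs_conf_dir /etc/hadoop/conf; then
  echo "SOLR_HDFS_CONFIG=/etc/hadoop/conf"
fi
```

The check looks only for the presence of the files; it does not validate their contents.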

Configuring Solr for Use with Secure HDFS

In addition to the steps in Configuring Solr for Use with HDFS, perform the following steps if security is enabled:
  1. Create the Kerberos principals and Keytab files for every node in your cluster:
    1. Create the Solr principal using either kadmin or kadmin.local (for CDH 4, see Create and Deploy the Kerberos Principals and Keytab Files, or for CDH 5, see Create and Deploy the Kerberos Principals and Keytab Files, for information on using kadmin or kadmin.local).
      kadmin:  addprinc -randkey solr/fully.qualified.domain.name@YOUR-REALM.COM
      kadmin:  xst -norandkey -k solr.keytab solr/fully.qualified.domain.name
  2. Deploy the Kerberos Keytab files on every node in your cluster:
    1. Copy or move the keytab files to a directory that Solr can access, such as /etc/solr/conf.
      $ sudo mv solr.keytab /etc/solr/conf/
      $ sudo chown solr:hadoop /etc/solr/conf/solr.keytab
      $ sudo chmod 400 /etc/solr/conf/solr.keytab
  3. Add Kerberos related settings to /etc/default/solr on every node in your cluster, substituting appropriate values:
    SOLR_KERBEROS_ENABLED=true
    SOLR_KERBEROS_KEYTAB=/etc/solr/conf/solr.keytab
    SOLR_KERBEROS_PRINCIPAL=solr/fully.qualified.domain.name@YOUR-REALM.COM
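
Since the principal contains each host's own fully qualified domain name, the settings differ from node to node. The following is a minimal sketch of generating them, not part of Cloudera Search: solr_kerberos_settings is a hypothetical helper, the hostname and realm in the example are placeholders, and the keytab path is the one used in step 2.

```shell
# solr_kerberos_settings is a hypothetical helper, not part of Cloudera Search.
# Arguments: the host's fully qualified domain name and the Kerberos realm.
solr_kerberos_settings() {
  fqdn=$1
  realm=$2
  keytab=/etc/solr/conf/solr.keytab   # path used in step 2 above
  printf 'SOLR_KERBEROS_ENABLED=true\n'
  printf 'SOLR_KERBEROS_KEYTAB=%s\n' "$keytab"
  printf 'SOLR_KERBEROS_PRINCIPAL=solr/%s@%s\n' "$fqdn" "$realm"
}

# Example with placeholder values; on a real host the first argument
# would typically be the output of `hostname -f`.
solr_kerberos_settings search01.example.com EXAMPLE.COM
```

The output is meant to be reviewed and then added to /etc/default/solr on the corresponding node.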

Creating the /solr Directory in HDFS

Before starting the Cloudera Search server, you need to create the /solr directory in HDFS. The Cloudera Search master runs as solr:solr, so it does not have the permissions required to create a top-level directory.

To create the /solr directory in HDFS:
$ sudo -u hdfs hadoop fs -mkdir /solr
$ sudo -u hdfs hadoop fs -chown solr /solr

Initializing ZooKeeper Namespace

Before starting the Cloudera Search server, you need to create the solr namespace in ZooKeeper:
$ solrctl init
  Warning: solrctl init also accepts a --force option. Running solrctl init --force clears the Solr data in ZooKeeper and interferes with any running nodes. If you want to clear Solr data from ZooKeeper to start over, be sure to stop the cluster first.

Starting Solr

To start the cluster, start Solr Server on each node:
$ sudo service solr-server restart
After you have started the Cloudera Search Server, the Solr server should be running. To verify that all daemons are running, use the jps tool from the Oracle JDK, which you can obtain from the Java SE Downloads page. If you are running a pseudo-distributed HDFS installation and a Solr search installation on one machine, jps will show the following output:
$ sudo jps -lm
31407 sun.tools.jps.Jps -lm
31236 org.apache.catalina.startup.Bootstrap start
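
The check above can also be scripted. The following is a minimal sketch, not part of Cloudera Search: solr_server_listed is a hypothetical helper that reads jps -lm output on standard input and looks for the Tomcat bootstrap class shown in the sample output. Note the assumption that no other Tomcat instance runs on the host; any Tomcat process would match as well.

```shell
# solr_server_listed is a hypothetical helper. It reads `jps -lm` output on
# standard input and succeeds if the bootstrap class shown above appears.
solr_server_listed() {
  grep -q 'org\.apache\.catalina\.startup\.Bootstrap start'
}

# Example using the sample output from this section.
printf '%s\n' \
  '31407 sun.tools.jps.Jps -lm' \
  '31236 org.apache.catalina.startup.Bootstrap start' |
  solr_server_listed && echo "Solr Server process found"
```

On a live host, the same function could be fed directly: sudo jps -lm | solr_server_listed.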

Runtime Solr Configuration

To start using Solr to index data, you must configure a collection to hold the index. A configuration for a collection requires a solrconfig.xml file, a schema.xml file, and any helper files referenced from those XML files. The solrconfig.xml file contains all of the Solr settings for a given collection, and the schema.xml file specifies the schema that Solr uses when indexing documents. For more details on how to configure the schema for your data set, see http://wiki.apache.org/solr/SchemaXml.

Configuration files for a collection are managed as part of the instance directory. To generate a skeleton of the instance directory run:
$ solrctl instancedir --generate $HOME/solr_configs

You can customize it by directly editing the solrconfig.xml and schema.xml files that have been created in $HOME/solr_configs/conf.

These configuration files are compatible with the standard Solr tutorial example documents.

Once you are satisfied with the configuration, you can make it available for Solr to use by issuing the following command, which uploads the content of the entire instance directory to ZooKeeper:
$ solrctl instancedir --create collection1 $HOME/solr_configs
You can use the solrctl tool to verify that your instance directory uploaded successfully and is available to ZooKeeper. List the instance directories as follows:
$ solrctl instancedir --list

If you used the earlier --create command to create collection1, the --list command should return collection1.

  Important:

Users who are familiar with Apache Solr might configure a collection directly in solr home: /var/lib/solr. While this is possible, Cloudera discourages this and recommends using solrctl instead.

Creating Your First Solr Collection

By default, the Solr server comes up with no collections. Create your first collection from the instance directory that you uploaded in the previous steps, using the same collection name. In the following command, numOfShards is the number of SolrCloud shards you want to partition the collection across. The number of shards cannot exceed the total number of Solr servers in your SolrCloud cluster:
$ solrctl collection --create collection1 -s {{numOfShards}}
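
The constraint on the shard count can be checked before running the command. The following is a minimal sketch, not part of solrctl: shards_fit is a hypothetical helper, and the shard and server counts in the example are placeholders you would supply yourself.

```shell
# shards_fit is a hypothetical helper, not part of solrctl.
# Arguments: requested number of shards, number of Solr servers.
# Succeeds when both are non-negative integers and 1 <= shards <= servers.
shards_fit() {
  for n in "$1" "$2"; do
    case $n in
      ''|*[!0-9]*) echo "not an integer: '$n'" >&2; return 2 ;;
    esac
  done
  [ "$1" -ge 1 ] && [ "$1" -le "$2" ]
}

# Example: 2 shards on a 3-server cluster; the command is printed, not run.
if shards_fit 2 3; then
  echo "solrctl collection --create collection1 -s 2"
fi
```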

You should be able to check that the collection is active. For example, for the server myhost.example.com, you should be able to navigate to http://myhost.example.com:8983/solr/collection1/select?q=*%3A*&wt=json&indent=true and verify that the collection is active. Similarly, you should also be able to observe the topology of your SolrCloud using a URL similar to: http://myhost.example.com:8983/solr/#/~cloud.
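
The same query can be issued from a terminal instead of a browser. The following is a minimal sketch: select_all_url is a hypothetical helper that rebuilds the example URL above for a given host and collection, assuming port 8983 as in that example.

```shell
# select_all_url is a hypothetical helper. The port (8983) and the query
# string are taken from the example URL above.
select_all_url() {
  host=$1
  collection=$2
  # %% in the printf format produces a literal % in the output.
  printf 'http://%s:8983/solr/%s/select?q=*%%3A*&wt=json&indent=true\n' "$host" "$collection"
}

# Example: print the URL for the host and collection used in this section.
select_all_url myhost.example.com collection1
```

On a machine that can reach the server, the URL could then be fetched with a tool such as curl, for example: curl "$(select_all_url myhost.example.com collection1)". Keep the quotes so the shell does not interpret the & and * characters.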

Adding Another Collection with Replication

To support scaling for query load, create a second collection with replication. Having multiple servers with replicated collections distributes the request load for each shard. In this example, you create a one-shard collection with a replication factor of two. Your cluster must have at least two running servers to support this configuration, so ensure Cloudera Search is installed on at least two servers before continuing. A replication factor of two causes two copies of the index files to be stored in two different locations.

  1. Generate the config files for the collection:
    $ solrctl instancedir --generate $HOME/solr_configs2
  2. Upload the instance directory to ZooKeeper:
    $ solrctl instancedir --create collection2 $HOME/solr_configs2
  3. Create the second collection:
    $ solrctl collection --create collection2 -s 1 -r 2
  4. Verify the collection is live and that your one shard is being served by two nodes. For example, for the server myhost.example.com, you should receive content from: http://myhost.example.com:8983/solr/#/~cloud.