Deploying Cloudera Search

When you deploy Cloudera Search, SolrCloud partitions your data set across multiple indexes and processes and uses ZooKeeper to simplify management. The result is a cluster of coordinating Solr servers.

Installing and Starting ZooKeeper Server

SolrCloud mode uses a ZooKeeper Service as a highly available, central location for cluster management. For a small cluster, running a ZooKeeper host collocated with the NameNode is recommended. For larger clusters, you may want to run multiple ZooKeeper servers. For more information, see Installing the ZooKeeper Packages.

Initializing Solr

Once the ZooKeeper Service is running, configure each Solr host with the ZooKeeper Quorum address. Provide one address for each ZooKeeper server: a single address in a small deployment, or several addresses if you deploy additional servers.

Configure the ZooKeeper Quorum address in /etc/default/solr by editing the SOLR_ZK_ENSEMBLE property. You must make this change on every Solr Server host. The following example shows a configuration with three ZooKeeper hosts:

SOLR_ZK_ENSEMBLE=<zkhost1>:2181,<zkhost2>:2181,<zkhost3>:2181/solr
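The value above is a comma-separated list of host:port pairs followed by a single /solr path. As a minimal sketch, the helper below assembles such a value from a list of hosts. The function name zk_ensemble and the example.com hostnames are hypothetical and not part of Cloudera Search.

```shell
# Build a SOLR_ZK_ENSEMBLE value from a list of ZooKeeper hostnames.
# Usage: zk_ensemble PORT ZNODE HOST [HOST...]
zk_ensemble() (
  port=$1; znode=$2; shift 2
  out=""
  for host in "$@"; do
    # Join host:port pairs with commas.
    out="${out:+$out,}${host}:${port}"
  done
  # The ZooKeeper path (/solr in the example above) is appended once, at the end.
  printf '%s%s\n' "$out" "$znode"
)

# Hypothetical hostnames; substitute your own ZooKeeper servers.
echo "SOLR_ZK_ENSEMBLE=$(zk_ensemble 2181 /solr zk1.example.com zk2.example.com zk3.example.com)"
```

Review the printed line before copying it into /etc/default/solr on each Solr Server host.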

Configuring Solr for Use with HDFS

To use Solr with your established HDFS service, perform the following configurations:

  1. Configure the HDFS URI for Solr to use as a backing store in /etc/default/solr. On every Solr Server host, edit the following property to configure the location of Solr index data in HDFS:
    SOLR_HDFS_HOME=hdfs://namenodehost:8020/solr

    Replace namenodehost with the hostname of your HDFS NameNode (as specified by fs.default.name or fs.defaultFS in your conf/core-site.xml file). You may also need to change the port number from the default (8020). On an HA-enabled cluster, use the nameservice configured for your cluster instead of a hostname. The nameservice appears in fs.default.name or fs.defaultFS and looks like hdfs://nameservice1 or something similar.

  2. In some cases, such as when configuring Solr to work with HDFS High Availability (HA), you may want to configure the Solr HDFS client by setting the HDFS configuration directory in /etc/default/solr. On every Solr Server host, locate the appropriate HDFS configuration directory and set the following property to the absolute path of this directory:
    SOLR_HDFS_CONFIG=/etc/hadoop/conf

    Replace the path with the correct directory containing the proper HDFS configuration files, core-site.xml and hdfs-site.xml.
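Before pointing SOLR_HDFS_CONFIG at a directory, it can help to confirm that the two files named above are present. The following is a minimal sketch. The function name check_hdfs_conf is hypothetical, and the check only tests that the files exist, not that their contents are valid.

```shell
# Report whether a directory contains the HDFS client files Solr needs.
# Usage: check_hdfs_conf DIR   (returns 0 if both files exist, 1 otherwise)
check_hdfs_conf() (
  dir=$1
  status=0
  for f in core-site.xml hdfs-site.xml; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      status=1
    fi
  done
  exit "$status"
)

# Example call; /etc/hadoop/conf is the path used in the step above.
if check_hdfs_conf /etc/hadoop/conf 2>/dev/null; then
  echo "HDFS client configuration looks complete"
else
  echo "HDFS client configuration is incomplete"
fi
```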

Configuring Solr to Use Secure HDFS

If security is enabled, perform the following steps in addition to those in Configuring Solr for Use with HDFS:
  1. Create the Kerberos principals and Keytab files for every host in your cluster:
    1. Create the Solr principal using either kadmin or kadmin.local. (In MIT Kerberos, the -norandkey option used below is available only in kadmin.local.)
      kadmin:  addprinc -randkey solr/fully.qualified.domain.name@YOUR-REALM.COM
      kadmin:  xst -norandkey -k solr.keytab solr/fully.qualified.domain.name

      For more information, see Step 4: Create and Deploy the Kerberos Principals and Keytab Files.

  2. Deploy the Kerberos Keytab files on every host in your cluster:
    1. Copy or move the keytab files to a directory that Solr can access, such as /etc/solr/conf.
      $ sudo mv solr.keytab /etc/solr/conf/
      $ sudo chown solr:hadoop /etc/solr/conf/solr.keytab
      $ sudo chmod 400 /etc/solr/conf/solr.keytab
  3. Add Kerberos-related settings to /etc/default/solr on every host in your cluster, substituting appropriate values:
    SOLR_KERBEROS_ENABLED=true
    SOLR_KERBEROS_KEYTAB=/etc/solr/conf/solr.keytab
    SOLR_KERBEROS_PRINCIPAL=solr/fully.qualified.domain.name@YOUR-REALM.COM
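Because the principal contains the fully qualified domain name, the third setting differs from host to host. As a minimal sketch, the helper below prints the three settings for one host. The function name solr_kerberos_settings, the hostname, and the realm are hypothetical. The default keytab path is the one used in the steps above.

```shell
# Print the Kerberos settings for one Solr host.
# Usage: solr_kerberos_settings FQDN REALM [KEYTAB_PATH]
solr_kerberos_settings() (
  fqdn=$1; realm=$2
  keytab=${3:-/etc/solr/conf/solr.keytab}
  printf 'SOLR_KERBEROS_ENABLED=true\n'
  printf 'SOLR_KERBEROS_KEYTAB=%s\n' "$keytab"
  # The principal embeds the host's fully qualified domain name.
  printf 'SOLR_KERBEROS_PRINCIPAL=solr/%s@%s\n' "$fqdn" "$realm"
)

# Hypothetical host and realm; review the output before adding it to /etc/default/solr.
solr_kerberos_settings search01.example.com EXAMPLE.COM
```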

Creating the /solr Directory in HDFS

Before starting the Cloudera Search server, you need to create the /solr directory in HDFS. The Cloudera Search master runs as solr:solr, so it does not have the required permissions to create a top-level directory.

To create the /solr directory in HDFS:
$ sudo -u hdfs hadoop fs -mkdir /solr
$ sudo -u hdfs hadoop fs -chown solr /solr

Initializing the ZooKeeper Namespace

Before starting the Cloudera Search server, you need to create the solr namespace in ZooKeeper:
$ solrctl init

Starting Solr

To start the cluster, start (or restart) Solr Server on each host:
$ sudo service solr-server restart

To verify that the Solr daemon is running on a host, use the jps tool from the Oracle JDK, which you can obtain from the Java SE Downloads page. If you are running a pseudo-distributed HDFS installation and a Solr search installation on one machine, jps shows output similar to the following:
$ sudo jps -lm
31407 sun.tools.jps.Jps -lm
31236 org.apache.catalina.startup.Bootstrap start
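The check can be scripted by searching the jps output for the Bootstrap class shown above. This is a minimal sketch that assumes the Solr Server is the only servlet container on the host, since any other process with the same main class would also match. The function name solr_server_listed is hypothetical.

```shell
# Succeed if jps output on stdin includes the servlet container process shown above.
solr_server_listed() {
  grep -q 'org\.apache\.catalina\.startup\.Bootstrap'
}

# Sample input copied from the output above; in practice, pipe in `sudo jps -lm`.
if printf '%s\n' \
  '31407 sun.tools.jps.Jps -lm' \
  '31236 org.apache.catalina.startup.Bootstrap start' | solr_server_listed
then
  echo "Solr Server process found"
else
  echo "Solr Server process not found"
fi
```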

Runtime Solr Configuration

To start indexing data with Solr, you must configure a collection to hold the index. A configuration for a collection requires a solrconfig.xml file, a schema.xml file, and any helper files referenced from those XML files. The solrconfig.xml file contains all of the Solr settings for a given collection, and the schema.xml file specifies the schema that Solr uses when indexing documents. For more details on how to configure a collection for your data set, see http://wiki.apache.org/solr/SchemaXml.

Configuration files for a collection are managed as part of the instance directory. To generate a skeleton of the instance directory, run the following command:
$ solrctl instancedir --generate $HOME/solr_configs

You can customize the configuration by directly editing the solrconfig.xml and schema.xml files created in $HOME/solr_configs/conf.

These configuration files are compatible with the standard Solr tutorial example documents.

After configuration is complete, make it available to Solr by issuing the following command, which uploads the content of the entire instance directory to ZooKeeper:
$ solrctl instancedir --create collection1 $HOME/solr_configs

Use the solrctl tool to verify that your instance directory uploaded successfully and is available to ZooKeeper. List the instance directories as follows:
$ solrctl instancedir --list

If you used the earlier --create command to create collection1, the --list command should return collection1.

Creating Your First Solr Collection

By default, the Solr server comes up with no collections. Create your first collection from the instance directory that you uploaded in the previous steps, using the same name for the collection. In the following command, numOfShards is the number of SolrCloud shards across which to partition the collection. The number of shards cannot exceed the total number of Solr servers in your SolrCloud cluster:
$ solrctl collection --create collection1 -s <numOfShards>

To check that the collection is active on, for example, the server myhost.example.com, navigate to http://myhost.example.com:8983/solr/collection1/select?q=*%3A*&wt=json&indent=true and confirm that the request succeeds. Similarly, you should be able to view the topology of your SolrCloud using a URL similar to http://myhost.example.com:8983/solr/#/~cloud.
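The query URL above follows a fixed pattern of host, port, and collection name. As a minimal sketch, the helper below builds it for any collection. The function name solr_select_url is hypothetical, and 8983 is the port used in the example above.

```shell
# Build the query URL used above to confirm that a collection answers requests.
# Usage: solr_select_url HOST COLLECTION [PORT]
solr_select_url() (
  host=$1; collection=$2; port=${3:-8983}
  # %%3A prints the literal, URL-encoded colon in q=*:*.
  printf 'http://%s:%s/solr/%s/select?q=*%%3A*&wt=json&indent=true\n' \
    "$host" "$port" "$collection"
)

url=$(solr_select_url myhost.example.com collection1)
echo "$url"
# With curl installed and the server reachable, a check could then be:
#   curl -fsS "$url"
```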

Adding Another Collection with Replication

To support scaling for the query load, create a second collection with replication. Having multiple servers with replicated collections distributes the request load for each shard. Create a collection with one shard and a replication factor of two. Your cluster must have at least two running servers to support this configuration, so ensure Cloudera Search is installed on at least two servers. A replication factor of two causes two copies of the index files to be stored in two different locations.

  1. Generate the config files for the collection:
    $ solrctl instancedir --generate $HOME/solr_configs2
  2. Upload the instance directory to ZooKeeper:
    $ solrctl instancedir --create collection2 $HOME/solr_configs2
  3. Create the second collection:
    $ solrctl collection --create collection2 -s 1 -r 2
  4. Verify that the collection is live and that the one shard is served by two hosts. For example, for the server myhost.example.com, you should receive content from: http://myhost.example.com:8983/solr/#/~cloud.
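The sizing rules in this guide can be checked before running solrctl collection --create. The following is a minimal sketch: the shard limit comes from the first-collection section, while the replication limit generalizes the "at least two servers for a replication factor of two" requirement and is an assumption. The function name check_layout is hypothetical.

```shell
# Check a planned collection layout against the number of running Solr servers.
# Usage: check_layout SERVERS SHARDS REPLICATION_FACTOR
check_layout() (
  servers=$1; shards=$2; replicas=$3
  if [ "$shards" -gt "$servers" ]; then
    echo "shards ($shards) exceed Solr servers ($servers)" >&2
    exit 1
  fi
  # Assumption: each copy of a shard is placed on a different server.
  if [ "$replicas" -gt "$servers" ]; then
    echo "replication factor ($replicas) exceeds Solr servers ($servers)" >&2
    exit 1
  fi
  echo "ok: $shards shard(s) x $replicas replica(s) on $servers server(s)"
)

# The collection2 layout from the steps above, on a two-server cluster.
check_layout 2 1 2
```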

Creating Replicas of Existing Shards

You can create additional replicas of existing shards using a command of the following form:

solrctl --zk <zkensemble> --solr <target solr server> core \
--create <new core name> -p collection=<collection> -p shard=<shard to replicate>

For example, to create a new replica of shard1 of the collection named collection1, use the following command:

solrctl --zk myZKEnsemble:2181/solr --solr mySolrServer:8983/solr core \
--create collection1_shard1_replica2 -p collection=collection1 -p shard=shard1
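When you add several replicas, it can be convenient to generate these commands rather than type them. The following is a minimal sketch that prints the command without running it. The function name replica_command is hypothetical, and the core naming pattern is inferred from the example above rather than required by solrctl.

```shell
# Print (without running) a solrctl command that adds a replica of one shard.
# Usage: replica_command ZK_ENSEMBLE SOLR_SERVER COLLECTION SHARD REPLICA_NUMBER
replica_command() (
  zk=$1; solr=$2; collection=$3; shard=$4; n=$5
  # Core name follows the pattern of the example above: <collection>_<shard>_replica<N>.
  core="${collection}_${shard}_replica${n}"
  printf 'solrctl --zk %s --solr %s core --create %s -p collection=%s -p shard=%s\n' \
    "$zk" "$solr" "$core" "$collection" "$shard"
)

# Reproduces the example command above.
replica_command myZKEnsemble:2181/solr mySolrServer:8983/solr collection1 shard1 2
```

Inspect the printed command, then run it yourself once the values look right.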

Adding a New Shard to a Solr Server

You can use solrctl to add a new shard to a specified Solr server:

solrctl --solr http://<target_solr_server>:8983/solr core --create <core_name> \
-p dataDir=hdfs://<nameservice>/<index_hdfs_path> -p collection.configName=<config_name> \
-p collection=<collection_name> -p numShards=<int> -p shard=<shard_id>

Where:

  • target_solr_server: The server to host the new shard
  • core_name: <collection_name><shard_id><replica_id>
  • shard_id: New shard identifier

For example, to add a new shard named myShard to the collection named myCollection on a Solr server named mySolrServer, use the following command:

solrctl --solr http://mySolrServer:8983/solr core --create myCore \
-p dataDir=hdfs://<nameservice>/<index_hdfs_path> -p collection.configName=<config_name> \
-p collection=myCollection -p numShards=1 -p shard=myShard