Configuring Lily HBase NRT Indexer Service for Use with Cloudera Search
The Lily HBase NRT Indexer Service is a flexible, scalable, fault-tolerant, transactional, near-real-time (NRT) system for processing a continuous stream of HBase cell updates into live search indexes. Typically, content ingested into HBase can appear in search results within seconds, though this latency is tunable. The Lily HBase Indexer uses SolrCloud to index data stored in HBase. As HBase applies inserts, updates, and deletes to HBase table cells, the indexer keeps Solr consistent with the HBase table contents, using standard HBase replication features. The indexer supports flexible, custom, application-specific rules to extract, transform, and load HBase data into Solr. Solr search results can contain columnFamily:qualifier links back to the data stored in HBase, so applications can use the search result set to directly access the matching raw HBase cells. Indexing and searching do not affect the operational stability or write throughput of HBase because the indexing and searching processes are separate from and asynchronous to HBase.
The Lily HBase NRT Indexer Service must be deployed in an environment with a running HBase cluster, a running SolrCloud cluster, and at least one ZooKeeper cluster. This can be done with or without Cloudera Manager. See the HBase (Keystore) Indexer Service topic in Managing Clusters with Cloudera Manager for more information.
Enabling cluster-wide HBase replication
The Lily HBase Indexer is implemented using HBase replication, presenting indexers as regionservers of the slave cluster. This requires enabling HBase replication on the HBase cluster as well as on the individual tables to be indexed. An example of the settings required for cluster-wide HBase replication is provided in /usr/share/doc/hbase-solr-doc*/demo/hbase-site.xml. Add these settings to every hbase-site.xml configuration file on the HBase cluster, except for the replication.replicationsource.implementation property, which does not need to be added. For example, you could do this using the Cloudera Manager HBase Indexer Service GUI. After making these updates, restart your HBase cluster.
Pointing a Lily HBase NRT Indexer Service at an HBase cluster that needs to be indexed
Configure each Lily HBase NRT Indexer Service with the location of the ZooKeeper ensemble that is used by the target HBase cluster. This must be done before starting the Lily HBase NRT Indexer Services. Add the following property to /etc/hbase-solr/conf/hbase-indexer-site.xml, replacing hbase-cluster-zookeeper with the actual ensemble string as found in the hbase-site.xml configuration file:
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>hbase-cluster-zookeeper</value>
</property>
Configure all Lily HBase NRT Indexer Services to use the same ZooKeeper ensemble to coordinate with one another. Add the following property to /etc/hbase-solr/conf/hbase-indexer-site.xml, replacing hbase-cluster-zookeeper:2181 with the actual ensemble string:
<property>
  <name>hbaseindexer.zookeeper.connectstring</name>
  <value>hbase-cluster-zookeeper:2181</value>
</property>
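One way to look up the ensemble string in hbase-site.xml, or to double-check the values after editing hbase-indexer-site.xml, is a small shell helper like the sketch below. The function name get_property is hypothetical, and the location /etc/hbase/conf/hbase-site.xml in the usage comment is an assumption about a typical package layout; adjust it for your deployment. The helper is deliberately rough: it assumes <name> immediately precedes <value> inside each <property> element, treats dots in the property name as regular-expression wildcards, and does not handle XML comments or entities.

```shell
# Print the <value> of the named property from a Hadoop-style XML config file.
# Sketch only: assumes <name> is directly followed by <value> in each property.
get_property() {
  file=$1
  name=$2
  # Join the file into one line so the pattern can span line breaks,
  # then capture the text between <value> and </value> after the name.
  tr -d '\n' < "$file" |
    sed -n "s|.*<name>[[:space:]]*$name[[:space:]]*</name>[[:space:]]*<value>\([^<]*\)</value>.*|\1|p"
}

# Example usage (paths are assumptions, adjust as needed):
#   get_property /etc/hbase/conf/hbase-site.xml hbase.zookeeper.quorum
#   get_property /etc/hbase-solr/conf/hbase-indexer-site.xml hbaseindexer.zookeeper.connectstring
```

If the helper prints nothing, the property is either missing from the file or laid out in a way this simple pattern does not cover, so inspect the file by hand.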
Starting a Lily HBase NRT Indexer Service
You can use the Cloudera Manager GUI to start the Lily HBase NRT Indexer Service on a set of machines. In non-managed deployments, you can start (or restart) a Lily HBase Indexer daemon manually on the local host with the following command:
sudo service hbase-solr-indexer restart
After starting the Lily HBase NRT Indexer Services, you can verify that all daemons are running using the jps tool from the Oracle JDK, which you can obtain from the Java SE Downloads page. If you are running a pseudo-distributed HDFS installation and a Lily HBase NRT Indexer Service installation on one machine, jps shows output similar to the following:
$ sudo jps -lm
31407 sun.tools.jps.Jps -lm
26393 com.ngdata.hbaseindexer.Main
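For scripted checks, the same verification can be reduced to a search for the indexer's main class in the jps output. The sketch below assumes the class name com.ngdata.hbaseindexer.Main shown in the sample output above; the function name indexer_running is hypothetical.

```shell
# Return 0 if the given jps output lists the Lily HBase Indexer main class,
# and non-zero otherwise. The output is passed in as the first argument so
# the check can also be run against saved output.
indexer_running() {
  printf '%s\n' "$1" | grep -q 'com\.ngdata\.hbaseindexer\.Main'
}

# Example usage on a live host (requires a JDK that provides jps):
#   indexer_running "$(sudo jps -lm)" && echo "indexer daemon is running"
```

Passing the output as an argument keeps the check separate from the call to jps, which may need sudo to see daemons owned by other users.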