Lily HBase Near Real Time Indexing for Cloudera Search

The Lily HBase NRT Indexer service is a flexible, scalable, fault-tolerant, transactional, near real-time (NRT) system for processing a continuous stream of HBase cell updates into live search indexes. Typically it takes seconds for data ingested into HBase to appear in search results; this duration is tunable. The Lily HBase Indexer uses SolrCloud to index data stored in HBase. As HBase applies inserts, updates, and deletes to HBase table cells, the indexer keeps Solr consistent with the HBase table contents, using standard HBase replication. The indexer supports flexible custom application-specific rules to extract, transform, and load HBase data into Solr. Solr search results can contain columnFamily:qualifier links back to the data stored in HBase. This way, applications can use the Search result set to directly access matching raw HBase cells. Indexing and searching do not affect operational stability or write throughput of HBase because the indexing and searching processes are separate and asynchronous to HBase.

To accommodate the HBase ingest load, you can run as many Lily HBase Indexer services on different hosts as required. Because the indexing work is shared by all indexers, you can scale the service by adding more indexers. You can co-locate Lily HBase Indexer services with Solr servers on the same set of hosts.

The Lily HBase NRT Indexer service must be deployed in an environment with a running HBase cluster, a running SolrCloud cluster (the Solr service in Cloudera Manager), and at least one ZooKeeper quorum. This can be done with or without Cloudera Manager. See Managing Services for more information on adding services such as the Lily HBase Indexer Service.

Enabling Cluster-wide HBase Replication

The Lily HBase Indexer is implemented using HBase replication, presenting indexers as RegionServers of the worker cluster. This requires enabling replication on the HBase cluster as a whole, as well as on each individual table to be indexed. To enable cluster-wide replication:

Cloudera Manager:

  1. Go to HBase service > Configuration > Category > Backup.
  2. Select the Enable Replication checkbox.
  3. Set Replication Source Ratio to 1.
  4. Set Replication Batch Size to 1000.
  5. Click Save Changes.
  6. Restart the HBase service (HBase service > Actions > Restart).

Unmanaged:

  1. Add the following properties within the <configuration> tags in /etc/hbase/conf/hbase-site.xml on every HBase cluster node:
    <configuration>
      <property>
        <name>hbase.replication</name>
        <value>true</value>
      </property>
      <!-- Source ratio of 100% makes sure that each SEP consumer is actually
           used (otherwise, some can sit idle, especially with small clusters) -->
      <property>
        <name>replication.source.ratio</name>
        <value>1.0</value>
      </property>
      <!-- Maximum number of hlog entries to replicate in one go. If this is
           large, and a consumer takes a while to process the events, the
           HBase rpc call will time out. -->
      <property>
        <name>replication.source.nb.capacity</name>
        <value>1000</value>
      </property>
    </configuration>
    
  2. Restart the HBase services on all HBase cluster nodes.
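
Because the same three properties must be present on every HBase cluster node, it can help to check each node's file after editing it. The following Python sketch is not part of the Lily HBase Indexer distribution; it uses only the standard library, and the names EXPECTED, read_properties, replication_mismatches, and check_file are hypothetical helpers introduced for illustration. It assumes the file follows the usual Hadoop-style <configuration>/<property> layout shown above.

```python
import xml.etree.ElementTree as ET

# Values taken from the unmanaged steps above.
EXPECTED = {
    "hbase.replication": "true",
    "replication.source.ratio": "1.0",
    "replication.source.nb.capacity": "1000",
}


def read_properties(root):
    """Return {name: value} for every <property> under a parsed config root."""
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def _same(found, wanted):
    """Compare numerically when both parse as numbers, else as lowercase text."""
    try:
        return float(found) == float(wanted)
    except (TypeError, ValueError):
        return found is not None and found.lower() == wanted.lower()


def replication_mismatches(xml_text, expected=EXPECTED):
    """Return {name: (found, wanted)} for settings that are missing or differ."""
    props = read_properties(ET.fromstring(xml_text))
    return {
        name: (props.get(name), wanted)
        for name, wanted in expected.items()
        if not _same(props.get(name), wanted)
    }


def check_file(path="/etc/hbase/conf/hbase-site.xml"):
    """Hypothetical convenience wrapper: check one hbase-site.xml on disk."""
    with open(path, "rb") as handle:
        return replication_mismatches(handle.read())
```

Calling check_file() on a node returns an empty dictionary when all three settings match, and otherwise maps each offending property name to the value found (None if absent) and the value expected. The numeric comparison means a ratio written as 1 is treated the same as 1.0.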

Adding the Lily HBase Indexer Service

In Cloudera Manager, the Lily HBase Indexer service is called Key-Value Store Indexer, and the service role is called Lily HBase Indexer. To add the service, follow the instructions in Adding a Service.

For unmanaged environments, install the packages as described in Installing the Lily HBase Indexer Service.

Configure HBase Dependency for Lily HBase NRT Indexer Service

Before starting Lily HBase NRT Indexer services, you must configure individual services with the location of the ZooKeeper ensemble that is used by the target HBase cluster. In Cloudera Manager, this is handled automatically when you set the HBase Service dependency (Key-Value Store Indexer service > Configuration).

For unmanaged environments:

  1. Add the following property to /etc/hbase-solr/conf/hbase-indexer-site.xml on each host that runs the Lily HBase Indexer service. Replace the example value with the value of hbase.zookeeper.quorum from /etc/hbase/conf/hbase-site.xml.

    Unlike other ZooKeeper quorum configuration properties, the hbase.zookeeper.quorum property does not include the ZooKeeper port or znode:

    <property>
       <name>hbase.zookeeper.quorum</name>
       <value>zk01.example.com,zk02.example.com,zk03.example.com</value>
    </property> 
  2. Configure all Lily HBase Indexers to use a ZooKeeper quorum to coordinate with one another. You can use the same ZooKeeper quorum as the HBase service. Add the following property to /etc/hbase-solr/conf/hbase-indexer-site.xml, and replace the hostnames with your ZooKeeper hostnames.

    For this configuration property, the ZooKeeper designation includes the port:

    <property>
       <name>hbaseindexer.zookeeper.connectstring</name>
       <value>zk01.example.com:2181,zk02.example.com:2181,zk03.example.com:2181</value>
    </property> 
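
The two properties above take differently formatted values: hbase.zookeeper.quorum lists bare hostnames, while hbaseindexer.zookeeper.connectstring lists host:port pairs. Mixing the two up is an easy mistake, so a small format check can be useful before starting the service. The Python sketch below is illustrative only; quorum_problems and connect_string_problems are hypothetical helper names, and the checks are deliberately simple (they assume hostnames or IPv4 addresses, and they flag any trailing path on the connect string rather than trying to interpret it).

```python
def quorum_problems(quorum):
    """Check hbase.zookeeper.quorum: comma-separated hosts, no port, no znode."""
    problems = []
    for host in quorum.split(","):
        host = host.strip()
        if not host:
            problems.append("empty host entry")
        elif ":" in host or "/" in host:
            problems.append(f"{host!r} must not include a port or znode")
    return problems


def connect_string_problems(connect):
    """Check hbaseindexer.zookeeper.connectstring: comma-separated host:port."""
    problems = []
    for entry in connect.split(","):
        entry = entry.strip()
        host, sep, port = entry.partition(":")
        # Require a non-empty host, a colon, and an all-digit port.
        if not host or not sep or not port.isdigit():
            problems.append(f"{entry!r} is not in host:port form")
    return problems
```

Both functions return a list of human-readable problems, which is empty when the value has the expected shape. For instance, passing the example quorum value from step 1 to quorum_problems yields no problems, while passing it to connect_string_problems reports every entry because the ports are missing.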

Configuring Lily HBase Indexer Security

The Lily HBase Indexer supports Kerberos for authentication and Apache Sentry for authorization. For more information, see Configuring Lily HBase Indexer Security.

Starting the Lily HBase NRT Indexer Service

You can use Cloudera Manager to start the Lily HBase Indexer service (Key-Value Store Indexer service > Actions > Start). In unmanaged deployments, you can start or restart a Lily HBase Indexer daemon manually on a host using the following command:

$ sudo service hbase-solr-indexer restart

After starting the Lily HBase NRT Indexer services, use the jps tool from the Oracle JDK, which you can obtain from the Java SE Downloads page, to verify that all daemons are running. For example:

$ sudo jps -lm
31407 sun.tools.jps.Jps -lm
26393 com.ngdata.hbaseindexer.Main
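
When several hosts run indexers, scanning jps output by eye becomes tedious. The Python sketch below shows one way to pick the indexer process out of jps -lm output like the sample above. It is an illustration rather than a supplied tool: indexer_pids and running_indexer_pids are hypothetical names, and the sketch assumes that jps is on the PATH and that the indexer's main class is com.ngdata.hbaseindexer.Main, as in the sample.

```python
import subprocess

INDEXER_MAIN = "com.ngdata.hbaseindexer.Main"


def indexer_pids(jps_output, main_class=INDEXER_MAIN):
    """Return the PIDs of lines in `jps -lm` output whose main class matches."""
    pids = []
    for line in jps_output.splitlines():
        fields = line.split()
        # Each line is "<pid> <main class> [arguments...]".
        if len(fields) >= 2 and fields[0].isdigit() and fields[1] == main_class:
            pids.append(int(fields[0]))
    return pids


def running_indexer_pids():
    """Run `jps -lm` on the local host and return matching PIDs.

    Like the manual command, this may need to run as a user that is
    allowed to see the indexer's JVM.
    """
    result = subprocess.run(
        ["jps", "-lm"], capture_output=True, text=True, check=True
    )
    return indexer_pids(result.stdout)
```

An empty list from running_indexer_pids() suggests that no indexer daemon is visible on that host, in which case the service log is the next place to look.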

Once the service is running, you can create and manage indexers. Continue to Using the Lily HBase NRT Indexer Service.