Lily HBase Near Real Time Indexing for Cloudera Search

The Lily HBase NRT Indexer service is a flexible, scalable, fault-tolerant, transactional, near real-time (NRT) system for processing a continuous stream of HBase cell updates into live search indexes. Typically it takes seconds for data ingested into HBase to appear in search results; this duration is tunable. The Lily HBase Indexer uses SolrCloud to index data stored in HBase. As HBase applies inserts, updates, and deletes to HBase table cells, the indexer keeps Solr consistent with the HBase table contents, using standard HBase replication. The indexer supports flexible custom application-specific rules to extract, transform, and load HBase data into Solr. Solr search results can contain columnFamily:qualifier links back to the data stored in HBase. This way, applications can use the Search result set to directly access matching raw HBase cells. Indexing and searching do not affect operational stability or write throughput of HBase because the indexing and searching processes are separate and asynchronous to HBase.

The Lily HBase NRT Indexer service must be deployed in an environment with a running HBase cluster, a running SolrCloud cluster, and at least one ZooKeeper cluster. This can be done with or without Cloudera Manager. See Managing Services for more information on adding services such as the Lily HBase Indexer Service.

Enabling Cluster-wide HBase Replication

The Lily HBase Indexer is implemented using HBase replication, presenting indexers as RegionServers of the worker cluster. This requires enabling replication on the HBase cluster as a whole, as well as on each individual table to be indexed. To enable replication:

Cloudera Manager:

  1. Go to HBase service > Configuration > Category > Backup.
  2. Select the Enable Replication checkbox.
  3. Set Replication Source Ratio to 1.
  4. Set Replication Batch Size to 1000.
  5. Click Save Changes.
  6. Restart the HBase service (HBase service > Actions > Restart).

Unmanaged:

  1. Add the following properties within the <configuration> tags in /etc/hbase/conf/hbase-site.xml on every HBase cluster node:
      <property>
        <name>hbase.replication</name>
        <value>true</value>
      </property>
      <!-- Source ratio of 100% makes sure that each SEP consumer is actually
           used (otherwise, some can sit idle, especially with small clusters) -->
      <property>
        <name>replication.source.ratio</name>
        <value>1.0</value>
      </property>
      <!-- Maximum number of hlog entries to replicate in one go. If this is
           large, and a consumer takes a while to process the events, the
           HBase rpc call will time out. -->
      <property>
        <name>replication.source.nb.capacity</name>
        <value>1000</value>
      </property>
    
  2. Restart the HBase services on all HBase cluster nodes.
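
After editing hbase-site.xml on several nodes, it can help to confirm that each copy carries the three replication settings listed above. The following is a minimal sketch, not part of the product: the function names get_site_property and check_replication_config are hypothetical, and the parser assumes that each <name> and <value> element sits on a single line, as in the snippet above.

```shell
# Print the <value> of one property from a Hadoop-style XML configuration file.
# Assumes each <name> and <value> element fits on a single line.
get_site_property() {
  awk -v name="$2" '
    /<name>/ {
      n = $0; sub(/.*<name>/, "", n); sub(/<\/name>.*/, "", n)
      found = (n == name)
    }
    found && /<value>/ {
      v = $0; sub(/.*<value>/, "", v); sub(/<\/value>.*/, "", v)
      print v; exit
    }
  ' "$1"
}

# Compare the three replication settings against the values used in this section.
# Returns non-zero and prints one line per property that differs or is missing.
check_replication_config() {
  site_file=$1
  status=0
  for pair in hbase.replication=true \
              replication.source.ratio=1.0 \
              replication.source.nb.capacity=1000; do
    prop=${pair%%=*}
    want=${pair#*=}
    got=$(get_site_property "$site_file" "$prop")
    if [ "$got" != "$want" ]; then
      echo "MISMATCH $prop: expected '$want', found '$got'"
      status=1
    fi
  done
  return "$status"
}
```

For example, after defining the functions in a shell session:

    $ check_replication_config /etc/hbase/conf/hbase-site.xml && echo "replication settings match"

The check compares values as literal strings, so a ratio written as 1 rather than 1.0 is reported as a mismatch even though HBase may treat the two the same way.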

Adding the Lily HBase Indexer Service in Cloudera Manager

In Cloudera Manager, the Lily HBase Indexer service is called Key-Value Store Indexer, and the service role is called Lily HBase Indexer. To add the service, follow the instructions in Adding a Service.

Pointing a Lily HBase NRT Indexer Service at an HBase Cluster

Before starting Lily HBase NRT Indexer services, you must configure individual services with the location of the ZooKeeper ensemble that is used by the target HBase cluster. In Cloudera Manager, this is handled automatically when you set the HBase service dependency (Key-Value Store Indexer service > Configuration).

For unmanaged environments:

  1. Add the following property to /etc/hbase-solr/conf/hbase-indexer-site.xml on the hosts you are using to run the Lily HBase Indexer service. Replace the ZooKeeper quorum with the value for hbase.zookeeper.quorum from /etc/hbase/conf/hbase-site.xml.

    Unlike other ZooKeeper quorum configuration properties, the hbase.zookeeper.quorum property does not include the ZooKeeper port or znode:

    <property>
       <name>hbase.zookeeper.quorum</name>
       <value>zk01.example.com,zk02.example.com,zk03.example.com</value>
    </property> 
  2. Configure all Lily HBase Indexers to use a ZooKeeper quorum to coordinate with one another. You can use the same ZooKeeper quorum as the HBase service. Add the following property to /etc/hbase-solr/conf/hbase-indexer-site.xml, and replace the hostnames with your ZooKeeper hostnames.

    For this configuration property, the ZooKeeper designation includes the port:

    <property>
       <name>hbaseindexer.zookeeper.connectstring</name>
       <value>zk01.example.com:2181,zk02.example.com:2181,zk03.example.com:2181</value>
    </property> 
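
Because the two properties take different formats (hostnames only for hbase.zookeeper.quorum, host:port pairs for hbaseindexer.zookeeper.connectstring), the second value can be derived from the first when both point at the same ensemble. The helper below is an illustrative sketch; the name quorum_to_connectstring is hypothetical, and the default port 2181 is taken from the example above rather than read from your ZooKeeper configuration.

```shell
# Build a host:port connect string from a port-less, comma-separated quorum.
# The default port 2181 matches the example above; pass another port as $2.
quorum_to_connectstring() {
  zk_port=${2:-2181}
  # Append the port after every comma-separated host, including the last one.
  printf '%s\n' "$1" | sed -e "s/,/:$zk_port,/g" -e "s/\$/:$zk_port/"
}
```

For example:

    $ quorum_to_connectstring zk01.example.com,zk02.example.com,zk03.example.com
    zk01.example.com:2181,zk02.example.com:2181,zk03.example.com:2181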

Configuring Lily HBase Indexer Security

Beginning with CDH 5.4, the Lily HBase Indexer includes an HTTP interface for the list-indexers, create-indexer, update-indexer, and delete-indexer commands. This interface can be secured with Kerberos for authentication and Apache Sentry policy files for authorization.

Configuring Lily HBase Indexer Service to Use Kerberos Authentication

To configure the Lily HBase Indexer to use Kerberos authentication, you must create principals and keytabs and then modify certain configuration properties. If you are using Cloudera Manager to manage your cluster, much of this is handled automatically. For unmanaged environments, you must generate the Kerberos principals and keytabs manually.

For an overview of Kerberos concepts (including principals and keytabs), see Kerberos Security Artifacts Overview.

To enable Kerberos authentication for the Lily HBase Indexer service:

Cloudera Manager:

  1. Go to Key-Value Store Indexer service > Configuration > Category > Security.
  2. Select the kerberos option for HBase Indexer Secure Authentication.
  3. Click Save Changes.
  4. Go to Administration > Security > Kerberos Credentials.
  5. Click Generate Missing Credentials.
  6. Restart the indexer service (Key-Value Store Indexer service > Actions > Restart).

Unmanaged:

  1. Create principals and keytabs. To complete these steps you must have Kerberos administrator credentials. Perform this procedure on each Lily HBase Indexer host:
    1. Launch the kadmin utility after authenticating with your Kerberos administrator credentials. For example:
      $ kinit jdoe/admin@EXAMPLE.COM
      Password for jdoe/admin@EXAMPLE.COM:
      $ kadmin
      Authenticating as principal jdoe/admin@EXAMPLE.COM with password.
      Password for jdoe/admin@EXAMPLE.COM:
      kadmin:  
    2. Create Lily HBase Indexer service user principals using the format hbase/lily01.example.com@EXAMPLE.COM. This principal is used to authenticate with the Hadoop cluster. Replace lily01.example.com with the Lily HBase Indexer host, and EXAMPLE.COM with your Kerberos realm name:
      kadmin: addprinc -randkey hbase/lily01.example.com@EXAMPLE.COM
    3. Create an HTTP service user principal using the format HTTP/lily01.example.com@EXAMPLE.COM. This principal is used to authenticate user requests coming to the Lily HBase Indexer web services. Replace lily01.example.com with the Lily HBase Indexer host, and EXAMPLE.COM with your Kerberos realm name:
      kadmin: addprinc -randkey HTTP/lily01.example.com@EXAMPLE.COM
    4. Create a keytab file containing both the HTTP/ and hbase/ principals. For example:
      kadmin: xst -norandkey -k /tmp/hbase.keytab hbase/lily01.example.com@EXAMPLE.COM \
      HTTP/lily01.example.com@EXAMPLE.COM
    5. Exit the kadmin utility by using the quit or exit command.
    6. Check the keytabs to make sure that both the HTTP/ and hbase/ principals are present, and that they have the same hostname component. For example:
      $ klist -ekt /tmp/hbase.keytab
    7. Copy the hbase.keytab file to the Lily HBase Indexer configuration directory. Make sure that the owner of the hbase.keytab file is the hbase user and the file has owner-only read permissions.
    8. Repeat this procedure on each Lily HBase Indexer host.
  2. Configure each Lily HBase Indexer host to use Kerberos authentication:
    1. Modify the hbase-indexer-site.xml file as follows:
        <property>
          <name>hbaseindexer.authentication.type</name>
          <value>kerberos</value>
        </property>
        <property>
          <name>hbaseindexer.authentication.kerberos.keytab</name>
          <value>hbase.keytab</value>
        </property>
        <property>
          <name>hbaseindexer.authentication.kerberos.principal</name>
          <value>HTTP/lily01.example.com@EXAMPLE.COM</value>
        </property>
        <property>
          <name>hbaseindexer.authentication.kerberos.name.rules</name>
          <value>DEFAULT</value>
        </property>
    2. Set up the Java Authentication and Authorization Service (JAAS) configuration file. Create a jaas.conf file in the configuration directory containing the following settings. Replace the values with the correct ones for your environment:
      Client {
        com.sun.security.auth.module.Krb5LoginModule required
        useKeyTab=true
        useTicketCache=false
        keyTab="/etc/hbase-solr/conf/hbase.keytab"
        principal="hbase/lily01.example.com@EXAMPLE.COM";
      };
    3. Modify hbase-indexer-env.sh in the configuration directory to add the JAAS configuration to the system properties. You can do this by adding -Djava.security.auth.login.config to the HBASE_INDEXER_OPTS. For example:
      HBASE_INDEXER_OPTS="$HBASE_INDEXER_OPTS -Djava.security.auth.login.config=/etc/hbase-solr/conf/jaas.conf"
    4. Make these changes on each Lily HBase Indexer host.
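
The keytab check in the procedure above (both the HTTP/ and hbase/ principals present, with the same hostname component) can be scripted when there are many indexer hosts. The following is a sketch under stated assumptions: the function name check_keytab_principals is hypothetical, and it only scans the klist listing for principal names, so it does not depend on the exact column layout of the klist output.

```shell
# Read a keytab listing on stdin and succeed only if hbase/ and HTTP/
# principals are both present and refer to the same set of hostnames.
check_keytab_principals() {
  listing=$(cat)
  # Extract the hostname component of every hbase/<host>@ principal.
  hbase_hosts=$(printf '%s\n' "$listing" | grep -o 'hbase/[^@ ]*@' \
                  | sed -e 's|^hbase/||' -e 's|@$||' | sort -u)
  # Extract the hostname component of every HTTP/<host>@ principal.
  http_hosts=$(printf '%s\n' "$listing" | grep -o 'HTTP/[^@ ]*@' \
                  | sed -e 's|^HTTP/||' -e 's|@$||' | sort -u)
  [ -n "$hbase_hosts" ] && [ "$hbase_hosts" = "$http_hosts" ]
}
```

For example:

    $ klist -ekt /tmp/hbase.keytab | check_keytab_principals && echo "keytab looks consistent"

The function does not verify encryption types or key version numbers; those still need to be reviewed in the klist output itself.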

Configuring Lily HBase Indexer Service to Use Sentry Authorization

The Lily HBase Indexer service uses a file-based access control model built on Apache Sentry policy files. If you want to use Sentry for authorization, you must use the indexer HTTP interface.

The Lily HBase Indexer privilege model specifies READ and WRITE privileges for each indexer. The privileges work as follows:

  • If a role has WRITE privilege for indexer1, that role can create, update, or delete indexer1.
  • If a role has READ privilege for indexer1, that role can run the list-indexers command, which will list indexer1 if it exists. If an indexer called indexer2 exists, but the role does not have READ privileges for it, information about indexer2 is filtered out of the response.

For example, see the following sentry-provider.ini policy file:

[groups]
jdoe = admin
psherman = readonly

[roles]
admin = indexer=*
readonly = indexer=*->action=READ

This policy file grants the jdoe user full access to all indexers, and the psherman user read access to all indexers. User psherman can see all indexers, but cannot create new ones or modify existing ones.
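
When a policy file grows beyond a few entries, it can be useful to list which roles and privileges a given [groups] entry resolves to before uploading the file. The sketch below only follows the two-section layout of the example policy file; it is not Sentry's own evaluation logic, and the function name privileges_for_group is hypothetical.

```shell
# List the roles and privileges a policy file assigns to one [groups] entry.
# This only mirrors the file layout shown above; it is not Sentry's evaluator.
privileges_for_group() {
  awk -v group="$2" '
    /^\[/ { section = $0; next }          # remember the current section header
    {
      eq = index($0, "=")                 # split on the first "=" only
      if (eq == 0) next
      key = substr($0, 1, eq - 1); val = substr($0, eq + 1)
      gsub(/^[ \t]+|[ \t]+$/, "", key); gsub(/^[ \t]+|[ \t]+$/, "", val)
    }
    section == "[groups]" && key == group { roles = val }
    section == "[roles]" { privs[key] = val }
    END {
      # A group may map to several comma-separated roles.
      n = split(roles, r, /[ \t]*,[ \t]*/)
      for (i = 1; i <= n; i++)
        if (r[i] in privs) print r[i] ": " privs[r[i]]
    }
  ' "$1"
}
```

Run against the example policy file, this gives:

    $ privileges_for_group sentry-provider.ini psherman
    readonly: indexer=*->action=READ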

To configure Sentry for the Lily HBase Indexer:

Cloudera Manager:

  1. Go to Key-Value Store Indexer service > Configuration > Category > Policy File Based Sentry.
  2. Check the box labeled Enable Sentry Authorization using Policy Files.
  3. If necessary, edit Sentry Global Policy File to change the HDFS location of the sentry-provider.ini file.
  4. Click Save Changes.
  5. Restart the service (Key-Value Store Indexer service > Actions > Restart).
  6. Upload the sentry-provider.ini file to the specified location in HDFS. For example:
    • Security Enabled:
      $ kinit hdfs@EXAMPLE.COM
      $ hdfs dfs -mkdir -p /user/hbaseindexer/sentry/
      $ hdfs dfs -put /path/to/local/sentry-provider.ini /user/hbaseindexer/sentry/
      $ hdfs dfs -chown -R hbase:hbase /user/hbaseindexer
    • Security Disabled:
      $ sudo -u hdfs hdfs dfs -mkdir -p /user/hbaseindexer/sentry/
      $ sudo -u hdfs hdfs dfs -put /path/to/local/sentry-provider.ini /user/hbaseindexer/sentry/
      $ sudo -u hdfs hdfs dfs -chown -R hbase:hbase /user/hbaseindexer

Unmanaged:

  1. Add the following properties to hbase-indexer-site.xml on each Lily HBase Indexer host:
      <property>
        <name>sentry.hbaseindexer.sentry.site</name>
        <value>/path/to/sentry-site.xml</value>
      </property>
      <property>
        <name>hbaseindexer.rest.resource.package</name>
        <value>org/apache/sentry/binding/hbaseindexer/rest</value>
      </property>
  2. Edit the referenced sentry-site.xml file to set the HDFS location of the sentry-provider.ini policy file:
      <property>
        <name>sentry.hbaseindexer.provider.resource</name>
        <value>/user/hbaseindexer/sentry/sentry-provider.ini</value>
      </property>
    
  3. Upload the sentry-provider.ini file to the specified location in HDFS. For example:
    • Security Enabled:
      $ kinit hdfs@EXAMPLE.COM
      $ hdfs dfs -mkdir -p /user/hbaseindexer/sentry/
      $ hdfs dfs -put /path/to/local/sentry-provider.ini /user/hbaseindexer/sentry/
      $ hdfs dfs -chown -R hbase:hbase /user/hbaseindexer
    • Security Disabled:
      $ sudo -u hdfs hdfs dfs -mkdir -p /user/hbaseindexer/sentry/
      $ sudo -u hdfs hdfs dfs -put /path/to/local/sentry-provider.ini /user/hbaseindexer/sentry/
      $ sudo -u hdfs hdfs dfs -chown -R hbase:hbase /user/hbaseindexer
  4. Restart the Lily HBase Indexer service on each host:
    $ sudo service hbase-solr-indexer restart

Starting the Lily HBase NRT Indexer Service

You can use Cloudera Manager to start the Lily HBase Indexer Service (Key-Value Store Indexer service > Actions > Start). In unmanaged deployments, you can start or restart a Lily HBase Indexer Daemon manually on a host using the following command:

$ sudo service hbase-solr-indexer restart

After starting the Lily HBase NRT Indexer Services, verify that all daemons are running using the jps tool from the Oracle JDK, which you can obtain from the Java SE Downloads page. For example:

$ sudo jps -lm
31407 sun.tools.jps.Jps -lm
26393 com.ngdata.hbaseindexer.Main
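
For scripted health checks, the same verification can be reduced to a test for the indexer's main class in the jps output. This is a minimal sketch; the function name indexer_is_running is hypothetical, and the class name is the one shown in the example output above.

```shell
# Succeed if jps-style output on stdin lists the Lily HBase Indexer main class.
indexer_is_running() {
  grep -q 'com\.ngdata\.hbaseindexer\.Main'
}
```

For example:

    $ sudo jps -lm | indexer_is_running && echo "indexer daemon is running"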

Once the service is running, you can create and manage indexers. Continue to Using the Lily HBase NRT Indexer Service.