This is the documentation for CDH 5.0.x. Documentation for other versions is available at Cloudera Documentation.

Using the Lily HBase Batch Indexer for Indexing

Cloudera Search provides the ability to batch index HBase tables using MapReduce jobs. Such batch indexing does not require:
  • The HBase replication feature
  • The Lily HBase Indexer Service
  • Registering a Lily HBase Indexer configuration with the Lily HBase Indexer Service
The indexer supports flexible custom application-specific rules to extract, transform, and load HBase data into Solr. Solr search results can contain columnFamily:qualifier links back to the data stored in HBase. This way, applications can use the Search result set to directly access matching raw HBase cells.

Batch indexing column families of tables in an HBase cluster requires:

  • Populating an HBase table
  • Creating a corresponding SolrCloud collection
  • Creating a Lily HBase Indexer configuration
  • Creating a Morphline configuration file
  • Understanding the extractHBaseCells morphline command
  • Running HBaseMapReduceIndexerTool

Populating an HBase table

After configuring and starting your system, create an HBase table and add rows to it. For example:

$ hbase shell

hbase(main):001:0> create 'record', {NAME => 'data'}
hbase(main):002:0> put 'record', 'row1', 'data', 'value'
hbase(main):003:0> put 'record', 'row2', 'data', 'value2'

Creating a corresponding SolrCloud collection

A SolrCloud collection used for HBase indexing must have a Solr schema that accommodates the types of HBase column families and qualifiers that are being indexed. To begin, consider adding the all-inclusive data field to a default schema. Once you decide on a schema, create a SolrCloud collection using a command of the form:

$ solrctl instancedir --generate $HOME/hbase-collection1
$ edit $HOME/hbase-collection1/conf/schema.xml
$ solrctl instancedir --create hbase-collection1 $HOME/hbase-collection1
$ solrctl collection --create hbase-collection1
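For example, the all-inclusive data field might be declared in schema.xml as follows. This is a sketch only; the text_general field type is an assumption based on the schema that solrctl instancedir --generate produces by default:

```xml
<!-- Hypothetical example: a multiValued field that can hold all indexed
     HBase cell values. The text_general type is assumed to exist in the
     default generated schema. -->
<field name="data" type="text_general" indexed="true" stored="true" multiValued="true"/>
```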

Creating a Lily HBase Indexer configuration

Individual Lily HBase Indexers are configured using the hbase-indexer command line utility. Typically, there is one Lily HBase Indexer configuration for each HBase table, but you can define as many Lily HBase Indexer configurations as there are combinations of tables, column families, and corresponding SolrCloud collections. Each Lily HBase Indexer configuration is defined in an XML file such as morphline-hbase-mapper.xml.

To start, an indexer configuration XML file must refer to the MorphlineResultToSolrMapper implementation and also point to the location of a Morphline configuration file, as shown in the following example morphline-hbase-mapper.xml indexer configuration file:

$ cat $HOME/morphline-hbase-mapper.xml

<?xml version="1.0"?>
<indexer table="record" mapper="com.ngdata.hbaseindexer.morphline.MorphlineResultToSolrMapper">

   <!-- The relative or absolute path on the local file system to the morphline configuration file. -->
   <!-- Use relative path "morphlines.conf" for morphlines managed by Cloudera Manager -->
   <param name="morphlineFile" value="/etc/hbase-solr/conf/morphlines.conf"/>

   <!-- The optional morphlineId identifies a morphline if there are multiple morphlines in morphlines.conf -->
   <!-- <param name="morphlineId" value="morphline1"/> -->

</indexer>

The Lily HBase Indexer configuration file also supports the standard attributes of any Lily HBase Indexer on the top-level <indexer> element: table, mapping-type, read-row, unique-key-formatter, unique-key-field, row-field, and column-family-field. It does not support the <field> or <extract> elements.

Creating a Morphline Configuration File

After creating an indexer configuration XML file, control its behavior by configuring Morphline ETL transformation commands in a morphlines.conf configuration file. The morphlines.conf configuration file can contain any number of morphline commands. Typically, the first such command is an extractHBaseCells command. The readAvroContainer or readAvro morphline commands are often used to extract Avro data from the HBase byte array. This configuration file can be shared among different applications that use Morphlines.
  Note: The following example uses the Kite SDK, which applies to Search for CDH 5 Beta 2 and later. morphlines.conf files used with Search 1.3 and earlier or Cloudera Search for CDH 5 Beta 1, which use the CDK, require different importCommands.

For the following morphlines.conf file to work with the CDK, replace importCommands : ["org.kitesdk.morphline.**", "com.ngdata.**"] with importCommands : ["com.cloudera.cdk.morphline.**", "com.ngdata.**"].

$ cat /etc/hbase-solr/conf/morphlines.conf

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.morphline.**", "com.ngdata.**"]

    commands : [
      {
        extractHBaseCells {
          mappings : [
            {
              inputColumn : "data:*"
              outputField : "data"
              type : string
              source : value
            }

            #{
            #  inputColumn : "data:item"
            #  outputField : "_attachment_body"
            #  type : "byte[]"
            #  source : value
            #}
          ]
        }
      }

      #for avro use with type : "byte[]" in extractHBaseCells mapping above
      #{ readAvroContainer {} }
      #{
      #  extractAvroPaths {
      #    paths : {
      #      data : /user_name
      #    }
      #  }
      #}

      { logTrace { format : "output record: {}", args : ["@{}"] } }
    ]
  }
]
  Note: For proper functioning, the morphline must not contain a loadSolr command. The enclosing Lily HBase Indexer must load documents into Solr, rather than the morphline itself.
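As a rough mental model (not the real Lily or Kite implementation), the "data:*" mapping above takes each cell in the data column family of a row and copies its value, decoded as a string, into the multivalued data output field. A minimal Python sketch of that behavior, applied to the rows from the earlier hbase shell example:

```python
# Hypothetical simulation of the "data:*" -> "data" mapping above.
# The real extraction is performed by the Lily HBase Indexer; this
# function only models the mapping semantics for illustration.

def extract_hbase_cells(row_cells, input_family, output_field):
    """Map every cell value in the given column family to one output field."""
    record = {}
    for (family, qualifier), value in row_cells.items():
        if family == input_family:
            # type : string -> decode the HBase byte array as UTF-8 text
            record.setdefault(output_field, []).append(value.decode("utf-8"))
    return record

# Cells as stored by the earlier puts (HBase stores everything as bytes).
row1 = {("data", ""): b"value"}
row2 = {("data", ""): b"value2"}

print(extract_hbase_cells(row1, "data", "data"))  # {'data': ['value']}
print(extract_hbase_cells(row2, "data", "data"))  # {'data': ['value2']}
```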

Understanding the extractHBaseCells morphline command

  • The extractHBaseCells morphline command extracts cells from an HBase Result and transforms the values into a SolrInputDocument. The command consists of an array of zero or more mapping specifications.
  • Each mapping has:
    • The inputColumn parameter, which specifies the data to be used from HBase for populating a field in Solr. It takes the form of a column family name and qualifier, separated by a colon. The qualifier portion can end in an asterisk, which is interpreted as a wildcard. In this case, all matching column-family and qualifier expressions are used. The following are examples of valid inputColumn values:
      • mycolumnfamily:myqualifier
      • mycolumnfamily:my*
      • mycolumnfamily:*
    • The outputField parameter specifies the morphline record field to which to add output values. The morphline record field is also known as the Solr document field. Example: "first_name".
    • Dynamic output fields are enabled by the outputField parameter ending with a * wildcard. For example:
      inputColumn : "m:e:*"
      outputField : "belongs_to_*"
      In this case, if you make these puts in HBase:
      put 'table_name' , 'row1' , 'm:e:1' , 'foo'
      put 'table_name' , 'row1' , 'm:e:9' , 'bar'
      Then the fields of the Solr document are as follows:
      belongs_to_1 : foo
      belongs_to_9 : bar
    • The type parameter defines the datatype of the content in HBase. All input data is stored in HBase as byte arrays, but all content in Solr is indexed as text, so a method for converting from byte arrays to the actual datatype is required. The type parameter can be the name of a type that is supported by org.apache.hadoop.hbase.util.Bytes.toXXX (currently: "byte[]", "int", "long", "string", "boolean", "float", "double", "short", "bigdecimal"). Use type "byte[]" to pass the byte array through to the morphline without any conversion.
      • type:byte[] copies the byte array unmodified into the record output field
      • type:int converts with org.apache.hadoop.hbase.util.Bytes.toInt
      • type:long converts with org.apache.hadoop.hbase.util.Bytes.toLong
      • type:string converts with org.apache.hadoop.hbase.util.Bytes.toString
      • type:boolean converts with org.apache.hadoop.hbase.util.Bytes.toBoolean
      • type:float converts with org.apache.hadoop.hbase.util.Bytes.toFloat
      • type:double converts with org.apache.hadoop.hbase.util.Bytes.toDouble
      • type:short converts with org.apache.hadoop.hbase.util.Bytes.toShort
      • type:bigdecimal converts with org.apache.hadoop.hbase.util.Bytes.toBigDecimal
      Alternatively, the type parameter can be the name of a Java class that implements the com.ngdata.hbaseindexer.parse.ByteArrayValueMapper interface.
    • The source parameter determines what portion of an HBase KeyValue is used as indexing input. Valid choices are "value" or "qualifier". When "value" is specified, then the HBase cell value is used as input for indexing. When "qualifier" is specified, then the HBase column qualifier is used as input for indexing. The default is "value".
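The type conversions and dynamic output fields described above can be illustrated with a small Python sketch. This is a hand-written approximation of what Bytes.toInt and the wildcard mapping do, not the actual HBase or Lily code, and the helper names are made up for illustration:

```python
import struct

# Approximation of org.apache.hadoop.hbase.util.Bytes: HBase stores
# numeric types as fixed-width big-endian byte arrays.
def bytes_to_int(b):
    return struct.unpack(">i", b)[0]   # like Bytes.toInt (4 bytes)

def bytes_to_long(b):
    return struct.unpack(">q", b)[0]   # like Bytes.toLong (8 bytes)

# Dynamic output field: an outputField ending in "*" has the wildcard
# replaced by whatever the inputColumn wildcard matched in the qualifier.
def dynamic_field(input_column, output_field, qualifier):
    prefix = input_column.split(":", 1)[1].rstrip("*")  # e.g. "e:" from "m:e:*"
    suffix = qualifier[len(prefix):]                    # the part "*" matched
    return output_field.rstrip("*") + suffix

print(bytes_to_int(b"\x00\x00\x00\x2a"))              # 42
print(dynamic_field("m:e:*", "belongs_to_*", "e:1"))  # belongs_to_1
print(dynamic_field("m:e:*", "belongs_to_*", "e:9"))  # belongs_to_9
```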

Running HBaseMapReduceIndexerTool

Run the HBaseMapReduceIndexerTool to index the HBase table using a MapReduce job, as follows:

hadoop --config /etc/hadoop/conf jar \
/usr/lib/hbase-solr/tools/hbase-indexer-mr-*-job.jar --conf \
/etc/hbase/conf/hbase-site.xml -D 'mapred.child.java.opts=-Xmx500m' \
--hbase-indexer-file $HOME/morphline-hbase-mapper.xml --zk-host \
127.0.0.1/solr --collection hbase-collection1 --go-live --log4j \
src/test/resources/log4j.properties
  Note: For development purposes, use the --dry-run option to run in local mode and print documents to stdout, instead of loading them to Solr. Using this option causes the morphline to execute in the client process without submitting a job to MapReduce. Executing in the client process provides quicker turnaround during early trial and debug sessions.
  Note: To print diagnostic information, such as the content of records as they pass through the morphline commands, consider enabling TRACE log level. For example, you can enable TRACE log level diagnostics by adding the following to your log4j.properties file.
log4j.logger.org.kitesdk.morphline=TRACE
log4j.logger.com.ngdata=TRACE
The log4j.properties file can be passed via the --log4j command line option.
Page generated September 3, 2015.