Indexing Data Using Cloudera Search

There are generally two approaches to indexing data using Cloudera Search:

Near real time (NRT) indexing
Batch indexing

Near real time indexing is generally used when new data needs to be returned in query results in time frames measured in seconds, whereas batch indexing is useful for situations where large amounts of data is indexed at regular intervals, or for indexing a new dataset for the first time.

Near real time indexing generally uses a framework such as Apache Flume or Apache Kafka to continuously ingest and index data. The Lily HBase Indexer can also be used for NRT indexing on Apache HBase tables.

Batch indexing usually relies on MapReduce/YARN jobs to periodically index large datasets. The Lily HBase Indexer can also be used for batch indexing HBase tables.

For more information, see:

Near Real Time Indexing
Batch Indexing

Example Morphline Usage

Near Real Time Indexing