Near Real Time Indexing Using Flume

Apache Flume with the MorphlinesSolrSink is a flexible, scalable, fault-tolerant, transactional, near real-time (NRT) system for processing a continuous stream of records into live search indexes. Latency from the time data arrives to the time it appears in search query results is typically measured in seconds, and is tunable.

Data flows from Flume sources across the network to Flume sinks (in this case, MorphlinesSolrSink). The sinks extract the relevant data, transform it, and load it into a set of live Solr servers, which in turn serve queries to end users or search applications.
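To make the flow concrete, a minimal Flume agent configuration along the following lines wires a file-spooling source and a memory channel to the MorphlinesSolrSink. The agent name, directory paths, and morphline file location are placeholders, not defaults.

  agent.sources  = spoolSrc
  agent.channels = memCh
  agent.sinks    = solrSink

  # Watch a local directory for new files (example path only)
  agent.sources.spoolSrc.type     = spooldir
  agent.sources.spoolSrc.spoolDir = /var/spool/flume/input
  agent.sources.spoolSrc.channels = memCh

  # Buffer events in memory between source and sink
  agent.channels.memCh.type     = memory
  agent.channels.memCh.capacity = 10000

  # MorphlineSolrSink runs the morphline against each event and loads the result into Solr
  agent.sinks.solrSink.type          = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
  agent.sinks.solrSink.channel       = memCh
  agent.sinks.solrSink.morphlineFile = /etc/flume-ng/conf/morphline.conf
  agent.sinks.solrSink.morphlineId   = morphline1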

The ETL functionality is flexible and customizable, using chains of morphline commands that pipe records from one transformation command to another. Commands to parse and transform a set of standard data formats such as Avro, CSV, text, HTML, XML, PDF, Word, or Excel are provided out of the box. You can add custom commands and parsers as morphline plugins for other file or data formats by implementing a simple Java interface that consumes a record (such as a file) as an InputStream plus some headers and contextual metadata, and then generates and outputs a new record. Any kind of data format can be indexed, Solr documents for any kind of Solr schema can be generated, and any custom ETL logic can be registered and run. For more information, see the Morphlines Reference Guide.
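As a sketch of what such a command chain can look like, the following morphline reads each Flume event as a line of text, assigns a unique id, drops fields that the Solr schema does not define, and loads the resulting record into Solr. The collection name and ZooKeeper address are example values only.

  # Example Solr location; substitute your collection and ZooKeeper ensemble
  SOLR_LOCATOR : {
    collection : collection1
    zkHost : "zk01.example.com:2181/solr"
  }

  morphlines : [
    {
      id : morphline1
      importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
      commands : [
        { readLine { charset : UTF-8 } }                                 # parse each event body as one line of text
        { generateUUID { field : id } }                                  # give every record a unique Solr id
        { sanitizeUnknownSolrFields { solrLocator : ${SOLR_LOCATOR} } }  # drop fields not present in the Solr schema
        { loadSolr { solrLocator : ${SOLR_LOCATOR} } }                   # send the record to Solr
      ]
    }
  ]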

Routing to multiple Solr collections enables multi-tenancy, and routing to a SolrCloud cluster improves scalability. Flume agents can be co-located with live Solr servers that serve end-user queries, or deployed on separate industry-standard hardware to improve scalability and reliability. Indexing load can be spread across a large number of Flume agents, and Flume features such as the Load Balancing Sink Processor can help improve scalability and achieve high availability.
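For example, spreading indexing load across two MorphlineSolrSink instances with the Load Balancing Sink Processor can be configured roughly as follows; the sink names are assumptions, and each sink is defined separately in the same way as the single-sink example above.

  agent.sinks = solrSink1 solrSink2
  agent.sinkgroups = solrGroup
  agent.sinkgroups.solrGroup.sinks = solrSink1 solrSink2
  agent.sinkgroups.solrGroup.processor.type = load_balance
  agent.sinkgroups.solrGroup.processor.selector = round_robin
  agent.sinkgroups.solrGroup.processor.backoff = true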

Flume indexing enables low-latency data acquisition and querying. It complements (rather than replaces) use cases based on batch analysis of HDFS data using MapReduce. In many deployments, data flows simultaneously from the producer through Flume into both Solr and HDFS, using features such as the Replicating Channel Selector to duplicate an incoming flow into two output flows. You can use near real-time ingestion as well as batch analysis tools.
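A minimal sketch of that fan-out, assuming one channel feeds the MorphlineSolrSink and a second channel feeds an HDFS sink, might look like this; the replicating selector is Flume's default, so the selector line is shown only for clarity, and the HDFS sink details are elided.

  agent.channels = solrCh hdfsCh
  agent.sources.spoolSrc.channels = solrCh hdfsCh
  agent.sources.spoolSrc.selector.type = replicating   # copy every incoming event into both channels

  agent.sinks.solrSink.channel = solrCh   # near real-time indexing path
  agent.sinks.hdfsSink.type    = hdfs     # batch analysis path (remaining HDFS sink settings elided)
  agent.sinks.hdfsSink.channel = hdfsCh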

For a more comprehensive discussion of the Flume Architecture, see Large Scale Data Ingestion using Flume.

Flume is included in CDH. If you are using Cloudera Manager, add the Flume service using the Add a Service wizard. For unmanaged environments, see Setting Up Apache Flume Using the Command Line.

For exercises that demonstrate using Flume with Solr for NRT indexing, see the Cloudera Search Tutorial.