This is the documentation for Cloudera Search CDH 5 Beta 2 and 1.2.0 for CDH 4.
Documentation for other versions is available at Cloudera Documentation.

Introducing Cloudera Search

Cloudera Search is one of Cloudera's near-real-time access products. Cloudera Search enables non-technical users to search and explore data stored in or ingested into Hadoop and HBase. Users do not need SQL or programming skills to use Cloudera Search because it provides a simple, full-text interface for searching.

Another benefit of Cloudera Search, compared to stand-alone search solutions, is its fully integrated data processing platform. Search uses the flexible, scalable, and robust storage system included with CDH, which eliminates the need to move large data sets across infrastructures to address business tasks.

Cloudera Search incorporates Apache Solr, which includes Apache Lucene, SolrCloud, Apache Tika, and Solr Cell. Cloudera Search 1.x is tightly integrated with Cloudera's Distribution Including Apache Hadoop (CDH) and is included with CDH 5. Cloudera Search provides these key capabilities:

  • Near-real-time indexing
  • Batch indexing
  • Simple, full-text data exploration and navigated drill down

Using Search with the CDH infrastructure provides:

  • Simplified infrastructure
  • Better production visibility
  • Quicker insights across various data types
  • Quicker problem resolution
  • Simplified interaction with the ability to open the platform to more users and use cases
  • Scalability, flexibility, and reliability of search services on the same platform where you can run other types of workloads on the same data

How Cloudera Search Works

In a near-real-time indexing use case, Cloudera Search indexes events that are streamed through Apache Flume on their way into storage in CDH. Fields and events are mapped to standard Solr indexable schemas. Lucene indexes the events, and the integration through Cloudera Search allows the index to be written directly to, and stored in, standard Lucene index files in HDFS. Flume’s capabilities to route events and to store data in partitions in HDFS can also be applied. Events can be routed and streamed through multiple Flume agents and written to separate Lucene indexers that write into separate index shards, for better scale when indexing and quicker responses when searching.

The indexes are loaded from HDFS to Solr cores, exactly as Solr would read them from local disk. The difference in the design of Cloudera Search is the robust, distributed, and scalable storage layer of HDFS, which helps eliminate costly downtime and allows for flexibility across workloads without having to move data. Search queries can then be submitted to Solr through either the standard Solr API or a simple search GUI application, included in Cloudera Search, which can easily be deployed in Hue.
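
As a rough illustration of the standard Solr API route mentioned above, the following Python sketch builds a request URL for Solr's /select handler. The host name, port, collection name (logs), and field name (message) are assumptions made up for this example and do not come from this documentation; q, rows, and wt are standard Solr query parameters.

```python
from urllib.parse import urlencode

def build_select_url(base_url, collection, query, rows=10):
    """Build a URL for Solr's standard /select request handler."""
    params = {
        "q": query,    # Solr query string, for example field:term
        "rows": rows,  # maximum number of documents to return
        "wt": "json",  # ask Solr for a JSON response
    }
    return f"{base_url}/{collection}/select?{urlencode(params)}"

# Hypothetical host and collection; substitute the values for your deployment.
url = build_select_url(
    "http://search-host.example.com:8983/solr", "logs", "message:timeout"
)
print(url)
```

Sending this URL with any HTTP client returns matching documents as JSON, provided a Solr server is actually listening at that address.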

Cloudera Search batch-oriented indexing capabilities address the need to search across batch-uploaded files or large data sets that are updated less frequently and have less need for near-real-time indexing. For such cases, Cloudera Search includes a highly scalable indexing workflow based on MapReduce. A MapReduce workflow is launched on specified files or folders in HDFS, and field extraction and Solr schema mapping are executed during the mapping phase. Reducers use Solr to write the data as a single index or as index shards, depending on your configuration and preferences. Once the indexes are stored in HDFS, they can be queried using standard Solr mechanisms, as described above for the near-real-time indexing use case.
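
The division of labor between mappers and reducers can be sketched in a few lines of Python. This is a toy in-memory simulation, not the MapReduce job that ships with Cloudera Search: the shard count, the two-field schema, the file paths, and the function names are all hypothetical, and a plain dictionary stands in for the Lucene index that Solr would write.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 2  # hypothetical number of index shards

def map_phase(path, raw_text):
    """Mapper: extract fields from one file and map them to a simple schema."""
    doc = {"id": path, "text": raw_text.strip().lower()}
    # A stable hash of the document id decides which shard receives it.
    shard = zlib.crc32(path.encode("utf-8")) % NUM_SHARDS
    return shard, doc

def reduce_phase(docs):
    """Reducer: build one inverted index (term -> sorted doc ids) for a shard."""
    index = defaultdict(set)
    for doc in docs:
        for term in doc["text"].split():
            index[term].add(doc["id"])
    return {term: sorted(ids) for term, ids in index.items()}

def run_batch_index(files):
    """Group mapper output by shard, then reduce each group into an index."""
    grouped = defaultdict(list)
    for path, raw_text in files.items():
        shard, doc = map_phase(path, raw_text)
        grouped[shard].append(doc)
    return {shard: reduce_phase(docs) for shard, docs in grouped.items()}

# Hypothetical input files, keyed by path.
files = {
    "/data/a.txt": "Disk error on node1",
    "/data/b.txt": "Network error",
}
shards = run_batch_index(files)
for shard_id, index in sorted(shards.items()):
    print(shard_id, index)
```

Setting NUM_SHARDS to 1 corresponds to the single-index case; larger values correspond to writing separate index shards.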

The Lily HBase Indexer Service is a flexible, scalable, fault-tolerant, transactional, near-real-time (NRT) system for processing a continuous stream of HBase cell updates into live search indexes. Typically, the time between data ingestion using the Flume sink and that content potentially appearing in search results is on the order of seconds, though this duration is tunable. The Lily HBase Indexer uses Solr to index data stored in HBase. As HBase applies inserts, updates, and deletes to HBase table cells, the indexer keeps Solr consistent with the HBase table contents, using standard HBase replication features. The indexer supports flexible, custom, application-specific rules to extract, transform, and load HBase data into Solr. Solr search results can contain columnFamily:qualifier links back to the data stored in HBase, so applications can use the search result set to directly access matching raw HBase cells. Indexing and searching do not affect the operational stability or write throughput of HBase because the indexing and searching processes are separate from, and asynchronous to, HBase.
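
The idea of keeping an index consistent with a stream of cell mutations can be sketched as follows. This Python example is only a conceptual model, not the Lily HBase Indexer: the event dictionaries, their keys, and the function names are hypothetical, and an ordinary dictionary stands in for the Solr index.

```python
def apply_event(index, event):
    """Apply one replicated cell mutation to an in-memory stand-in for an index."""
    doc = index.setdefault(event["row"], {})
    column = f'{event["family"]}:{event["qualifier"]}'
    if event["type"] == "put":       # insert or update of a cell
        doc[column] = event["value"]
    elif event["type"] == "delete":  # removal of a cell
        doc.pop(column, None)
    if not doc:                      # no cells left: drop the document
        del index[event["row"]]

def search(index, term):
    """Return (row key, columnFamily:qualifier) links for cells containing term."""
    return sorted(
        (row, column)
        for row, doc in index.items()
        for column, value in doc.items()
        if term in value
    )

index = {}
apply_event(index, {"type": "put", "row": "r1", "family": "info",
                    "qualifier": "title", "value": "hadoop search"})
apply_event(index, {"type": "put", "row": "r2", "family": "info",
                    "qualifier": "title", "value": "hbase tables"})
print(search(index, "search"))
```

Each hit carries the row key and the columnFamily:qualifier name, which is the information an application would need to fetch the matching raw cell from HBase.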

Cloudera Search Features

This section contains information about current Cloudera Search features.

Unified Management and Monitoring with Cloudera Manager

Cloudera Manager provides a unified and centralized management and monitoring experience for both CDH and Cloudera Search. Cloudera Manager simplifies deployment, configuration, and monitoring of your search services. This differs from many existing search solutions that lack management and monitoring capabilities and that fail to provide deep insight into utilization, system health, trending, and various other supportability aspects.

Index Storage in HDFS

Cloudera Search is integrated with HDFS for index storage. Indexes created by Solr/Lucene can be written directly to HDFS alongside the data, instead of to local disk, thereby providing fault tolerance and redundancy.

Cloudera has optimized Cloudera Search for fast read and write of indexes in HDFS while indexes are served and queried through standard Solr mechanisms. Also, because data and indexes are co-located, once data is found, processing does not require transport or separately managed storage.

Batch Index Creation through MapReduce

To facilitate index creation for large sets of data, Cloudera Search has built-in MapReduce jobs for indexing data stored in HDFS. As a result, the linear scalability of MapReduce is applied to the indexing pipeline.

Real-time and Scalable Indexing at Data Ingest

Cloudera Search provides integration with Flume to support near-real-time indexing. As new events pass through a Flume hierarchy and are written to HDFS, those same events can be written directly to Cloudera Search indexers.

In addition, Flume supports routing, filtering, and annotating events on their way into CDH. These features work with Cloudera Search for improved index sharding, index separation, and document-level access control.
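
To make routing, filtering, and annotation concrete, here is a small Python model of events moving toward separate index shards. It is a sketch under stated assumptions, not Flume code: Flume events do consist of headers and a body, but the header names (source, level, agent), the shard names, and all function names below are hypothetical.

```python
def keep(event):
    """Filter: drop debug-level events before they reach an indexer."""
    return event["headers"].get("level") != "DEBUG"

def annotate(event, agent_name):
    """Annotate: return a copy of the event with an extra header added."""
    headers = dict(event["headers"], agent=agent_name)
    return {"headers": headers, "body": event["body"]}

def route(event, shard_by_source, default_shard="shard-default"):
    """Route: pick a target index shard from the event's 'source' header."""
    return shard_by_source.get(event["headers"].get("source"), default_shard)

def process(events, shard_by_source, agent_name="agent-1"):
    """Filter, annotate, and route a batch; return events grouped by shard."""
    routed = {}
    for event in events:
        if not keep(event):
            continue
        event = annotate(event, agent_name)
        routed.setdefault(route(event, shard_by_source), []).append(event)
    return routed

events = [
    {"headers": {"source": "web", "level": "INFO"}, "body": "GET /index.html"},
    {"headers": {"source": "web", "level": "DEBUG"}, "body": "cache miss"},
    {"headers": {"source": "db", "level": "ERROR"}, "body": "connection lost"},
]
routed = process(events, {"web": "shard-web", "db": "shard-db"})
for shard, batch in sorted(routed.items()):
    print(shard, [e["body"] for e in batch])
```

An annotation such as the added agent header is the kind of metadata that could later drive index separation or access decisions; how that is actually configured depends on your Flume and Solr setup.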

Easy Interaction and Data Exploration through Hue

A Cloudera Search GUI is provided as a Hue plug-in, enabling users to interactively query data, view result files, and do faceted exploration. Hue can also schedule standing queries and explore index files. This GUI uses the Cloudera Search API, which is based on the standard Solr API.
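
Faceted exploration of this kind maps onto standard Solr request parameters. The Python sketch below assembles such a parameter string; facet, facet.field, and fq are standard Solr parameters, while the field names (host, level) and the helper function are assumptions for illustration and say nothing about how the Hue application builds its requests internally.

```python
from urllib.parse import urlencode

def build_facet_params(query, facet_fields, filters=()):
    """Assemble a query string for a faceted Solr search."""
    params = [("q", query), ("wt", "json"), ("facet", "true")]
    # One facet.field entry per field whose value counts are wanted.
    params += [("facet.field", field) for field in facet_fields]
    # Each fq narrows the result set, as when a user clicks a facet value.
    params += [("fq", fq) for fq in filters]
    return urlencode(params)

# Hypothetical fields: count results per host and per log level,
# restricted to documents whose level is ERROR.
print(build_facet_params("timeout", ["host", "level"], ["level:ERROR"]))
```

Appending the resulting string to a collection's /select URL asks Solr to return both the matching documents and the per-field value counts.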

Simplified Data Processing for Search Workloads

Cloudera Search relies on Apache Tika for parsing and preparing many of the standard file formats for indexing. Additionally, Cloudera Search supports Avro, Hadoop Sequence, and Snappy file format mappings, as well as log file formats, JSON, XML, and HTML. Cloudera Search also provides data preprocessing using Morphlines. This built-in support simplifies index configuration for these formats, and you can reuse it in other applications such as MapReduce jobs.
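
Conceptually, a morphline is a chain of commands that each transform a record and pass it on. The Python sketch below mimics that idea only: real morphlines are defined in configuration files and run inside the JVM, and the functions here (read_json, lowercase_field, drop_if_missing) are illustrative Python stand-ins, not the actual Morphlines commands. The _raw key and the sample record are likewise hypothetical.

```python
import json

def read_json(record):
    """Parse the raw JSON payload and merge its fields into the record."""
    fields = json.loads(record["_raw"])
    rest = {key: value for key, value in record.items() if key != "_raw"}
    return {**rest, **fields}

def lowercase_field(name):
    """Build a command that lowercases one field when it is present."""
    def command(record):
        if name in record:
            return {**record, name: str(record[name]).lower()}
        return record
    return command

def drop_if_missing(name):
    """Build a command that discards records lacking a required field."""
    def command(record):
        return record if name in record else None
    return command

def run_pipeline(commands, record):
    """Pass the record through each command; stop early if one drops it."""
    for command in commands:
        record = command(record)
        if record is None:
            return None
    return record

pipeline = [read_json, drop_if_missing("id"), lowercase_field("level")]
raw = {"_raw": '{"id": "42", "level": "ERROR", "msg": "Disk full"}'}
doc = run_pipeline(pipeline, raw)
print(doc)
```

Because each step is independent, the same chain could feed either a near-real-time path or a batch job, which is the reuse the paragraph above describes.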

HBase Search

Cloudera Search integrates with HBase, enabling full-text search of data stored in HBase. This functionality, which does not affect HBase performance, is based on a listener that monitors the replication event stream. The listener captures each replicated write or update event, enabling extraction and mapping. The event is then sent directly to Solr indexers and written to indexes in HDFS, using the same process as for other Cloudera Search indexing workloads. The indexes can then be served immediately, enabling near-real-time search of HBase data.