Understanding Cloudera Search
Cloudera Search opens CDH to full-text search and exploration of data in HDFS and Apache HBase. Cloudera Search is powered by Apache Solr, enriching the industry-standard open source search solution with Hadoop platform integration and enabling a new generation of Big Data search. Cloudera Search makes it especially easy to query large data sets.
Understanding How Search Fits into Cloudera Offerings
Data stored in CDH can be explored in several ways, including:
- MapReduce jobs
- Cloudera Impala queries
- Cloudera Search queries
CDH alone allows storage of and access to large data sets, but without Cloudera Search or Impala, users must create MapReduce jobs to query that data. Writing these jobs requires technical knowledge, and each job can take minutes or more to run. Such long run times interrupt the process of exploring data. To provide a more immediate query-and-response experience and to eliminate the need to write MapReduce applications, Cloudera offers Real-Time Query (Impala). Impala returns results in seconds rather than minutes.
While Impala is fast and powerful, it uses SQL-based query syntax, which can be challenging for users who are not familiar with SQL. Cloudera Search provides rapid results for these less technical users. Impala, Hive, and Pig also require a structure, which is applied at query time, whereas Search supports free-text search over any data or fields you have indexed.
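To make the contrast with SQL concrete, the following is a minimal sketch of what a free-text query could look like from a client's point of view. It assumes the standard Apache Solr `/select` request handler with its `q`, `rows`, and `wt` parameters. The host name, port, and the `logs` collection are hypothetical placeholders, and the helper `build_search_url` is illustrative only, not part of Cloudera Search. The code builds the request URL and does not contact a server.

```python
from urllib.parse import urlencode


def build_search_url(base_url, collection, text, rows=10):
    """Build a Solr /select URL for a free-text query (no request is sent)."""
    # q    : the free-text query string typed by the user
    # rows : maximum number of matching documents to return
    # wt   : response writer; "json" asks Solr for a JSON response
    params = {"q": text, "rows": rows, "wt": "json"}
    return f"{base_url}/solr/{collection}/select?{urlencode(params)}"


# Hypothetical host and collection names, for illustration only.
url = build_search_url("http://search-host.example.com:8983", "logs", "disk failure")
print(url)
```

The user supplies only the words to look for. No table names, column names, or SQL clauses are needed, which is the point the comparison with Impala makes.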
Understanding How Search Leverages Existing Infrastructure
Any data already present in a CDH deployment can be indexed and made queryable by Cloudera Search. For data that is not stored in CDH, Cloudera Search offers tools for loading data into the existing infrastructure, as well as the ability to index data as it is moved to HDFS or written to HBase.
By leveraging existing infrastructure, Cloudera Search eliminates the need to create new, redundant structures. Furthermore, Cloudera Search uses services provided by CDH and Cloudera Manager in a way that does not interfere with other tasks running in the same environment. This means that you get the benefits of reusing existing infrastructure without the costs and problems associated with running multiple services on the same set of systems.