Cloudera Search Architecture
Cloudera Search runs as a distributed service on a set of servers, each responsible for a portion of the content to be searched. The content is split into smaller pieces, copies are made of those pieces, and the pieces are distributed among the servers. This provides two main advantages:
- Dividing the content into smaller pieces distributes the task of indexing the content among the servers.
- Duplicating the pieces of the whole allows queries to be scaled more effectively and enables the system to provide higher levels of availability.
Each Cloudera Search server can handle requests. As a result, a client can send an indexing or search request to any Search server, and that server routes the request to the server that holds the relevant data.
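Because any Search server can accept a request, a client only needs to pick one host and port. The sketch below builds a standard Solr query URL; the hostname and collection name are hypothetical placeholders, and 8983 is the default Solr port.

```python
from urllib.parse import urlencode

# Hypothetical host and collection; any Solr server in the deployment
# can accept the request and route it to the right server internally.
solr_host = "search01.example.com"
collection = "collection1"

def build_query_url(host, collection, query, rows=10):
    """Build a Solr select URL; the chosen host need not hold the data."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"http://{host}:8983/solr/{collection}/select?{params}"

url = build_query_url(solr_host, collection, "body:hadoop")
```

Sending an HTTP GET to this URL (for example with `urllib.request.urlopen`) returns the query results as JSON regardless of which Search server the client chose.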
Each search deployment requires:
- ZooKeeper on one host. You can install ZooKeeper, Search, and HDFS on the same host.
- HDFS on at least one but as many as all hosts. HDFS is commonly installed on all hosts.
- Solr on at least one but as many as all hosts. Solr is commonly installed on all hosts.
Adding more hosts with Solr and HDFS provides the following benefits:
- More search host installations doing work.
- More collocation of Search and HDFS, increasing the degree of data locality. More local data means faster performance and reduced network traffic.
The following graphic illustrates some of the key elements in a typical deployment.
This graphic illustrates:
- A client submits a query over HTTP.
- The request is received by the NameNode and then passed to a DataNode.
- The DataNode distributes the request among other hosts with relevant shards.
- The results of the query are gathered and returned to the client.
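The fan-out and merge in the steps above can be sketched as a toy scatter-gather in plain Python. The shard contents and scores here are invented sample data; a real deployment distributes this work across hosts, but the merge logic is the same idea.

```python
# Toy model of the query flow: each "shard" holds part of the index;
# the receiving host fans the query out to all shards, then merges
# the per-shard results by score before replying to the client.
shards = {
    "shard1": [("doc1", 0.9), ("doc2", 0.4)],   # invented sample data
    "shard2": [("doc3", 0.7), ("doc4", 0.2)],
}

def query_shard(shard_docs, rows):
    # Each shard returns its own top-N (doc id, score) pairs.
    return sorted(shard_docs, key=lambda d: d[1], reverse=True)[:rows]

def distributed_query(shards, rows=3):
    # Gather partial results from every shard, then merge globally.
    partial = [hit for docs in shards.values() for hit in query_shard(docs, rows)]
    return sorted(partial, key=lambda h: h[1], reverse=True)[:rows]

print(distributed_query(shards))  # top hits across all shards
```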
Also note that:
- The Cloudera Manager server provides client and server configuration files to other servers in the deployment.
- The ZooKeeper server provides information about the state of the cluster and the other hosts running Solr.
The information a client must send to complete jobs varies:
- For queries, a client must have the hostname of the Solr server and the port to use.
- For actions related to collections, such as adding or deleting collections, the name of the collection is required as well.
- Indexing jobs, such as MapReduceIndexer jobs, use a MapReduce driver that starts a MapReduce job. These jobs can also process morphlines and index the results, adding them to Solr.
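For collection actions, the Solr Collections API takes the collection name alongside the host and port. The sketch below builds such an admin URL; the host, port value, and collection name are hypothetical placeholders.

```python
from urllib.parse import urlencode

# Hypothetical values; per the list above, collection actions need
# the collection name in addition to the Solr hostname and port.
host, port = "search01.example.com", 8983

def collection_action_url(action, name, **extra):
    """Build a URL for the Solr Collections API (e.g. CREATE, DELETE)."""
    params = urlencode({"action": action, "name": name, **extra})
    return f"http://{host}:{port}/solr/admin/collections?{params}"

create_url = collection_action_url("CREATE", "emails", numShards=2)
delete_url = collection_action_url("DELETE", "emails")
```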
Cloudera Search Configuration Files
The configuration of a Cloudera Search deployment is based on the following files:
Solr files stored in ZooKeeper. Copies of these files exist on all Solr servers.
- solrconfig.xml: Contains the parameters for configuring Solr.
- schema.xml: Contains all of the details about which fields your documents can contain, and how those fields should be dealt with when adding documents to the index, or when querying those fields.
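A minimal, hypothetical schema.xml fragment illustrates the kind of field definitions the file holds. The field names and the schema name here are invented for illustration, and a real schema also defines the field types (such as text_general) that the fields reference.

```xml
<schema name="example" version="1.5">
  <!-- Each field declares its type and whether it is indexed and stored. -->
  <field name="id"   type="string"       indexed="true" stored="true" required="true"/>
  <field name="body" type="text_general" indexed="true" stored="true"/>
  <!-- The unique key identifies each document in the index. -->
  <uniqueKey>id</uniqueKey>
</schema>
```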
Files copied from hadoop-conf in HDFS configurations to the Solr servers:
Cloudera Manager manages the following configuration files:
The following files are used for logging and security configuration:
Search can be deployed using parcels or packages. Some files are always installed to the same location, and some are installed to different locations depending on whether the installation uses parcels or packages.
Client Files
Client files are always installed to the same location and are required on any host where the corresponding services are installed. In a Cloudera Manager environment, Cloudera Manager manages settings. In an unmanaged deployment, all files can be manually edited. All files are found in a subdirectory of /etc/. Client configuration file types and their locations are:
- /etc/solr/conf for Solr client settings files
- /etc/hadoop/conf for HDFS, MapReduce, and YARN client settings files
- /etc/zookeeper/conf for ZooKeeper configuration files