
Load and Index Data in Search

Execute the script. The path for the script often includes the product version, such as 5.1.0, so path details vary. The script is located in a subdirectory of one of the following locations:
  • Packages: /usr/share/doc. If Search for CDH 5.1.0 is installed to the default location using packages, the Quick Start script is found in /usr/share/doc/search-1.0.0+cdh5.1.0+0/quickstart.
  • Parcels: /opt/cloudera/parcels/CDH/share/doc. If Search for CDH 5.1.0 is installed to the default location using parcels, the Quick Start script is found in /opt/cloudera/parcels/CDH/share/doc/search-1.0.0+cdh5.1.0+0/quickstart.
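
If you are not sure which layout applies to your installation, you can search both documentation trees for the script. The following is an illustrative command only; the paths and version strings on your system may differ:

$ find /usr/share/doc /opt/cloudera/parcels/CDH/share/doc -name quickstart.sh 2>/dev/null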

The script uses several defaults. The defaults that you are most likely to modify include:

Table 1. Script Parameters and Defaults
Parameter | Default | Notes
NAMENODE_CONNECT | `hostname`:8020 | For use on an HDFS HA cluster. If you use NAMENODE_CONNECT, do not use NAMENODE_HOST or NAMENODE_PORT.
NAMENODE_HOST | `hostname` | If you use NAMENODE_HOST and NAMENODE_PORT, do not use NAMENODE_CONNECT.
NAMENODE_PORT | 8020 | If you use NAMENODE_HOST and NAMENODE_PORT, do not use NAMENODE_CONNECT.
ZOOKEEPER_HOST | `hostname` |
ZOOKEEPER_PORT | 2181 |
ZOOKEEPER_ROOT | /solr |
HDFS_USER | ${HDFS_USER:="${USER}"} |
SOLR_HOME | /opt/cloudera/parcels/SOLR/lib/solr |

By default, the script assumes it is running on the NameNode host, which is also running ZooKeeper. Override these defaults with custom values when you start quickstart.sh. For example, to use an alternate NameNode and HDFS user ID, you could start the script as follows:

$ NAMENODE_HOST=nnhost HDFS_USER=jsmith ./quickstart.sh
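
On an HDFS HA cluster, you would set NAMENODE_CONNECT instead of NAMENODE_HOST and NAMENODE_PORT. The following is an illustrative sketch only; nameservice1 and zkhost are hypothetical values for the HA nameservice ID and the ZooKeeper host:

$ NAMENODE_CONNECT=nameservice1 ZOOKEEPER_HOST=zkhost ./quickstart.sh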

Further discussion of the script

The first time the script runs, it downloads required files such as the Enron data and configuration files. On subsequent runs, the script reuses the Enron data it has already downloaded rather than downloading it again, and uses that existing data to re-create the enron-email-collection SolrCloud collection.

  Note: Downloading the data from its server, expanding it, and uploading it can be time consuming. The time required ultimately depends on your connection and CPU speed; fifteen minutes is typical, and longer is not uncommon.
The script also generates a Solr configuration and creates a collection in SolrCloud. The rest of this section describes what the script does and, if you prefer, how to complete these steps manually; a sketch of example manual commands follows the task list below.

The script completes the following tasks:

  1. Set variables such as host names and directories.
  2. Create a directory to which to copy the Enron data and then copy that data to this location. This data is about 422 MB and in some tests took around five minutes to download and two minutes to untar.
  3. Create a directory for the current user in HDFS, change ownership of that directory to the current user, create a directory for the Enron data, and load the Enron data to that directory. In some tests, it took around a minute to copy the approximately 3 GB of untarred data.
  4. Use solrctl to create a template of the instance directory.
  5. Use solrctl to create a new Solr collection for the Enron mail data.
  6. Create a directory to which the MapReduceIndexerTool can write results. Ensure the directory is empty.
  7. Use the MapReduceIndexerTool to index the Enron data and push the result live to enron-email-collection. In some tests, it took around seven minutes to complete this task.
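
If you prefer to complete these steps manually, the outline below sketches roughly equivalent commands. It is a hedged example rather than a copy of the script: the local data path (/tmp/enron), the HDFS paths, the instance directory name, the shard count, the morphline file path, the job JAR location, and the nnhost and zkhost host names are all assumed values that vary by environment. In particular, the script downloads its own configuration files, including the morphline configuration, so substitute the actual paths from your installation.

$ # Step 3: create an HDFS home directory for the current user and load the Enron data.
$ sudo -u hdfs hdfs dfs -mkdir -p /user/$USER
$ sudo -u hdfs hdfs dfs -chown $USER /user/$USER
$ hdfs dfs -mkdir -p /user/$USER/enron/indir
$ hdfs dfs -put /tmp/enron/* /user/$USER/enron/indir

$ # Step 4: generate a local template of the instance directory.
$ solrctl instancedir --generate $HOME/enron_config

$ # Step 5: upload the instance directory to ZooKeeper and create the collection.
$ solrctl instancedir --create enron-email-collection $HOME/enron_config
$ solrctl collection --create enron-email-collection -s 2

$ # Step 6: make sure the indexer output directory exists and is empty.
$ hdfs dfs -rm -f -r -skipTrash /user/$USER/enron/outdir
$ hdfs dfs -mkdir -p /user/$USER/enron/outdir

$ # Step 7: index the data and push the result live to the collection.
$ hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
    org.apache.solr.hadoop.MapReduceIndexerTool \
    --morphline-file /path/to/morphline.conf \
    --output-dir hdfs://nnhost:8020/user/$USER/enron/outdir \
    --go-live \
    --zk-host zkhost:2181/solr \
    --collection enron-email-collection \
    hdfs://nnhost:8020/user/$USER/enron/indir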