Preparing to Index Sample Tweets with Cloudera Search

To prepare for indexing tweets with MapReduce or Flume, complete the following steps:

  1. Start a SolrCloud cluster containing at least two servers (this example uses two shards) as described in Deploying Cloudera Search.
  2. On a host running Solr Server, make sure that the SOLR_ZK_ENSEMBLE environment variable is set in /etc/solr/conf/solr-env.sh. For example:
    $ cat /etc/solr/conf/solr-env.sh
    export SOLR_ZK_ENSEMBLE=zk01.example.com:2181,zk02.example.com:2181,zk03.example.com:2181/solr

    If you are using Cloudera Manager, this is automatically set on hosts with a Solr Server or Gateway role.
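    Before continuing, it can be useful to confirm that each ZooKeeper server in the ensemble is reachable. One way to do this, assuming nc is installed and ZooKeeper's four-letter-word commands are enabled (newer ZooKeeper releases restrict them via zookeeper.4lw.commands.whitelist), is to send the ruok command to each server. The hostnames below are the example values from solr-env.sh; substitute your own:

    ```shell
    # Send ZooKeeper's "ruok" health-check command to each server in the
    # ensemble; a healthy server replies "imok".
    for zk in zk01.example.com zk02.example.com zk03.example.com; do
      echo "$zk: $(echo ruok | nc -w 2 "$zk" 2181)"
    done
    ```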

  3. Generate the configuration files for the collection, including the tweet-specific schema.xml:
    • Parcel-based Installation:
      $ solrctl instancedir --generate $HOME/tweet_config
      $ cp /opt/cloudera/parcels/CDH/share/doc/search*/examples/solr-nrt/collection1/conf/schema.xml \
      $HOME/tweet_config/conf
    • Package-based Installation:
      $ solrctl instancedir --generate $HOME/tweet_config
      $ cp /usr/share/doc/search*/examples/solr-nrt/collection1/conf/schema.xml \
      $HOME/tweet_config/conf
  4. Upload the configuration to ZooKeeper:
    $ solrctl instancedir --create tweet_config $HOME/tweet_config/
  5. Create a new collection:
    $ solrctl collection --create tweets -s 2 -c tweet_config
  6. Verify that the collection is live. Open the Solr admin web interface in a browser at http://search01.example.com:8983/solr/#/~cloud, replacing search01.example.com with the name of any host running the Solr Server process. Look for the tweets collection.
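    As an alternative to the browser check, you can verify the collection from the command line. The following sketch uses solrctl to list collections and the standard Solr Collections API to report cluster state; the hostname is the example value from this guide, so replace it with any host running the Solr Server process:

    ```shell
    # List all collections known to the SolrCloud cluster;
    # "tweets" should appear in the output.
    solrctl collection --list

    # Query the Solr Collections API for the full cluster state,
    # including shard and replica status for each collection.
    curl "http://search01.example.com:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json"
    ```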
  7. Prepare the configuration for use with MapReduce:
    $ cp -r $HOME/tweet_config $HOME/mr_tweet_config
  8. Copy sample tweets to HDFS:
    • Parcel-based Installation:
      $ sudo -u hdfs hdfs dfs -mkdir -p /user/jdoe
      $ sudo -u hdfs hdfs dfs -chown jdoe:jdoe /user/jdoe
      $ hdfs dfs -mkdir -p /user/jdoe/indir
      $ hdfs dfs -put /opt/cloudera/parcels/CDH/share/doc/search*/examples/test-documents/sample-statuses-*.avro \
      /user/jdoe/indir/
      $ hdfs dfs -ls /user/jdoe/indir
    • Package-based Installation:
      $ sudo -u hdfs hdfs dfs -mkdir -p /user/jdoe
      $ sudo -u hdfs hdfs dfs -chown jdoe:jdoe /user/jdoe
      $ hdfs dfs -mkdir -p /user/jdoe/indir
      $ hdfs dfs -put /usr/share/doc/search*/examples/test-documents/sample-statuses-*.avro \
      /user/jdoe/indir/
      $ hdfs dfs -ls /user/jdoe/indir
  9. Ensure that the output directory, outdir, exists in HDFS and is empty:
    $ hdfs dfs -rm -r -skipTrash /user/jdoe/outdir
    $ hdfs dfs -mkdir /user/jdoe/outdir
    $ hdfs dfs -ls /user/jdoe/outdir
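    Note that the rm command above fails if outdir does not yet exist, for example on a first run. A minimal idempotent variant checks for the directory before removing it:

    ```shell
    # Remove /user/jdoe/outdir only if it already exists, then recreate
    # it empty so the MapReduce job has a clean output location.
    if hdfs dfs -test -d /user/jdoe/outdir; then
      hdfs dfs -rm -r -skipTrash /user/jdoe/outdir
    fi
    hdfs dfs -mkdir /user/jdoe/outdir
    ```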

The sample tweets are now in HDFS and ready to be indexed. Continue to Using MapReduce Batch Indexing to Index Sample Tweets to index the sample tweets or to Near Real Time (NRT) Indexing Tweets Using Flume to index live tweets from the Twitter firehose.