Preparing to Index Sample Tweets with Cloudera Search

In this section of the Cloudera Search tutorial, you will create a collection for tweets. The remaining examples in the tutorial use the same collection, so make sure that you follow these instructions carefully.

Configuring Sentry for Tweet Collection

If you have enabled Apache Sentry for authorization, you must have update permission for the admin collection as well as the collection you are creating (cloudera_tutorial_tweets in this example). You can also use the wildcard (*) to grant permissions to create any collection.

For more information on configuring Sentry and granting permissions, see Configuring Sentry Authorization for Cloudera Search.

To grant your user account (jdoe in this example) the necessary permissions:

  1. Switch to the Sentry admin user (solr in this example) using kinit:
    $ kinit solr@EXAMPLE.COM
  2. Grant update privileges to the cloudera_tutorial_role role for the admin and cloudera_tutorial_tweets collections:
    $ solrctl sentry --grant-privilege cloudera_tutorial_role 'collection=admin->action=update'
    $ solrctl sentry --grant-privilege cloudera_tutorial_role 'collection=cloudera_tutorial_tweets->action=update'
    The cloudera_tutorial_role role was created in Configuring Sentry for Test Collection. For more information on the Sentry privilege model for Cloudera Search, see Authorization Privilege Model for Solr.
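
The two grant commands in step 2 differ only in the collection name, so they can be generated from a list. The following is a minimal sketch, not part of solrctl: the function names privilege_for and print_grants are hypothetical, and the script only prints the commands so that you can review them before running them as the Sentry admin user.

```shell
# Hypothetical helper (not part of solrctl): build the privilege string that
# `solrctl sentry --grant-privilege` takes for one collection and one action.
privilege_for() {
  printf 'collection=%s->action=%s\n' "$1" "$2"
}

# Print one grant command per collection for the given role.
# Nothing is executed; the output is meant to be reviewed first.
print_grants() {
  role="$1"
  shift
  for collection in "$@"; do
    printf "solrctl sentry --grant-privilege %s '%s'\n" \
      "$role" "$(privilege_for "$collection" update)"
  done
}

# Role and collection names are the ones used in this tutorial.
print_grants cloudera_tutorial_role admin cloudera_tutorial_tweets
```

Passing '*' as a collection name would produce the wildcard form mentioned earlier; treat that exact syntax as an assumption and confirm it against Authorization Privilege Model for Solr before relying on it.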

Create a Collection for Tweets

  1. On a host with Solr Server installed, make sure that the SOLR_ZK_ENSEMBLE environment variable is set in /etc/solr/conf/solr-env.sh. For example:
    $ cat /etc/solr/conf/solr-env.sh
    export SOLR_ZK_ENSEMBLE=zk01.example.com:2181,zk02.example.com:2181,zk03.example.com:2181/solr

    If you are using Cloudera Manager, this is automatically set on hosts with a Solr Server or Gateway role.

  2. If you are using Kerberos, run kinit as the user (jdoe in this example) who has privileges to create the collection:
    $ kinit jdoe@EXAMPLE.COM

    Replace EXAMPLE.COM with your Kerberos realm name.

  3. Generate the configuration files for the collection, including the tweet-specific schema.xml:
    • Parcel-based Installation:
      $ solrctl instancedir --generate $HOME/cloudera_tutorial_tweets_config
      $ cp /opt/cloudera/parcels/CDH/share/doc/search*/examples/solr-nrt/collection1/conf/schema.xml \
      $HOME/cloudera_tutorial_tweets_config/conf
    • Package-based Installation:
      $ solrctl instancedir --generate $HOME/cloudera_tutorial_tweets_config
      $ cp /usr/share/doc/search*/examples/solr-nrt/collection1/conf/schema.xml \
      $HOME/cloudera_tutorial_tweets_config/conf
  4. If you are using Apache Sentry for authorization, overwrite solrconfig.xml with solrconfig.xml.secure. If you omit this step, Sentry authorization is not enabled for the collection:
    $ cp $HOME/cloudera_tutorial_tweets_config/conf/solrconfig.xml.secure $HOME/cloudera_tutorial_tweets_config/conf/solrconfig.xml
  5. Upload the configuration to ZooKeeper:
    $ solrctl instancedir --create cloudera_tutorial_tweets_config $HOME/cloudera_tutorial_tweets_config
  6. Create a new collection with two shards (specified by the -s parameter) using the named configuration (specified by the -c parameter):
    $ solrctl collection --create cloudera_tutorial_tweets -s 2 -c cloudera_tutorial_tweets_config
  7. Verify that the collection is live. Open the Solr admin web interface in a browser by accessing the following URL:
    • Security Enabled: https://search01.example.com:8985/solr/#/~cloud
    • Security Disabled: http://search01.example.com:8983/solr/#/~cloud
    If you have security enabled on your cluster, enter the credentials for the solr@EXAMPLE.COM principal when prompted. Replace search01.example.com with the name of any host running the Solr Server process. Look for the cloudera_tutorial_tweets collection to verify that it exists.
  8. Prepare the configuration for use with MapReduce:
    $ cp -r $HOME/cloudera_tutorial_tweets_config $HOME/cloudera_tutorial_tweets_mr_config
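
Step 1 depends on SOLR_ZK_ENSEMBLE being set to a usable ZooKeeper connect string, and a malformed value typically surfaces only later, when solrctl fails. The following is a minimal sketch of a sanity check to run beforehand. The function name is_valid_zk_ensemble is hypothetical, and the pattern (host:port pairs separated by commas, with an optional /chroot suffix) is an illustrative assumption rather than an official grammar.

```shell
# Hypothetical helper: return 0 if the argument looks like a ZooKeeper
# connect string of the form host:port[,host:port...][/chroot].
# The pattern is an assumption for illustration, not an official grammar.
is_valid_zk_ensemble() {
  printf '%s\n' "$1" | grep -Eq \
    '^[A-Za-z0-9.-]+:[0-9]+(,[A-Za-z0-9.-]+:[0-9]+)*(/[A-Za-z0-9._/-]*)?$'
}

# On a real host, load the value first with: . /etc/solr/conf/solr-env.sh
# If the variable is unset, fall back to the example value from step 1.
SOLR_ZK_ENSEMBLE=${SOLR_ZK_ENSEMBLE:-zk01.example.com:2181,zk02.example.com:2181,zk03.example.com:2181/solr}

if is_valid_zk_ensemble "$SOLR_ZK_ENSEMBLE"; then
  echo "SOLR_ZK_ENSEMBLE looks well formed: $SOLR_ZK_ENSEMBLE"
else
  echo "SOLR_ZK_ENSEMBLE is missing or malformed" >&2
fi
```

The check validates only the shape of the string; it does not confirm that the ZooKeeper hosts are reachable.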

Copy Sample Tweets to HDFS

  1. Copy the provided sample tweets to HDFS. These tweets will be used to demonstrate the batch indexing capabilities of Cloudera Search:
    • Parcel-based Installation (Security Enabled):
      $ kinit hdfs@EXAMPLE.COM
      $ hdfs dfs -mkdir -p /user/jdoe
      $ hdfs dfs -chown jdoe:jdoe /user/jdoe
      $ kinit jdoe@EXAMPLE.COM
      $ hdfs dfs -mkdir -p /user/jdoe/indir
      $ hdfs dfs -put /opt/cloudera/parcels/CDH/share/doc/search*/examples/test-documents/sample-statuses-*.avro \
      /user/jdoe/indir/
      $ hdfs dfs -ls /user/jdoe/indir
    • Parcel-based Installation (Security Disabled):
      $ sudo -u hdfs hdfs dfs -mkdir -p /user/jdoe
      $ sudo -u hdfs hdfs dfs -chown jdoe:jdoe /user/jdoe
      $ hdfs dfs -mkdir -p /user/jdoe/indir
      $ hdfs dfs -put /opt/cloudera/parcels/CDH/share/doc/search*/examples/test-documents/sample-statuses-*.avro \
      /user/jdoe/indir/
      $ hdfs dfs -ls /user/jdoe/indir
    • Package-based Installation (Security Enabled):
      $ kinit hdfs@EXAMPLE.COM
      $ hdfs dfs -mkdir -p /user/jdoe
      $ hdfs dfs -chown jdoe:jdoe /user/jdoe
      $ kinit jdoe@EXAMPLE.COM
      $ hdfs dfs -mkdir -p /user/jdoe/indir
      $ hdfs dfs -put /usr/share/doc/search*/examples/test-documents/sample-statuses-*.avro \
      /user/jdoe/indir/
      $ hdfs dfs -ls /user/jdoe/indir
    • Package-based Installation (Security Disabled):
      $ sudo -u hdfs hdfs dfs -mkdir -p /user/jdoe
      $ sudo -u hdfs hdfs dfs -chown jdoe:jdoe /user/jdoe
      $ hdfs dfs -mkdir -p /user/jdoe/indir
      $ hdfs dfs -put /usr/share/doc/search*/examples/test-documents/sample-statuses-*.avro \
      /user/jdoe/indir/
      $ hdfs dfs -ls /user/jdoe/indir
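  2. Ensure that outdir exists in HDFS and is empty. If outdir does not exist yet, the rm command reports that the path does not exist; you can ignore that message: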
    $ hdfs dfs -rm -r -skipTrash /user/jdoe/outdir
    $ hdfs dfs -mkdir /user/jdoe/outdir
    $ hdfs dfs -ls /user/jdoe/outdir
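
The four variants in step 1 differ only in how you authenticate and in where the sample files are installed. The following is a minimal sketch of selecting the source path by installation type. The function name sample_tweets_glob and the variables INSTALL_TYPE and HDFS_USER are hypothetical; the two paths are the ones used in step 1. The script prints the upload command instead of running it.

```shell
# Hypothetical helper: print the sample-tweet glob for an installation type.
# Both paths are copied from the steps in this section.
sample_tweets_glob() {
  case "$1" in
    parcel)
      echo '/opt/cloudera/parcels/CDH/share/doc/search*/examples/test-documents/sample-statuses-*.avro'
      ;;
    package)
      echo '/usr/share/doc/search*/examples/test-documents/sample-statuses-*.avro'
      ;;
    *)
      echo "usage: sample_tweets_glob parcel|package" >&2
      return 1
      ;;
  esac
}

# Illustrative settings; override them in the environment as needed.
INSTALL_TYPE=${INSTALL_TYPE:-parcel}
HDFS_USER=${HDFS_USER:-jdoe}

# Print (do not run) the upload command for the chosen installation type.
echo "hdfs dfs -put $(sample_tweets_glob "$INSTALL_TYPE") /user/$HDFS_USER/indir/"
```

Authentication (kinit or sudo -u hdfs) and creation of the home directory are deliberately left out, because they depend on whether security is enabled.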

The sample tweets are now in HDFS and ready to be indexed. To index the sample tweets, continue to Using MapReduce Batch Indexing to Index Sample Tweets. To index live tweets from the Twitter public stream instead, continue to Near Real Time (NRT) Indexing Tweets Using Flume.