Indexing a File Containing Tweets with Flume HTTPSource

HTTPSource lets you ingest data into Solr by POSTing a file over HTTP. HTTPSource passes the data through a channel to a sink, in this case a SolrSink. For more information, see Flume Solr BlobHandler Configuration Options.

  1. Stop the Flume agent, then delete all existing documents in Solr:
    $ sudo /etc/init.d/flume-ng-agent stop
    $ solrctl collection --deletedocs collection1
  2. Comment out TwitterSource in /etc/flume-ng/conf/flume.conf and uncomment HTTPSource:
    # comment out "agent.sources = twitterSrc"
    # uncomment "agent.sources = httpSrc"
  3. Restart the Flume agent:
    $ sudo /etc/init.d/flume-ng-agent restart
  4. Send a file containing tweets to the HTTPSource:
    • Parcel-based Installation:
      $ curl --data-binary \
      @/opt/cloudera/parcels/CDH/share/doc/search-*/examples/test-documents/sample-statuses-20120906-141433-medium.avro \
      'http://127.0.0.1:5140?resourceName=sample-statuses-20120906-141433-medium.avro' \
      --header 'Content-Type:application/octet-stream' --verbose
    • Package-based Installation:
      $ curl --data-binary \
      @/usr/share/doc/search-*/examples/test-documents/sample-statuses-20120906-141433-medium.avro \
      'http://127.0.0.1:5140?resourceName=sample-statuses-20120906-141433-medium.avro' \
      --header 'Content-Type:application/octet-stream' --verbose
  5. Check the log for status or errors:
    $ cat /var/log/flume-ng/flume.log
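
The edit in step 2 can also be scripted. The following is a minimal sketch, not part of the product: the function name switch_to_http_source is hypothetical, and it assumes that flume.conf contains the two lines quoted in step 2 exactly, with the inactive one prefixed by "#". It saves a .bak copy before rewriting the file.

```shell
#!/bin/sh
# Hypothetical helper: make httpSrc the active Flume source instead of twitterSrc.
# Assumption: the file holds the lines "agent.sources = twitterSrc" and
# "#agent.sources = httpSrc" as quoted in step 2. All other lines are left alone.
switch_to_http_source() {
  conf="$1"
  # Keep a backup so the change can be reverted by copying it back.
  cp "$conf" "$conf.bak" || return 1
  # First expression comments out the Twitter source line,
  # second expression uncomments the HTTP source line.
  sed -e 's/^[[:space:]]*agent\.sources[[:space:]]*=[[:space:]]*twitterSrc[[:space:]]*$/#agent.sources = twitterSrc/' \
      -e 's/^[[:space:]]*#[[:space:]]*agent\.sources[[:space:]]*=[[:space:]]*httpSrc[[:space:]]*$/agent.sources = httpSrc/' \
      "$conf.bak" > "$conf"
}

# Usage (requires write access to the file):
#   switch_to_http_source /etc/flume-ng/conf/flume.conf
```

Running the function a second time leaves the two lines unchanged, but it overwrites the .bak copy with the already-edited file.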

Use the Cloudera Search GUI at http://localhost:8983/solr/collection1/select?q=*%3A*&wt=json&indent=true to verify that new tweets have been ingested into Solr as expected.
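
The same check can be made from the command line by reading the numFound field of the Solr JSON response. This is a rough sketch under assumptions: the helper name num_found is hypothetical, and it relies on simple text matching rather than a JSON parser, so it expects the usual "numFound":N form in the response.

```shell
#!/bin/sh
# Hypothetical helper: read a Solr JSON response on stdin and print the
# value of its first "numFound" field (the number of matching documents).
num_found() {
  sed -n 's/.*"numFound"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Usage, against the URL given above (Solr must be running):
#   curl -s 'http://localhost:8983/solr/collection1/select?q=*%3A*&wt=json&indent=true' | num_found
```

A result of 0 right after step 1 and a positive number after step 4 suggests that the posted file was indexed.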