Indexing a File Containing Tweets with Flume SpoolDirectorySource

SpoolDirectorySource specifies a directory on a local disk that Flume monitors. Flume automatically transfers data from files placed in this directory to Solr. SpoolDirectorySource sends the data through a channel to a sink, in this case SolrSink.

Stop the Flume agent, then delete all existing documents in Solr:

$ sudo /etc/init.d/flume-ng-agent stop
$ solrctl collection --deletedocs collection1

Comment out TwitterSource and HTTPSource in /etc/flume-ng/conf/flume.conf and uncomment SpoolDirectorySource:

# Comment out "agent.sources = twitterSrc"
# Comment out "agent.sources = httpSrc"
agent.sources = spoolSrc
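
The same edit can be scripted. The following is a minimal sketch, not part of the shipped configuration tooling: `use_spool_source` is a hypothetical helper, and it assumes that flume.conf contains lines of the exact form `agent.sources = <name>`, some of them commented out with a leading `#`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: make spoolSrc the only active source in a flume.conf.
# Assumption: the file holds lines of the form "agent.sources = <name>",
# some of them possibly commented out with a leading "#".
use_spool_source() {
  local conf="$1"
  # Keep a backup copy as <conf>.bak, then rewrite the file in place.
  sed -i.bak -E \
    -e 's/^[[:space:]]*(agent\.sources[[:space:]]*=[[:space:]]*(twitterSrc|httpSrc)[[:space:]]*)$/#\1/' \
    -e 's/^[[:space:]]*#[[:space:]]*(agent\.sources[[:space:]]*=[[:space:]]*spoolSrc[[:space:]]*)$/\1/' \
    "$conf"
}
```

Calling `use_spool_source /etc/flume-ng/conf/flume.conf` with sufficient privileges would comment out the active TwitterSource and HTTPSource lines and uncomment the SpoolDirectorySource line. Review the result before restarting the agent.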

Delete any old spool directory and create a new spool directory:
```
$ sudo -u flume rm -fr /tmp/myspooldir
$ sudo -u flume mkdir /tmp/myspooldir
```

Restart the Flume agent:

$ sudo /etc/init.d/flume-ng-agent restart

Send a file containing tweets to the SpoolDirectorySource. To ensure that no partially written file is ingested, copy the file into the spool directory under a hidden (dot-prefixed) name, then atomically move it to its final name:

Parcel-based Installation:

$ sudo -u flume cp \
/opt/cloudera/parcels/CDH/share/doc/search*/examples/test-documents/sample-statuses-20120906-141433-medium.avro \
/tmp/myspooldir/.sample-statuses-20120906-141433-medium.avro
$ sudo -u flume mv /tmp/myspooldir/.sample-statuses-20120906-141433-medium.avro \
/tmp/myspooldir/sample-statuses-20120906-141433-medium.avro

Package-based Installation:

$ sudo -u flume cp \
/usr/share/doc/search*/examples/test-documents/sample-statuses-20120906-141433-medium.avro \
/tmp/myspooldir/.sample-statuses-20120906-141433-medium.avro
$ sudo -u flume mv /tmp/myspooldir/.sample-statuses-20120906-141433-medium.avro \
/tmp/myspooldir/sample-statuses-20120906-141433-medium.avro
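
The copy-then-move pattern used in both variants can be wrapped in a small function. This is only a sketch: `spool_file` is a hypothetical helper name, and it assumes that the spool directory is a single filesystem, so that the final `mv` is a rename rather than a second copy.

```shell
#!/usr/bin/env bash
# Hypothetical helper: stage a file under a hidden name, then rename it
# into place so the spool directory never exposes a half-written file.
spool_file() {
  local src="$1" spool_dir="$2"
  local name
  name="$(basename "$src")"
  # Copy to a dot-prefixed temporary name inside the spool directory ...
  cp "$src" "$spool_dir/.$name" &&
  # ... then rename it; a rename within one filesystem is atomic.
  mv "$spool_dir/.$name" "$spool_dir/$name"
}
```

As in the commands above, the function would need to run as the flume user, for example from a script started with `sudo -u flume`.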

Check the log for status or errors:
```
$ cat /var/log/flume-ng/flume.log
```
Check the completion status. By default, Flume renames each fully ingested file by appending a .COMPLETED suffix:
```
$ find /tmp/myspooldir
```
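
Rather than rerunning `find` by hand, the check can be polled. The sketch below assumes that the source keeps Flume's default `.COMPLETED` file suffix. `wait_for_completed` and its timeout argument are hypothetical names introduced for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: wait until Flume has renamed a spooled file.
# Assumption: the source uses the default ".COMPLETED" suffix.
# Arguments: path of the spooled file, optional timeout in seconds (default 60).
wait_for_completed() {
  local file="$1" timeout="${2:-60}" waited=0
  until [ -e "$file.COMPLETED" ]; do
    # Give up with a non-zero status once the timeout is reached.
    [ "$waited" -ge "$timeout" ] && return 1
    sleep 1
    waited=$((waited + 1))
  done
}
```

For example, `wait_for_completed /tmp/myspooldir/sample-statuses-20120906-141433-medium.avro 120` returns success as soon as the renamed file appears, and fails if it has not appeared after about two minutes.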

Use the Cloudera Search GUI to verify that the new tweets have been ingested into Solr. For example, on localhost, open http://localhost:8983/solr/collection1/select?q=*%3A*&wt=json&indent=true.
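
The same query can also be checked from the command line. The sketch below extracts the document count from a Solr JSON response with standard text tools. `num_found` is a hypothetical helper, and it assumes that the response contains a `"numFound"` field, as Solr select responses normally do.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a Solr JSON response on stdin and print the
# value of its first "numFound" field.
num_found() {
  grep -oE '"numFound"[[:space:]]*:[[:space:]]*[0-9]+' | head -n 1 | grep -oE '[0-9]+$'
}

# Example call (requires a running Solr instance, so it is left commented out):
# curl -s 'http://localhost:8983/solr/collection1/select?q=*%3A*&wt=json&rows=0' | num_found
```

A count greater than zero after the spooled file has been processed suggests that the tweets reached Solr. A count of zero right after the `--deletedocs` step is expected.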

