Exercise 4: Explore log events interactively

Since sales are dropping and nobody knows why, you want to provide a way for people to interactively and flexibly explore data from the website. We can do this by indexing it for use in Apache Solr, where users can do text searches, drill down through different categories, etc. Data can be indexed by Solr in batch using MapReduce, or you can index tables in Apache HBase and get real-time updates. To analyze data from the website, however, we're going to stream the log data in using Flume.

The web log data is a standard web server log which may look something like this:
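
As an illustration only, a line in the widely used Apache "combined" log format can be split into named fields as in the Python sketch below. The sample line, the regular expression, and the field names are assumptions made for this example; they are not the tutorial's own data or schema.

```python
import re

# Hypothetical sample line in Apache "combined" log format (not the tutorial's data).
SAMPLE_LINE = (
    '203.0.113.7 - - [14/Jun/2014:10:30:13 -0400] '
    '"GET /department/apparel/products HTTP/1.1" 200 1322 '
    '"http://example.com/home" "Mozilla/5.0 (X11; Linux x86_64)"'
)

# One named group per field we might want to search on later.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<request>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


fields = parse_log_line(SAMPLE_LINE)
print(fields["ip"], fields["request"], fields["status"])
```

Each key in the resulting dictionary corresponds to the kind of field that could be indexed and searched separately.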

Solr organizes data similarly to the way a SQL database does. Each record is called a 'document' and consists of fields defined by the schema, just like a row in a database table. Instead of a table, Solr calls it a 'collection' of documents. The difference is that data in Solr tends to be more loosely structured. Fields may be optional, and instead of always matching exact values, you can also enter text queries that partially match a field, just as if you were searching for web pages. You'll also see Hue refer to 'shards'. That is just the way Solr breaks collections up to spread them around the cluster so you can search all your data in parallel.
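
To make the idea of text queries and drill-down more concrete, here is a minimal Python sketch that only builds the URL for a request to Solr's select handler. The host and port, the `department` and `status` field names, and the helper name `build_select_url` are assumptions for illustration; `live_logs` is the collection created later in this exercise. The parameters `q`, `fq`, `wt`, `facet`, and `facet.field` are standard Solr query parameters.

```python
from urllib.parse import urlencode


def build_select_url(base_url, collection, text, facet_field=None, filters=()):
    """Build a Solr /select URL; nothing is sent over the network."""
    params = [("q", text), ("wt", "json")]
    # Each filter query (fq) narrows the result set, like drilling into a category.
    for fq in filters:
        params.append(("fq", fq))
    # Faceting asks Solr to count matching documents per value of a field.
    if facet_field:
        params += [("facet", "true"), ("facet.field", facet_field)]
    return f"{base_url}/{collection}/select?{urlencode(params)}"


# Hypothetical host, field names, and search term.
url = build_select_url(
    "http://localhost:8983/solr",
    "live_logs",
    "apparel",
    facet_field="department",
    filters=["status:200"],
)
print(url)
```

In this tutorial the Hue Search UI issues queries like this for you, so the sketch is only meant to show what a text search with a filter and a facet looks like underneath.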

Here is how you can start real-time indexing of the sample web server log data with Cloudera Search and Flume, and then use the Search UI in Hue to explore it:

Create Your Search Index

Ordinarily when you are deploying a new search schema, there are four steps:

  1. Creating an empty configuration

    First, generate the configs by executing the following command:

    > solrctl --zk {{zookeeper_connection_string}}/solr instancedir --generate solr_configs
    

    The result of this command is a skeleton configuration that you can customize to your liking by editing conf/schema.xml.

  2. Editing your schema

    The area of conf/schema.xml that you are most likely to be interested in is the <fields></fields> section. There you define the fields that are present and searchable in your index.

  3. Uploading your configuration

    > cd /opt/examples/flume
    > solrctl --zk {{zookeeper_connection_string}}/solr instancedir --create live_logs ./solr_configs

    Here {{zookeeper_connection_string}} stands for your cluster's ZooKeeper connection string. If it lists IP addresses, you may need to replace them with those of your three data nodes.

  4. Creating your collection

    > solrctl --zk {{zookeeper_connection_string}}/solr collection --create live_logs -s {{ number of solr servers }}

    The -s option sets the number of shards for the collection. As in the previous step, you may need to replace any IP addresses in the connection string with those of your three data nodes.
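
Because the solrctl invocations in these steps share the same --zk argument, it can help to assemble them in one place before running anything. The Python sketch below only builds and prints the command lines; it does not execute them. The helper name `solrctl_commands`, the ZooKeeper hosts, and the shard count are hypothetical values, and the subcommands and flags are exactly the ones shown in the steps above.

```python
import shlex


def solrctl_commands(zk_connection_string, num_shards,
                     name="live_logs", config_dir="./solr_configs"):
    """Return the solrctl command lines from the steps above as argument lists."""
    base = ["solrctl", "--zk", f"{zk_connection_string}/solr"]
    return [
        base + ["instancedir", "--generate", "solr_configs"],
        base + ["instancedir", "--create", name, config_dir],
        base + ["collection", "--create", name, "-s", str(num_shards)],
    ]


# Hypothetical ZooKeeper connection string and shard count.
commands = solrctl_commands("zk1:2181,zk2:2181,zk3:2181", 3)
for command in commands:
    print(shlex.join(command))
```

Printing the commands first makes it easy to check the connection string and shard count before pasting them into a shell.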

You can verify that you successfully created your collection in Solr by going to Hue and clicking Search in the top menu.

Then click on Indexes from the top right to see all of the indexes/collections.

Now you can see the collection that we just created, live_logs. Click on it.

You are now viewing the fields that we defined in our schema.xml file.

Now that you have verified that your search collection/index was created successfully, we can start putting data into it using Flume and Morphlines. Flume is a tool for ingesting streams of data into your cluster from sources such as log files, network streams, and more. Morphlines is a Java library for doing ETL on-the-fly, and it's an excellent companion to Flume. It allows you to define a chain of tasks such as reading records, parsing and formatting individual fields, and deciding where to send them. We've defined a morphline that reads records from Flume, breaks them into the fields we want to search on, and loads them into Solr (you can read more about Morphlines here). This example morphline is defined at /opt/examples/flume/conf/morphline.conf, and we're going to use it to index our records in real time as they're created and ingested by Flume.
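
Conceptually, a morphline is a chain of commands that each record passes through in order. The Python sketch below imitates that idea with plain functions. It is an analogy only: it is not the Morphlines API and does not reflect the contents of morphline.conf, and every function name, field name, and input line in it is invented for illustration.

```python
def strip_newline(record):
    """Step 1: clean up the raw line read from the source."""
    return {**record, "message": record["message"].rstrip("\n")}


def split_fields(record):
    """Step 2: break the line into named fields; drop records that do not fit."""
    parts = record["message"].split()
    if len(parts) != 3:
        return None  # returning None drops the record from the chain
    ip, method, request = parts
    return {**record, "ip": ip, "method": method, "request": request}


def make_loader(index):
    """Step 3: build a command that 'loads' records into a list standing in for Solr."""
    def load(record):
        index.append(record)
        return record
    return load


def run_chain(record, commands):
    """Pass one record through every command, stopping if a command drops it."""
    for command in commands:
        record = command(record)
        if record is None:
            return None
    return record


index = []  # stand-in for the search index
chain = [strip_newline, split_fields, make_loader(index)]
for raw in ["203.0.113.7 GET /products\n", "malformed line\n"]:
    run_chain({"message": raw}, chain)
print(len(index), "record(s) loaded")
```

The point of the chain structure is that each step stays small and can be reordered or replaced without touching the others, which is the same property the morphline configuration gives you.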

Apache Flume and the Morphline

Now that we have an empty Solr index, and live log events coming in to our fake access.log, we can use Flume and morphlines to load the index with the real-time log data.

The key player in this tutorial is Flume. Flume is a system for collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store.

With a few simple configuration files, we can use Flume and a morphline (a simple way to accomplish on-the-fly ETL) to load our data into our Solr index.

(Note: You can use Flume to load data into many other types of data stores; Solr is just the example we are using for this tutorial.)

Start the Flume agent by executing the following command:

> flume-ng agent \
    --conf /opt/examples/flume/conf \
    --conf-file /opt/examples/flume/conf/flume.conf \
    --name agent1 \
    -Dflume.root.logger=DEBUG,console

This will start running the Flume agent in the foreground. Once it has started and is processing records, you should see something like:

Now you can go back to the Hue UI, and click 'Search' from the collection's page:

You will be able to search, drill down into, and browse the events that have been indexed.

If one of these steps fails, please reach out to the Discussion Forum and get help. Otherwise, you can start exploring the log data and understand what is going on.

For our story's sake, we pretend that you started indexing data at the same time as you started ingesting it (via Flume) into the platform, so that when your manager escalated the issue, you could immediately drill down into data from the last three days and explore what happened. For example, perhaps you noticed a lot of DDoS events and could take the right measures to preempt the attack. Problem solved! Management is fantastically happy with your recent contributions, which of course leads to a great bonus or something similar. :D

Conclusion:

Now you have learned how to use Cloudera Search to allow exploration of data in real time, using Flume, Solr, and Morphlines. Further, you now understand how you can serve multiple use cases over the same data and, as shown in previous steps, combine multiple data sets to provide bigger insights. The flexibility and multi-workload capability of a Hadoop-based Enterprise Data Hub are some of the core elements that have made Hadoop valuable to organizations worldwide.