Exercise 6: Cloudera Navigator

Discovery

The first thing you see when you log into Cloudera Navigator is a search tool. It's an excellent way to find data on your cluster, even when you don't know exactly what you're looking for. Go ahead and click the link to 'explore your data'.

You know that the old web server log data you analyzed was a Hive table, so select 'Hive' under 'Source Type', and 'Table' under 'Type'. You're also pretty sure it had 'access log' in the name, so enter this search query at the top and hit enter:

*access*log*

When the results appear, you immediately recognize the tokenized_access_logs table. That must be the one you queried!
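If you're curious why that wildcard pattern finds the table, here is a minimal Python sketch of how `*` wildcards behave. It uses shell-style matching from the standard library purely as an illustration; it is not how Navigator runs searches internally. The list of names is hypothetical, apart from the two access log tables used in this tutorial.

```python
from fnmatch import fnmatchcase


def matching_tables(pattern, table_names):
    """Return the names that match a shell-style wildcard pattern."""
    return [name for name in table_names if fnmatchcase(name, pattern)]


# Hypothetical catalog of Hive table names (an assumption for illustration).
tables = [
    "products",
    "intermediate_access_logs",
    "tokenized_access_logs",
    "order_items",
]

# Each '*' stands for any run of characters, including none.
print(matching_tables("*access*log*", tables))
```

Because each `*` can stand for any run of characters, the pattern matches both access log tables without you needing to remember their exact names.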

Lineage

Now that you've found the data you were looking for, click on the table and you'll see a graph of the data's lineage. You'll see the tokenized_access_logs table on the right and the underlying file with the same name in HDFS in blue. You'll also see the other Hive table you created from the original file and the query you ran to transform the data between the two. (The different colors represent different source types: yellow data comes from Hive, blue data comes directly from HDFS.)

As you click on the nodes in this graph, more detail will appear. If you click on the tokenized_access_logs table and the intermediate_access_logs table, you'll see arrows for each individual field running through that query. You can see how quickly you could trace the origin of datasets even in a much busier and more complicated environment!
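Conceptually, tracing lineage is a walk backwards through a graph of "derived from" links. The sketch below shows that idea in Python under stated assumptions: the graph is a plain dictionary, it has no cycles, and the node labels `query:tokenize` and `hdfs:original_access_logs` are hypothetical names chosen for illustration. It does not use Navigator's own data model or API.

```python
def trace_origins(node, upstream):
    """Return the set of root ancestors of a node.

    `upstream` maps each node to the list of nodes it was derived from.
    A node with no recorded parents is treated as an origin.
    Assumes the graph is acyclic.
    """
    parents = upstream.get(node, [])
    if not parents:
        return {node}
    origins = set()
    for parent in parents:
        origins |= trace_origins(parent, upstream)
    return origins


# Hypothetical lineage, loosely modeled on the graph described above.
upstream = {
    "hive:tokenized_access_logs": ["query:tokenize"],
    "query:tokenize": ["hive:intermediate_access_logs"],
    "hive:intermediate_access_logs": ["hdfs:original_access_logs"],
}

print(trace_origins("hive:tokenized_access_logs", upstream))
```

In a busier environment the same walk would simply fan out over more parents, which is why a lineage graph scales better than tracing queries by hand.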

Auditing

Now you've shown where the data came from, but we still need to show what's been done with it. Go to the 'Audits' tab, using the link in the top-right corner.

As you can see, there are hundreds of events that have been recorded, each with details of what was done, by whom, and when. Let's narrow down what we're looking for again. Open the "Filters" menu from below the "Audit Events" heading.

Click the + icon twice to add two new filters. For the first filter, set the property to 'Username' and fill in 'admin' as the value. For the second filter, set the property to 'Operation' and fill in 'QUERY' as the value. Then click 'Apply'.

As you click on the individual results, you can see the exact queries that were executed and all related details.
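The two filters combine with AND: an event is shown only if it matches both. As a rough illustration, here is a small Python sketch that applies the same logic to a list of event records. The records and their field names (`username`, `operation`, `service`) are hypothetical and are not Navigator's actual audit schema.

```python
def filter_events(events, **criteria):
    """Keep only the events whose fields equal every given criterion."""
    return [
        event
        for event in events
        if all(event.get(field) == value for field, value in criteria.items())
    ]


# Hypothetical audit events (made up for illustration).
events = [
    {"username": "admin", "operation": "QUERY", "service": "hive"},
    {"username": "hdfs", "operation": "open", "service": "hdfs"},
    {"username": "admin", "operation": "listStatus", "service": "hdfs"},
]

# Equivalent of Username = admin AND Operation = QUERY.
for event in filter_events(events, username="admin", operation="QUERY"):
    print(event)
```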

You can also view and create reports based on the results of these searches, using the links in the left-hand corner. There's already a report called "Recent Denied Accesses". If you check that report now, you may see that in the course of this tutorial, some tools have tried to access a directory called '/user/anonymous' that we haven't set up, and that the services don't have permission to create.

Policies

It's a relief to be able to audit access to your cluster and see there's no unexpected or unauthorized activity going on. But wouldn't it be even better if you could automatically apply policies to data? Let's open the 'Policies' tab in the top-right corner and create a policy to make insecure data easier to find in the future.

Click the + icon to add a new policy, and name it "Tag Insecure Data". Check the box to enable the policy, and enter the following as the search query:

(permissions:"rwxrwxrwx") AND (sourceType:hdfs) AND (type:file OR type:directory) AND (deleted:false)

This query will detect any files or directories in HDFS that allow anyone to read, write, and execute them. It's common for people to set these permissions to make sure everything works, but your organization may want to refine this practice as you move into production or implement stricter practices for some data sets.
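To make the query's logic concrete, here is a minimal Python sketch. The first function renders numeric permission bits (such as octal 777) as the familiar `rwxrwxrwx` string; the second applies the same four conditions as the query to a metadata record. The record layout is an assumption that simply mirrors the field names in the query, and the example paths are hypothetical; this is not Navigator's API.

```python
def permission_string(mode):
    """Render the low nine permission bits as an 'rwxrwxrwx'-style string."""
    flags = "rwxrwxrwx"
    return "".join(
        flag if mode & (1 << (8 - position)) else "-"
        for position, flag in enumerate(flags)
    )


def matches_policy(entity):
    """Mirror the four AND-ed conditions of the search query above."""
    return (
        entity["permissions"] == "rwxrwxrwx"
        and entity["sourceType"] == "hdfs"
        and entity["type"] in ("file", "directory")
        and not entity["deleted"]
    )


# Hypothetical metadata records (paths and values are made up).
entities = [
    {"path": "/tmp/scratch", "permissions": permission_string(0o777),
     "sourceType": "hdfs", "type": "directory", "deleted": False},
    {"path": "/data/report.csv", "permissions": permission_string(0o644),
     "sourceType": "hdfs", "type": "file", "deleted": False},
]

for entity in entities:
    if matches_policy(entity):
        print("insecure:", entity["path"])
```

Only the first record is flagged, because octal 777 grants read, write, and execute to the owner, the group, and everyone else.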

To apply this tag to existing data, set the schedule to "Immediate", and check the box "Assign Metadata". Under tags, enter "insecure", and then click "Add Tag". Save the policy.

Return to the search window and search for "insecure", and you will immediately see all files that are in violation of this new policy.

If you would like to automatically apply this tag to data as it changes, return to the policies tab and edit the policy's schedule to be "On Data Change". Then the tag will be applied to any file that is assigned these permissions in the future.

Conclusion

You've now experienced how to use Cloudera Navigator for discovery of data and metadata. This powerful tool makes it easy to audit access, trace data lineage, and enforce policies.

With more data and more data formats available in a multi-tenant environment, data lineage and governance are becoming more challenging. Cloudera Navigator provides enterprise-grade governance that's built into the foundation of Apache Hadoop.

You can learn more about the various management features provided by Cloudera Manager in the Cloudera Administrator Training for Apache Hadoop.