How-tos

Getting-started advice on a range of Apache Hadoop topics


See below for a selection of how-tos that are known to be popular with users. (Updated Oct. 4, 2013)

| How-to | Technologies | Description |
| --- | --- | --- |
| Run a MapReduce Job in CDH4 | Hadoop, Java | Write, compile, and run a simple MapReduce job on Hadoop. |
| Analyze Twitter Data with Apache Hadoop | Hadoop, Hive, Flume, Oozie | First in a series about designing an end-to-end data pipeline that will enable you to analyze Twitter data. |
| Analyze Twitter Data with Hue | Hadoop, Hue | Based on the sample app above; details how the same results can be achieved more simply through Hue. |
| Include Third-Party Libraries in Your MapReduce Job | MapReduce, Java | Learn the best practice for including third-party libraries in your map/reduce task attempts. |
| Create a CDH Cluster on Amazon EC2 via Cloudera Manager | Hadoop, Cloudera Manager, EC2 | Set up a fully configured CDH cluster on EC2 from scratch in less than 15 minutes. |
| Use Apache ZooKeeper to Build Distributed Apps (and Why) | ZooKeeper | Learn to use ZooKeeper to confidently assess the state of your data, and coordinate your cluster the right way. |
| Use a SerDe in Apache Hive | Hive | The SerDe interface allows you to instruct Hive as to how a record should be processed. |
| Configure Eclipse for Hadoop Contributions | Hadoop, Eclipse, Java | This how-to covers configuring Eclipse to modify Hadoop’s source. |
| Resample from a Large Data Set in Parallel (with R on Hadoop) | Hadoop, R | Implement a Poisson approximation to enable you to train a random forest on an enormous data set. |
| Use the Apache HBase REST Interface | HBase, REST | Part 1 in a series explaining how to use HBase's REST interface. |
| Schedule Recurring Hadoop Jobs with Apache Oozie | MapReduce, Oozie | Review a simple Oozie coordinator job, and learn how to schedule a recurring job in Hadoop. |
| Set Up Cloudera Manager for Apache Hive | Hive, Cloudera Manager | Set up a Hive server for use with Cloudera Manager. |
| Use Vagrant to Set Up a Virtual Hadoop Cluster | Hadoop, Vagrant, Virtualization | Set up a virtual multi-node Hadoop environment on your desktop for testing. |
| Do Apache Flume Performance Tuning | Hadoop, Flume | Flume concepts that come into play when tuning your Flume flows for maximum performance. |
| Select the Right Hardware for Your New Hadoop Cluster | Hadoop, Hardware | Learn some of the principles of workload evaluation and the critical role it plays in hardware selection. |
| Use Eclipse with MapReduce in Cloudera’s QuickStart VM | MapReduce, Eclipse, Java, Virtualization | How do you create a MapReduce project in Eclipse and then debug it? Here's how. |
| Set Up a Hadoop Cluster with Network Encryption | Hadoop, Security | Step-by-step instructions to help you set up a Hadoop cluster with network encryption (and why). |
| Automate Your Hadoop Cluster from Java | Hadoop, Cloudera Manager, Java | Automate cluster deployment using the Cloudera Manager API. |