Popular How-tos

Getting-started advice on a range of Apache Hadoop topics


See below for a selection of how-tos that are known to be popular with users. (Updated May 14, 2015)

| How-to: | Technologies | Description |
| --- | --- | --- |
| How-to: Translate from MapReduce to Apache Spark | Hadoop, MapReduce, Spark | The key to getting the most out of Spark is to understand the differences between its RDD API and the original Mapper and Reducer API. |
| How-to: Create a Simple Hadoop Cluster with VirtualBox | Hadoop, VMs | Set up a CDH-based Hadoop cluster in less than an hour using VirtualBox and Cloudera Manager. |
| How-to: Use IPython Notebook with Apache Spark | Spark, Python | IPython Notebook and Spark’s Python API are a powerful combination for data science. |
| Select the Right Hardware for Your New Hadoop Cluster | Hadoop, Hardware | One of the first questions Cloudera customers raise when getting started with Apache Hadoop is how to select appropriate hardware for their new Hadoop clusters. |
| Run a Simple Apache Spark App in CDH 5 | Hadoop, Spark | Getting started with Spark (now shipping inside CDH 5) is easy using this simple example. |
| Use Apache ZooKeeper to Build Distributed Apps (and Why) | Hadoop, HBase, ZooKeeper | Learn how you can use ZooKeeper to easily and safely implement important features in your distributed software. |
| Use HBase Bulk Loading, and Why | HBase | Introduces the basic concepts of the bulk loading feature, presents two use cases, and proposes two examples. |
| Analyze Twitter Data with Apache Hadoop | Hadoop, Hive, Flume, Oozie | First in a series about designing an end-to-end data pipeline that will enable you to analyze Twitter data. |
| Analyze Twitter Data with Hue | Hadoop, Hue | Based on the sample app above; details how the same results can be achieved through Hue in a simpler way. |
| Include Third-Party Libraries in Your MapReduce Job | MapReduce, Java | Learn the best practice for including third-party libraries in your map/reduce task attempts. |
| Use Apache ZooKeeper to Build Distributed Apps (and Why) | ZooKeeper | Learn to use ZooKeeper to confidently assess the state of your data, and coordinate your cluster the right way. |
| Use a SerDe in Apache Hive | Hive | The SerDe interface allows you to instruct Hive as to how a record should be processed. |
| Configure Eclipse for Hadoop Contributions | Hadoop, Eclipse, Java | This how-to covers configuring Eclipse to modify Hadoop’s source. |
| Resample from a Large Data Set in Parallel (with R on Hadoop) | Hadoop, R | Implement a Poisson approximation to enable you to train a random forest on an enormous data set. |
| Use the Apache HBase REST Interface | HBase, REST | Part 1 in a series explaining how to use HBase's REST interface. |
| Schedule Recurring Hadoop Jobs with Apache Oozie | MapReduce, Oozie | Review a simple Oozie coordinator job, and learn how to schedule a recurring job in Hadoop. |
| Set Up Cloudera Manager for Apache Hive | Hive, Cloudera Manager | Set up a Hive server for use with Cloudera Manager. |
| Use Vagrant to Set Up a Virtual Hadoop Cluster | Hadoop, Vagrant, Virtualization | Set up a virtual multi-node Hadoop environment on your desktop for testing. |
| Do Apache Flume Performance Tuning | Hadoop, Flume | Flume concepts that come into play when tuning your Flume flows for maximum performance. |
| Use Eclipse with MapReduce in Cloudera’s QuickStart VM | MapReduce, Eclipse, Java, Virtualization | How do you create a MapReduce project in Eclipse and then debug it? Here's how. |
