Developer Resources: Data Processing & Analytics

Storage & Persistence


HDFS is the primary distributed storage used by Hadoop applications. An HDFS cluster consists primarily of a NameNode, which manages the file system metadata, and DataNodes, which store the actual data.
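That metadata/data split can be illustrated with a toy, single-process sketch (this is our own illustration, not the HDFS protocol or API): the NameNode records which blocks make up each file and where they live; the DataNodes hold only block bytes.

```python
# Toy sketch of the HDFS metadata split (illustrative only, not real HDFS):
# the NameNode maps files to blocks and blocks to DataNodes; DataNodes
# store the block contents themselves.

BLOCK_SIZE = 8  # real HDFS defaults to tens of megabytes; tiny here

class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}          # block_id -> bytes

class NameNode:
    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.files = {}           # path -> [block_id, ...]
        self.locations = {}       # block_id -> DataNode
        self._next_id = 0

    def write(self, path, data):
        block_ids = []
        for i in range(0, len(data), BLOCK_SIZE):
            block_id = self._next_id
            self._next_id += 1
            # round-robin placement; real HDFS also replicates each block
            node = self.datanodes[block_id % len(self.datanodes)]
            node.blocks[block_id] = data[i:i + BLOCK_SIZE]
            self.locations[block_id] = node
            block_ids.append(block_id)
        self.files[path] = block_ids

    def read(self, path):
        # a client asks the NameNode for locations, then fetches the
        # block bytes from the DataNodes
        return b"".join(self.locations[b].blocks[b] for b in self.files[path])

nn = NameNode([DataNode("dn1"), DataNode("dn2"), DataNode("dn3")])
nn.write("/logs/app.log", b"hello distributed file system")
```

Note the NameNode never touches file contents; it only answers "which blocks, on which nodes", which is why it can manage a very large cluster from memory.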

Distributed Big Data Store

Apache HBase is an open-source, distributed, versioned, column-family store modeled after Google's Bigtable. Just as Bigtable leverages the distributed data storage provided by the Google File System/Colossus, HBase provides Bigtable-like capabilities on top of HDFS. Apache ZooKeeper provides the distributed coordination service that HBase relies on.
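The column-family model can be pictured as a nested, versioned map (a toy sketch of the data model only, not the HBase client API): row key, then column family, then qualifier, then timestamped versions, with reads returning the newest version by default.

```python
# Toy sketch of HBase's logical data model (not the HBase API):
# row -> column family -> qualifier -> [(timestamp, value), ...]
class Table:
    def __init__(self):
        self.rows = {}

    def put(self, row, family, qualifier, value, ts):
        cells = (self.rows.setdefault(row, {})
                          .setdefault(family, {})
                          .setdefault(qualifier, []))
        cells.append((ts, value))
        cells.sort()              # keep cell versions ordered by timestamp

    def get(self, row, family, qualifier):
        cells = self.rows[row][family][qualifier]
        return cells[-1][1]       # newest version wins by default

t = Table()
t.put("user1", "info", "email", "old@example.com", ts=1)
t.put("user1", "info", "email", "new@example.com", ts=2)
```

Keeping multiple timestamped versions per cell is what the "versioned" in the description above refers to; real HBase bounds how many versions a column family retains.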

Data Serialization

Apache Avro is a data serialization framework for Hadoop; it uses JSON for defining data types and protocols, and serializes data in a compact binary format. Schema information can be sent along with the data or maintained separately.
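For example, a record type is declared as a JSON schema (this `User` record is hypothetical); Avro then serializes matching records in its compact binary encoding, and readers use the schema to decode them:

```json
{
  "type": "record",
  "name": "User",
  "namespace": "example.avro",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age",  "type": ["null", "int"], "default": null}
  ]
}
```

The union type `["null", "int"]` is the standard Avro idiom for an optional field.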

Transformation & Enrichment

Native MapReduce APIs

There is a choice of out-of-the-box APIs, including the native MapReduce API (Java), Hadoop Pipes (an API for C++), and Hadoop Streaming (other languages that read stdin and write stdout; often used with Python frameworks and shell scripts).
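With Hadoop Streaming, a mapper and reducer are just programs that read lines on stdin and emit tab-separated key/value pairs on stdout; the framework sorts mapper output by key before the reducer sees it. A word-count sketch in Python (file name and invocation details are illustrative):

```python
#!/usr/bin/env python
# Word count in the Hadoop Streaming style: the mapper emits "word<TAB>1"
# per word; the framework sorts by key; the reducer sums counts per word.
import sys
from itertools import groupby

def mapper(lines):
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    pairs = (line.rsplit("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__":
    stage = mapper if (len(sys.argv) < 2 or sys.argv[1] == "map") else reducer
    for out in stage(sys.stdin):
        print(out)
```

A job would then be launched with something like `hadoop jar hadoop-streaming.jar -input in -output out -mapper ... -reducer ...`; the exact jar location depends on the distribution.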

MapReduce Frameworks and Abstractions

As alternatives to programming against MapReduce directly, there are several high-level frameworks that abstract MapReduce constructs, including Apache Pig (high-level data flow language), Apache Hive (SQL layer), Apache Crunch (incubating; a Java library for data pipeline operations), and Cascading (a framework for JVM languages).
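To illustrate the level of abstraction these frameworks offer, the classic word count collapses to a few lines of Pig Latin data-flow script (paths and alias names here are illustrative):

```
-- Word count in Pig Latin (illustrative paths)
lines  = LOAD '/data/input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words);
STORE counts INTO '/data/wordcount';
```

Pig compiles this script into one or more MapReduce jobs, so the author never writes mapper or reducer classes directly.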

Workflow Coordination: Apache Oozie

Apache Oozie is a tool for scheduling and coordinating workflow across Hadoop jobs (run via MapReduce API, Sqoop, Pig, Hive, etc.).
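A workflow is declared in XML as a directed graph of actions. A minimal sketch with a single Pig action (workflow name, script name, and property names are hypothetical):

```xml
<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.4">
  <start to="transform"/>
  <action name="transform">
    <pig>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>wordcount.pig</script>
    </pig>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Pig step failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
```

Each action declares both a success transition (`ok`) and a failure transition (`error`), which is how Oozie expresses control flow across heterogeneous job types.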

Web UI: Hue

Hue is an open source, extensible, web-based interface for Hadoop. It features a file browser for HDFS, an Oozie application for creating workflows and coordinators, a job designer/browser for MapReduce, a Hive and Impala UI, a Shell, a collection of Hadoop APIs, and more.


SQL Query

Data stored in HDFS or HBase can be queried with SQL via Hive (in batch fashion; see the "Transformation & Enrichment" section) or Cloudera Impala (in real time).
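Because both engines speak HiveQL, the same statement can run as a batch Hive job or an interactive Impala query. A sketch against a hypothetical `web_logs` table:

```sql
-- Hypothetical table; the same query works in Hive (batch) and Impala
SELECT page, COUNT(*) AS hits
FROM   web_logs
WHERE  log_date = '2013-06-01'
GROUP  BY page
ORDER  BY hits DESC
LIMIT  10;
```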


Full-Text Search

Cloudera Search (currently in beta) brings full-text, interactive search and scalable indexing to Apache Hadoop, so non-technical users can run Google-style queries against terabytes of data.

Advanced Analytics

A range of data mining and statistical modeling libraries is available: Apache Mahout, a machine-learning library, and Apache DataFu (incubating), a collection of user-defined functions for working with large-scale data in Hadoop and Pig. SAS remains a popular commercial tool for Hadoop-based analytics, and the open-source R language is also available.

Most recently, Apache Spark has emerged as a fast, highly parallel engine for advanced analytics on HDFS data.
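Part of Spark's speed comes from its programming model: transformations such as map and filter are recorded lazily and only execute when an action forces evaluation, letting the engine plan the whole pipeline (and cache intermediate data in memory). A single-machine toy sketch of that evaluation model (`MiniRDD` is our own illustration, not a Spark class):

```python
# Toy sketch of Spark's lazy-evaluation model (not the Spark API):
# transformations are recorded; an action like collect() runs them.
class MiniRDD:
    def __init__(self, data):
        self._data = data
        self._ops = []            # deferred transformations, in order

    def map(self, f):
        self._ops.append(("map", f))
        return self

    def filter(self, f):
        self._ops.append(("filter", f))
        return self

    def collect(self):            # action: execute the recorded pipeline
        out = self._data
        for kind, f in self._ops:
            if kind == "map":
                out = [f(x) for x in out]
            else:
                out = [x for x in out if f(x)]
        return out

rdd = MiniRDD(range(10)).filter(lambda x: x % 2 == 0).map(lambda x: x * x)
```

Real Spark additionally partitions the data across the cluster and tracks lineage so lost partitions can be recomputed.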