Developer Resources: Data Processing & Analytics
On this page you'll get an introduction to the major data processing and analysis functions of Apache Hadoop.
> Collection & Ingestion (Data Ingest, RDBMS Connectivity)
> Storage & Persistence (Filesystem, Distributed Big Data Store, Serialization)
> Transformation & Enrichment (MapReduce APIs & Frameworks, Workflow Coordination, Web UI)
> Analysis (Batch/Real-Time SQL Query, Advanced Analytics)
A simple but illustrative end-to-end use case involving all these functions is described in the "Analyzing Twitter Data with Apache Hadoop" series of how-tos. You may also want to view this video overview.
We also highly recommend Tom White's book, Hadoop: The Definitive Guide.
Collection & Ingestion
Data Ingestion
Streaming of data is done via Apache Flume.
- Apache Flume – Architecture of Flume NG
- Analyzing Twitter Data with Apache Hadoop, Part 2: Gathering Data with Flume
- Flume NG Getting Started Guide
Data Connectivity to RDBMS
Data migration from relational data stores is done via Apache Sqoop.
Storage & Persistence
Filesystem
HDFS is the primary distributed storage used by Hadoop applications. A HDFS cluster primarily consists of a NameNode that manages the file system metadata and DataNodes that store the actual data.
- HDFS User Guide
- HDFS Federation
- CDH4 High Availability Guide
Distributed Big Data Store
Apache HBase is an open-source, distributed, versioned, column-family store modeled after Google's Bigtable. Just as Bigtable leverages the distributed data storage provided by the Google File System/Colossus, HBase provides Bigtable-like capabilities on top of HDFS. Apache ZooKeeper is a server coordination service for HBase.
- Apache HBase Do’s and Don’ts
- Official HBase FAQ
- Official HBase Reference Guide
- Streaming Data into Apache HBase using Apache Flume
- Presentation: Using HBase Effectively
- HBase, The Definitive Guide - by Lars George
- ZooKeeper Overview
Data Serialization
Apache Avro is a data serialization framework for Hadoop; it uses JSON for defining data types and protocols, and serializes data in a compact binary format. Schema information can be sent along with the data or maintained separately.
Transformation & Enrichment
Native MapReduce APIs
There are a choice of out-of-the-box APIs available, including the native MapReduce API (Java), Hadoop Pipes (an API for C++), and Hadoop Streaming (other languages; often used with Python frameworks and shell scripts).
- Online Training: Introduction to MapReduce and HDFS
- HDFS User Guide
- Top 10 MapReduce Tips
- Write and Run Your First MapReduce Job
- Understanding MapReduce via Boggle
- A Guide to Python Frameworks for Hadoop
- Writing a Hadoop MapReduce Program In Python
- Pipes tutorial
MapReduce Frameworks and Abstractions
As alternatives to programming against MapReduce directly, there are several high-level frameworks that abstract MapReduce constructs, including Apache Pig (high-level data flow language), Apache Hive (SQL layer), Apache Crunch (incubating; a Java library for data pipeline operations), and Cascading (a framework for JVM languages).
- Official Pig Tutorial
- Online Training: Introduction to Pig
- How-to: Process a Million Songs with Apache Pig
- Official Hive Getting Started Guide
- Online Training: Introduction to Hive
- Querying Semi-structured Data with Apache Hive
- Apache Crunch: A Java Library for Easier MapReduce Programming
- Cascading for the Impatient
Workflow Coordination: Apache Oozie
Apache Oozie is a tool for scheduling and coordinating workflow across Hadoop jobs (run via MapReduce API, Sqoop, Pig, Hive, etc.).
- Introduction to Apache Oozie
- Installing and Running Apache Oozie (via Siddharth Anand)
Web UI: Hue
Hue is an open source, extensible, web-based interface for Hadoop. It features a file browser for HDFS, an Oozie application for creating workflows and coordinators, a job designer/browser for MapReduce, a Hive and Impala UI, a Shell, a collection of Hadoop APIs, and more.
Analysis
SQL Query
Data in stored in HDFS or HBase can be queried in SQL via Hive (in batch fashion, see "Transformation" section) or Cloudera Impala (in real time).
- Cloudera Impala: Real-Time Queries in Apache Hadoop, For Real
- Querying Semi-structured Data with Apache Hive
Advanced Analytics
A range of data mining/statistical modeling libraries are available in the form of Apache Mahout, a machine-learning library, and DataFu, a collection of user-defined functions for working with large-scale data in Hadoop and Pig. SAS is also a very popular tool for Hadoop-based analytics, with the open-source R language also being available.
- Official Mahout Quickstart
- First Steps with Mahout (via Brian McCallister)
- Cloudera ML: Open Source Libraries and Tools for Data Scientists
- Introducing DataFu: an open source collection of useful Apache Pig UDFs
- SAS In-Memory Analytics
- RHadoop: R packages for managing/analyzing Hadoop data
- Using R with Hadoop Streaming