Installing CDH4 Components

You can install and run the following components with CDH4:

  • Flume — A distributed, reliable, and available service for efficiently moving large amounts of data as the data is produced. It provides a scalable conduit for shipping data around a cluster and concentrates on reliable logging. The primary use case is gathering log files from every machine in a cluster and aggregating them in a centralized persistent store such as HDFS (example below).
  • Sqoop — A tool that imports data from relational databases into Hadoop clusters. Using JDBC to interface with databases, Sqoop imports the contents of tables into the Hadoop Distributed File System (HDFS) and generates Java classes that enable users to interpret the table's schema. Sqoop can also export records from HDFS to a relational database (example below).
  • Sqoop 2 — A server-based tool for transferring data between Hadoop and relational databases. You can use Sqoop 2 to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data with Hadoop MapReduce, and then export it back into an RDBMS.
  • HCatalog — A tool that provides table data access for CDH components such as Pig and MapReduce (example below).
  • Hue — A graphical user interface for working with CDH. Hue collects several applications into a desktop-like environment, delivered as a Web application that requires no client installation by individual users.
  • Pig — Enables you to analyze large amounts of data using Pig Latin, Pig's query language. Pig Latin queries run in a distributed way on a Hadoop cluster (example below).
  • Hive — A powerful data warehousing application built on top of Hadoop that enables you to access your data using HiveQL, a language similar to SQL (example below).
  • HBase — Provides large-scale tabular storage for Hadoop using the Hadoop Distributed File System (HDFS). Cloudera recommends installing HBase in standalone mode before you try to run it on a whole cluster (example below).
  • ZooKeeper — A highly reliable and available service that provides coordination between distributed processes (example below).
  • Oozie — A server-based workflow engine specialized in running workflow jobs with actions that execute Hadoop jobs. A command line client is also available that allows remote administration and management of workflows within the Oozie server (example below).
  • Whirr — A set of libraries and a command-line tool that provide a fast way to run cloud services such as Hadoop clusters (example below).
  • Snappy — A compression/decompression library. You do not need to install Snappy if you are already using the native library, but you do need to configure it; see Snappy Installation for more information (example below).
  • Mahout — A machine-learning tool that provides machine-learning libraries scalable to "reasonably large" datasets, with the aim of making it easier and faster to build intelligent applications.
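
The sketches below give a quick feel for several of these components. They are illustrative only: host names, paths, database names, table names, and agent names are placeholders, not defaults shipped with CDH.

A minimal Flume agent (Flume 1.x, as shipped in CDH4) might tail an application log into HDFS. The agent name, log path, and NameNode URI are hypothetical:

    # agent1.conf -- all names, paths, and the HDFS URI are placeholders
    agent1.sources = src1
    agent1.channels = ch1
    agent1.sinks = sink1
    agent1.sources.src1.type = exec
    agent1.sources.src1.command = tail -F /var/log/app.log
    agent1.sources.src1.channels = ch1
    agent1.channels.ch1.type = memory
    agent1.sinks.sink1.type = hdfs
    agent1.sinks.sink1.hdfs.path = hdfs://namenode:8020/flume/logs
    agent1.sinks.sink1.channel = ch1

    $ flume-ng agent --conf /etc/flume-ng/conf --conf-file agent1.conf --name agent1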
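
A typical Sqoop import pulls one table over JDBC into HDFS. The connect string, credentials, table, and target directory here are placeholders:

    $ sqoop import --connect jdbc:mysql://db.example.com/sales \
        --username myuser --password mypass \
        --table orders --target-dir /user/myuser/orders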
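
With HCatalog, Pig can load a table by name instead of by HDFS path. This sketch assumes a table called mytable already exists in the metastore; the loader class shown uses the org.apache.hcatalog namespace of HCatalog releases from the CDH4 era:

    $ pig -useHCatalog -e "A = LOAD 'mytable' USING org.apache.hcatalog.pig.HCatLoader(); DUMP A;"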
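
A Pig Latin sketch that filters hypothetical tab-separated log records; the input path and schema are made up for illustration:

    $ pig -e "logs = LOAD '/user/myuser/logs' AS (level:chararray, msg:chararray);
              errs = FILTER logs BY level == 'ERROR';
              DUMP errs;"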
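
A HiveQL sketch run through the Hive command-line client; the table and columns are hypothetical:

    $ hive -e "CREATE TABLE pageviews (url STRING, hits INT);
               SELECT url, SUM(hits) FROM pageviews GROUP BY url;"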
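
In the HBase shell, tables are created with named column families and rows are addressed by key; the table, column family, and row key below are placeholders:

    $ hbase shell
    hbase> create 't1', 'cf1'
    hbase> put 't1', 'row1', 'cf1:a', 'value1'
    hbase> get 't1', 'row1'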
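
ZooKeeper's coordination primitives can be explored from the client shell (packaged in CDH as zookeeper-client); the znode name and data are arbitrary:

    $ zookeeper-client -server localhost:2181
    create /myapp config-data
    get /myapp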
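
Workflow jobs are submitted to the Oozie server with the command-line client. The server URL and properties file are placeholders, and <job-id> stands for whatever ID the submit call returns:

    $ oozie job -oozie http://oozieserver:11000/oozie -config job.properties -run
    $ oozie job -oozie http://oozieserver:11000/oozie -info <job-id>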
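
Whirr launches and tears down a cluster described by a properties file; hadoop.properties here is a hypothetical cluster definition:

    $ whirr launch-cluster --config hadoop.properties
    $ whirr destroy-cluster --config hadoop.properties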
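
Once the native library is available, Snappy is enabled through configuration rather than code. For instance, an MRv1 job can compress map output by setting two properties on the command line; the examples jar path shown is the usual CDH4 MRv1 location, and the input and output directories are placeholders:

    $ hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar wordcount \
        -Dmapred.compress.map.output=true \
        -Dmapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
        input output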

To install the CDH4 components, see the following sections:

  • Flume. For more information, see "Flume Installation" in this guide.
  • Sqoop. For more information, see "Sqoop Installation" in this guide.
  • Sqoop 2. For more information, see "Sqoop 2 Installation" in this guide.
  • HCatalog. For more information, see "Installing and Using HCatalog" in this guide.
  • Hue. For more information, see "Hue Installation" in this guide.
  • Pig. For more information, see "Pig Installation" in this guide.
  • Oozie. For more information, see "Oozie Installation" in this guide.
  • Hive. For more information, see "Hive Installation" in this guide.
  • HBase. For more information, see "HBase Installation" in this guide.
  • ZooKeeper. For more information, see "ZooKeeper Installation" in this guide.
  • Whirr. For more information, see "Whirr Installation" in this guide.
  • Snappy. For more information, see "Snappy Installation" in this guide.
  • Mahout. For more information, see "Mahout Installation" in this guide.