CDH 5.4.0

Cloudera’s 100% Open Source Hadoop Platform

CDH is Cloudera's open source software distribution and consists of Apache Hadoop and additional key open source projects to ensure you get the most out of Hadoop and your data.

It is the only Hadoop solution to offer unified querying options (including batch processing, interactive SQL, text search, and machine learning) and necessary enterprise security features (such as role-based access controls).

Please note: CDH requires manual installation from the command line.
For a faster, automated installation, download Cloudera Manager.

Installation

This section introduces options for installing Cloudera Manager, CDH, and managed services. You can install:
  • Cloudera Manager, CDH, and managed services in a Cloudera Manager deployment. This is the recommended method for installing CDH and managed services.
  • CDH 5 into an unmanaged deployment.

 

Cloudera Manager Deployment

A Cloudera Manager deployment consists of the following software components:
  • Oracle JDK
  • Cloudera Manager Server and Agent packages
  • Supporting database software
  • CDH and managed service software
This section describes the three main installation paths for creating a new Cloudera Manager deployment and the criteria for choosing an installation path. If your cluster already has an installation of a previous version of Cloudera Manager, follow the instructions in Upgrading Cloudera Manager.
The Cloudera Manager installation paths share some common phases, but the variant aspects of each path support different user and cluster host requirements:
  • Demonstration and proof of concept deployments - There are two installation options:
    • Installation Path A - Automated Installation by Cloudera Manager - Cloudera Manager automates the installation of the Oracle JDK, Cloudera Manager Server, embedded PostgreSQL database, Cloudera Manager Agent, CDH, and managed service software on cluster hosts, and configures databases for the Cloudera Manager Server and Hive Metastore, and optionally for Cloudera Management Service roles. This path is recommended for demonstration and proof of concept deployments, but not for production deployments, because it is not intended to scale and may require database migration as your cluster grows. To use this method, server and cluster hosts must satisfy the following requirements:
      • Provide the ability to log in to the Cloudera Manager Server host using a root account or an account that has password-less sudo permission.
      • Allow the Cloudera Manager Server host to have uniform SSH access on the same port to all hosts. See Networking and Security Requirements for further information.
      • All hosts must have access to standard package repositories and either archive.cloudera.com or a local repository with the necessary installation files.
    • Installation Path B - Manual Installation Using Cloudera Manager Packages - you install the Oracle JDK, Cloudera Manager Server, and embedded PostgreSQL database packages on the Cloudera Manager Server host. You have two options for installing the Oracle JDK, Cloudera Manager Agent, CDH, and managed service software on cluster hosts: install it manually yourself, or use Cloudera Manager to automate installation. For Cloudera Manager to automate installation of Cloudera Manager Agent packages or CDH and managed service software, cluster hosts must satisfy the following requirements:
      • Allow the Cloudera Manager Server host to have uniform SSH access on the same port to all hosts. See Networking and Security Requirements for further information.
      • All hosts must have access to standard package repositories and either archive.cloudera.com or a local repository with the necessary installation files.
  • Production deployments - require you to first manually install and configure a production database for the Cloudera Manager Server and Hive Metastore. There are two installation options:
    • Installation Path B - Manual Installation Using Cloudera Manager Packages - you install the Oracle JDK and Cloudera Manager Server packages on the Cloudera Manager Server host. You have two options for installing the Oracle JDK, Cloudera Manager Agent, CDH, and managed service software on cluster hosts: install it manually yourself, or use Cloudera Manager to automate installation. For Cloudera Manager to automate installation of Cloudera Manager Agent packages or CDH and managed service software, cluster hosts must satisfy the following requirements:
      • Allow the Cloudera Manager Server host to have uniform SSH access on the same port to all hosts. See Networking and Security Requirements for further information.
      • All hosts must have access to standard package repositories and either archive.cloudera.com or a local repository with the necessary installation files.
    • Installation Path C - Manual Installation Using Cloudera Manager Tarballs - you install the Oracle JDK, Cloudera Manager Server, and Cloudera Manager Agent software as tarballs and use Cloudera Manager to automate installation of CDH and managed service software as parcels.

Unmanaged Deployment

In an unmanaged deployment, you are responsible for managing all phases of the life cycle of CDH and managed service components on each host: installation, configuration, and service life cycle operations such as start and stop. This section describes alternatives for installing CDH 5 software in an unmanaged deployment.

  • Command-line methods:
    • Download and install the CDH 5 "1-click Install" package
    • Add the CDH 5 repository
    • Build your own CDH 5 repository
    If you use one of these command-line methods, the first (downloading and installing the "1-click Install" package) is recommended in most cases because it is simpler than building or adding a repository. See Installing the Latest CDH 5 Release for detailed instructions for each of these options.
  • Tarball - You can download a tarball from CDH downloads. Keep the following points in mind:
    • Installing CDH 5 from a tarball installs YARN.
    • In CDH 5, there is no separate tarball for MRv1. Instead, the MRv1 binaries, examples, etc., are delivered in the Hadoop tarball. The scripts for running MRv1 are in the bin-mapreduce1 directory in the tarball, and the MRv1 examples are in the examples-mapreduce1 directory.

What's New in CDH 5.4.0

  Important:
Upgrading to CDH 5.4.0 and later from any earlier release requires an HDFS metadata upgrade. Be careful to follow all of the upgrade steps as instructed.

For the latest Impala features, see New Features in Impala Version 2.2.0 / CDH 5.4.0.

Operating System Support

CDH 5.4.0 adds support for RHEL and CentOS 6.6. See CDH 5 Requirements and Supported Versions.

Security

The following summarizes new security capabilities in CDH 5.4.0:

  • Secure Hue impersonation support for the Hue HBase application.
  • Redaction of sensitive data from logs, centrally managed by Cloudera Manager, which prevents the WHERE clause in queries from leaking sensitive data into logs and management UIs.
  • Cloudera Manager support for custom Kerberos principals.
  • Kerberos support for Sqoop 2.
  • Kerberos and TLS/SSL support for Flume Thrift source and sink.
  • Navigator SAML support (requires Cloudera Manager).
  • Navigator Key Trustee can now be installed and monitored by Cloudera Manager.
  • Search can be configured to use SSL.
  • Search supports protecting Solr and Lily HBase Indexer metadata using ZooKeeper ACLs in a Kerberos-enabled environment.

Apache Crunch

New HBase-related features:
  • HBaseTypes.cells() was added to support serializing HBase Cell objects.
  • All of the HFileUtils methods now support PCollection<C extends Cell>, which includes both PCollection<KeyValue> and PCollection<Cell>, in their method signatures.
  • HFileTarget, HBaseTarget, and HBaseSourceTarget all support any subclass of Cell as an output type. HFileSource and HBaseSourceTarget still return KeyValue as the input type for backward compatibility with existing Crunch pipelines.

Developers who are not ready to update their code can use the Cell-based APIs in the same way as the KeyValue-based APIs, but will probably need to change code inside DoFns, because HBase 0.99 and later deprecated or removed a number of methods from the HBase 0.96 API.
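
As a minimal sketch of the new Cell-based APIs (the pipeline that produces the input PCollection is assumed), a legacy PCollection<KeyValue> can be re-typed with HBaseTypes.cells(), since KeyValue implements Cell in HBase 0.96 and later:

    // Sketch: moving a Crunch pipeline from KeyValue to Cell.
    import org.apache.crunch.DoFn;
    import org.apache.crunch.Emitter;
    import org.apache.crunch.PCollection;
    import org.apache.crunch.io.hbase.HBaseTypes;
    import org.apache.hadoop.hbase.Cell;
    import org.apache.hadoop.hbase.KeyValue;

    public class CellsExample {
      // Re-emit legacy KeyValues as Cells, serialized with the new PType.
      static PCollection<Cell> toCells(PCollection<KeyValue> kvs) {
        return kvs.parallelDo(new DoFn<KeyValue, Cell>() {
          @Override
          public void process(KeyValue kv, Emitter<Cell> emitter) {
            emitter.emit(kv); // KeyValue implements Cell in HBase 0.96+
          }
        }, HBaseTypes.cells());
      }
    }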

Apache Flume

CDH 5.4.0 adds SSL and Kerberos support for the Thrift source and sink, and implements DatasetSink 2.0.

Apache Hadoop

HDFS

MapReduce

CDH 5.4.0 implements MAPREDUCE-5785, which simplifies MapReduce job configuration. Instead of having to set both the heap size (mapreduce.map.java.opts or mapreduce.reduce.java.opts) and the container size (mapreduce.map.memory.mb or mapreduce.reduce.memory.mb), you can now choose to set only one of them; the other is inferred from mapreduce.job.heap.memory-mb.ratio. If you do not specify either of them, the container size defaults to 1 GB and the heap size is inferred.

For jobs that do not set the heap size, the JVM size increases from 200 MB to a default 820 MB. This is adequate for most jobs, but streaming tasks might need more memory because their Java process can push total usage past the container size. This typically occurs only for tasks that rely on aggressive garbage collection to keep the heap under 200 MB.
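
For example, the following sketch sets only the container size for map tasks and lets the framework infer the heap size. The 0.8 ratio shown is an assumption, chosen to be consistent with the 1 GB container / 820 MB heap defaults described above:

    // Sketch: configuring only the container size (MAPREDUCE-5785).
    import org.apache.hadoop.conf.Configuration;

    public class HeapRatioExample {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Set only the container size; mapreduce.map.java.opts is left
        // unset, so the map JVM heap is inferred as
        // container size * mapreduce.job.heap.memory-mb.ratio.
        conf.setInt("mapreduce.map.memory.mb", 2048);
        conf.setFloat("mapreduce.job.heap.memory-mb.ratio", 0.8f);
        // With these values the inferred heap is roughly -Xmx1638m.
      }
    }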

YARN

Apache HBase

CDH 5.4.0 implements HBase 1.0. For detailed information and instructions on how to use the new capabilities, see New Features and Changes for HBase in CDH 5.

MultiWAL Support for HBase

CDH 5.4.0 introduces MultiWAL support for HBase region servers, allowing you to increase throughput when regions write to the write-ahead log (WAL). See Configuring MultiWAL Support.

doAs Impersonation for HBase

CDH 5.4.0 introduces doAs impersonation for the HBase Thrift server. doAs impersonation allows a client to authenticate to HBase as any user, and re-authenticate at any time, instead of as a static user only. See Configure doAs Impersonation for the HBase Thrift Gateway.
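
The following sketch shows the two hbase-site.xml switches that this feature relies on; the property names are assumptions based on the HBase Thrift documentation of this era, so verify them against your release. They are set on a Configuration object here only to illustrate the keys and values:

    // Sketch: prerequisites for doAs impersonation (assumed property names).
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class ThriftDoAsConfig {
      public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        conf.setBoolean("hbase.regionserver.thrift.http", true);  // HTTP transport
        conf.setBoolean("hbase.thrift.support.proxyuser", true);  // enable doAs
      }
    }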

Read Replicas for HBase

CDH 5.4.0 introduces read replicas, along with a new timeline consistency model. This feature allows you to balance consistency and availability on a per-read basis, and provides a measure of high availability for reads if a RegionServer becomes unavailable. See HBase Read Replicas.
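
On the client side, a timeline-consistent read can be requested per operation. Below is a minimal sketch using the HBase 1.0 client API; the table and row names are placeholders:

    // Sketch: a timeline-consistent Get that may be served by a replica.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Consistency;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class TimelineReadExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("t1"))) {
          Get get = new Get(Bytes.toBytes("row1"));
          get.setConsistency(Consistency.TIMELINE); // allow replica reads
          Result result = table.get(get);
          if (result.isStale()) {
            // Served by a secondary replica; may lag the primary.
          }
        }
      }
    }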

Storing Medium Objects (MOBs) in HBase

CDH 5.4.0 introduces HBase MOB, which allows you to store objects up to 10 MB (medium objects, or MOBs) directly in HBase while maintaining read and write performance. See Storing Medium Objects (MOBs) in HBase.

Apache Hive

CDH 5.4.0 implements Hive 1.1.0. New capabilities include:
  • A test-only version of Hive on Spark with the following limitations:
    • Parquet does not currently support vectorization; it simply ignores the setting of hive.vectorized.execution.enabled.
    • Hive on Spark does not yet support dynamic partition pruning.
    • Hive on Spark does not yet support HBase. If you want to interact with HBase, Cloudera recommends that you use Hive on MapReduce.
      Important: Hive on Spark is included in CDH 5.4.0 but is not currently supported or recommended for production use. If you are interested in this feature, try it out in a test environment while we address the remaining issues and limitations.
    To deploy and test Hive on Spark in a test environment, use Cloudera Manager (see Configuring Hive on Spark).
  • Support for JAR file changes without scheduled maintenance.
    To implement this capability, proceed as follows:
    1. Set hive.reloadable.aux.jars.path in /etc/hive/conf/hive-site.xml to the directory that contains the JAR files.
    2. Run the reload statement from a HiveServer2 client such as Beeline or the Hive JDBC driver (see the sketch after this list).
  • Beeline support for retrieving and printing query logs.
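
As a minimal illustration of step 2 above (assuming HiveServer2 at localhost:10000 and hive.reloadable.aux.jars.path already pointing at the JAR directory), the reload statement can also be issued through the Hive JDBC driver:

    // Sketch: triggering a reload of auxiliary JARs over JDBC.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class ReloadJarsExample {
      public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {
          // Picks up new or changed JARs without restarting HiveServer2.
          stmt.execute("reload");
        }
      }
    }
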
Some features in the upstream release are not yet supported for production use in CDH; these include:
  • HIVE-7935 - Support dynamic service discovery for HiveServer2
  • HIVE-6455 - Scalable dynamic partitioning and bucketing optimization
  • HIVE-5317 - Implement insert, update, and delete in Hive with full ACID support
  • HIVE-7068 - Integrate AccumuloStorageHandler
  • HIVE-7090 - Support session-level temporary tables in Hive
  • HIVE-7341 - Support for Table replication across HCatalog instances
  • HIVE-4752 - Add support for HiveServer2 to use Thrift over HTTP

Hue

CDH 5.4.0 adds the following:
  • New Oozie editor
  • Performance improvements
  • New Search facets
  • HBase impersonation

Kite

Kite in CDH has been rebased on the 1.0 release upstream. This breaks backward compatibility with existing APIs. The APIs are documented at http://kitesdk.org/docs/1.0.0/apidocs/index.html.

Notable changes are:

  • Dataset writers that implement flush and sync now extend the Flushable and Syncable interfaces, and writers that do not support these operations no longer have misleading flush and sync methods.
  • DatasetReaderException, DatasetWriterException, and DatasetRepositoryException have been removed and replaced with more specific exceptions, such as IncompatibleSchemaException. Exception classes now indicate what went wrong instead of what threw the exception.
  • The partition API is no longer exposed; use the view API instead.
  • kite-data-hcatalog is now kite-data-hive.
  Note:

From 1.0 on, Kite will be strict about compatibility and will use semantic versioning to signal which compatibility guarantees you can expect from a release (for example, incompatible changes require increasing the major version number). For more information, see the Hello, Kite SDK 1.0 blog post.

Apache Oozie

  • Added a Spark action, which lets you run Spark applications from Oozie workflows. See the Oozie documentation for more details.
  • The Hive2 action now collects and reports Hadoop Job IDs for MapReduce jobs launched by Hive Server 2.
  • The launcher job now uses YARN uber mode for all but the Shell action; this reduces the overhead (time and resources) of running these Oozie actions.

Apache Parquet (incubating)

  • The Parquet memory manager now changes the row group size if the current size is expected to cause out-of-memory (OOM) errors because too many files are open; when this happens, a WARN message is printed in the logs. A new setting, parquet.memory.pool.ratio, controls the percentage of the JVM heap that Parquet attempts to use.
  • To improve job startup time, footers are no longer read by default for MapReduce jobs (PARQUET-139).
      Note:

    To revert to the old behavior (ParquetFileReader reads all the files to obtain the footers), set parquet.task.side.metadata to false in the job configuration (see the sketch after this list).

  • The Parquet Avro object model can now read lists and maps written by Hive, Avro, and Thrift (similar capabilities were added to Hive in CDH 5.3). This compatibility fix does not change behavior. The extra record layer wrapping the list elements when Avro reads lists written by Hive can now be removed; to do this, set the expected Avro schema or set parquet.avro.add-list-element-records to false.
  • Avro's map representation now writes null values correctly.
  • The Parquet Thrift object model can now read data written by other object models (such as Hive, Impala, or Parquet-Avro), given a Thrift class for the data; compile a Thrift definition into an object, and supply it when creating the job.
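
A minimal sketch of toggling the settings mentioned above on a MapReduce job; the values shown restore the pre-5.4.0 behavior, and the job setup beyond the configuration is omitted:

    // Sketch: Parquet job settings from the notes above.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ParquetJobSettings {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Read footers on the client again (pre-PARQUET-139 behavior).
        conf.setBoolean("parquet.task.side.metadata", false);
        // Drop the extra record layer when Avro reads lists written by Hive.
        conf.setBoolean("parquet.avro.add-list-element-records", false);
        Job job = Job.getInstance(conf, "parquet-settings-example");
        // ... set input/output formats and paths as usual ...
      }
    }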

Cloudera Search

  • Solr metadata stored in ZooKeeper can now be protected by ZooKeeper ACLs. In a Kerberos-enabled environment, Solr metadata stored in ZooKeeper is owned by the solr user and cannot be modified by other users.

      Note:
    • The Solr principal name can be configured in Cloudera Manager. The default name is solr, although other names can be specified.
    • Collection configuration information stored under the /solr/configs znode is not affected by this change. As a result, collection configuration behavior is unchanged.

    Administrators who modify Solr ZooKeeper metadata through operations like solrctl init or solrctl cluster --put-solrxml must now supply solrctl with a JAAS configuration using the --jaas configuration parameter. The JAAS configuration must specify the principal, typically solr, that the solr process uses. See Solrctl Reference for more information.

    End users, who typically do not need to modify Solr metadata, are unaffected by this change.

  • Lily HBase Indexer metadata stored in ZooKeeper can now be protected by ZooKeeper ACLs. In a Kerberos-enabled environment, Lily HBase Indexer metadata stored in ZooKeeper is owned by the solr user and cannot be modified by other users.

    End users, who typically do not manage the Lily HBase Indexer, are unaffected by this change.

  • The Lily HBase Indexer supports restricting access using Sentry. For more information, see Sentry integration.
  • Services included with Search for CDH 5.4.0, including Solr, Key-Value Store Indexer, and Flume, now support SSL.
  • The Spark Indexer and the Lily HBase Batch Indexer support delegation tokens for mapper-only jobs. For more information, see Spark Indexing Reference (CDH 5.2 or later only) and HBaseMapReduceIndexerTool.
  • Search for CDH 5.4.0 implements SOLR-5746, which improves solr.xml file parsing. Error checking for duplicated options or unknown option names was added. These checks can help identify mistakes made during manual edits of the solr.xml file. User-modified solr.xml files may cause errors on startup due to these parsing improvements.
  • By default, CloudSolrServer now uses multiple threads to add documents.

      Note: Due to multithreading, if document addition is interrupted by an exception, some documents in addition to the one being added when the failure occurred may already have been added.

    To get the old, single-threaded behavior, set parallel updates to false on the CloudSolrServer instance, as in the sketch below.

    Related JIRA: SOLR-4816.
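
    A minimal sketch (the ZooKeeper connect string and collection name are placeholders):

      // Sketch: restoring single-threaded updates on CloudSolrServer.
      import org.apache.solr.client.solrj.impl.CloudSolrServer;

      public class SingleThreadedUpdates {
        public static void main(String[] args) throws Exception {
          CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181/solr");
          server.setDefaultCollection("collection1");
          server.setParallelUpdates(false); // old, single-threaded behavior
          // ... add documents as usual ...
          server.shutdown();
        }
      }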

  • Updates are routed directly to the correct shard leader, eliminating document routing at the server and allowing near-linear scaling of indexing throughput. Document routing requires that the SolrJ client know each document's unique identifier; the client uses the identifier to route each update directly to the correct shard. For additional information, see Shards and Indexing Data in SolrCloud.

    Related JIRA: SOLR-4816.

  • The loadSolr morphline command supports nested documents. For more information, see Morphlines Reference Guide.
  • Navigator can be used to audit Cloudera Search activity. For more information on the Solr operations that can be audited, see Audit Events and Audit Reports.
  • Search for CDH 5.4 supports logging queries before they are executed. This allows you to identify queries that could increase resource consumption, and to improve schemas or filters to meet your performance requirements. To enable this feature, set the SolrCore and SolrCore.Request log level to DEBUG.

    Related JIRA: SOLR-6919

  • UniqFieldsUpdateProcessorFactory has been improved to support all of the FieldMutatingUpdateProcessorFactory selector options. The <lst name="fields"> init param option is deprecated; replace it with <arr name="fieldName">.

    If the deprecated <lst name="fields"> init param option is used, Solr logs a warning.

    Related JIRA: SOLR-4249.
  • The oneOrMany and getBooleanArg configuration helper methods of FieldMutatingUpdateProcessorFactory are now deprecated. They have been moved to NamedList and renamed removeConfigArgs and removeBooleanArg, respectively.

    If the oneOrMany or getBooleanArg methods of FieldMutatingUpdateProcessorFactory are used, Solr logs a warning.

    Related JIRA: SOLR-5264.

Apache Spark

Spark in CDH 5.4.0 is rebased on Apache Spark 1.3.0 and provides the following new capabilities:
  • Spark Streaming WAL (write-ahead log) on HDFS, preventing any data loss on driver failure
  • Spark external shuffle service
  • Improvements in automatically setting CDH classpaths for Avro, Parquet, Flume, and Hive
  • Improvements in the collection of task metrics
  • Kafka connector for Spark Streaming to avoid the need for the HDFS WAL
The following is not yet supported in a production environment because of its immaturity:
  • Spark SQL (which now includes DataFrames)

See also Apache Spark Known Issues and Apache Spark Incompatible Changes.

Apache Sqoop

  • Sqoop 2:
    • CDH 5.4.0 implements Sqoop 2 version 1.99.5.
    • Sqoop 2 supports Kerberos as of CDH 5.4.0.
    • Sqoop 2 supports PostgreSQL as the repository database.

CDH 5 Requirements and Supported Versions

For the latest information on compatibility across all Cloudera products, see the Product Compatibility Matrix.

Supported Operating Systems

CDH 5 provides packages for RHEL-compatible, SLES, Ubuntu, and Debian systems, as described below.

Operating System                        Version                                                          Packages
Red Hat Enterprise Linux (RHEL)         5.7, 5.10, 6.4, 6.5, 6.5 (SELinux mode), 6.6                     64-bit
CentOS                                  5.7, 5.10, 6.4, 6.5, 6.5 (SELinux mode), 6.6                     64-bit
Oracle Linux (default kernel and UEK)   5.6 (UEK R2), 6.4 (UEK R2), 6.5 (UEK R2, UEK R3), 6.6 (UEK R3)   64-bit
SUSE Linux Enterprise Server (SLES)     11 SP2, 11 SP3                                                   64-bit
Ubuntu                                  Precise (12.04) LTS, Trusty (14.04) LTS                          64-bit
Debian                                  Wheezy (7.0)                                                     64-bit
  Note:
  • CDH 5 provides only 64-bit packages.
  • Cloudera has received reports that our RPMs work well on Fedora, but we have not tested this.
  • If you are using an operating system that is not supported by Cloudera packages, you can also download source tarballs from Downloads.

Supported Databases

Component     MySQL                   SQLite                        PostgreSQL                   Oracle       Derby (see Note 5)
Oozie         5.5, 5.6                -                             8.4, 9.2, 9.3 (see Note 2)   11gR2        Default
Flume         -                       -                             -                            -            Default (for the JDBC Channel only)
Hue           5.5, 5.6 (see Note 1)   Default                       8.4, 9.2, 9.3 (see Note 2)   11gR2        -
Hive/Impala   5.5, 5.6 (see Note 1)   -                             8.4, 9.2, 9.3 (see Note 2)   11gR2        Default
Sentry        5.5, 5.6 (see Note 1)   -                             8.4, 9.2, 9.3 (see Note 2)   11gR2        -
Sqoop 1       See Note 3              -                             See Note 3                   See Note 3   -
Sqoop 2       See Note 4              -                             See Note 4                   See Note 4   Default
  Note:
  1. MySQL 5.5 is supported on CDH 5.1. MySQL 5.6 is supported on CDH 5.1 and later. The InnoDB storage engine must be enabled in the MySQL server.
  2. PostgreSQL 9.2 is supported on CDH 5.1 and later. PostgreSQL 9.3 is supported on CDH 5.2 and later.
  3. For the purposes of transferring data only, Sqoop 1 supports MySQL 5.0 and above, PostgreSQL 8.4 and above, Oracle 10.2 and above, Teradata 13.10 and above, and Netezza TwinFin 5.0 and above. The Sqoop metastore works only with HSQLDB (1.8.0 and higher 1.x versions; the metastore does not work with any HSQLDB 2.x versions).
  4. Sqoop 2 can transfer data to and from MySQL 5.0 and above, PostgreSQL 8.4 and above, Oracle 10.2 and above, and Microsoft SQL Server 2012 and above. The Sqoop 2 repository database is supported only on Derby and PostgreSQL.
  5. Derby is supported as shown in the table, but not always recommended. See the pages for individual components in the Cloudera Installation and Upgrade guide for recommendations.

Supported JDK Versions

CDH 5 is supported with the JDK versions shown in the following table.

Latest Certified Version   Minimum Supported Version   Exceptions
1.7.0_75                   1.7.0_75                    None
1.8.0_40                   1.8.0_40                    None

Supported Internet Protocol

CDH requires IPv4. IPv6 is not supported.

See also Configuring Network Names.