Long term component architecture

As the main curator of open standards in Hadoop, Cloudera has a track record of bringing new open source solutions into its platform (such as Apache Spark, Apache HBase, and Apache Parquet) that are eventually adopted by the community at large. Because these components are standards, you can build long-term architecture on them with confidence.

 

PLEASE NOTE:

With the exception of DSSD support, Cloudera Enterprise 5.6.0 is identical to CDH 5.5.2/Cloudera Manager 5.5.3. If you do not need DSSD support and are already using the latest 5.5.x release, you do not need to upgrade.

 

Note: Clusters with mixed operating system types and versions are supported; however, using the same version of the same operating system on all cluster hosts is strongly recommended.

CDH 5 provides 64-bit packages for RHEL-compatible, SLES, Ubuntu, and Debian systems as listed below.

 

Operating System Version Packages
Red Hat Enterprise Linux (RHEL)-compatible
RHEL (+ SELinux mode in available versions) 5.7 64-bit
  5.10 64-bit
  6.4 64-bit
  6.5 64-bit
  6.6 64-bit
  6.7 64-bit
  7.1 64-bit
  7.2 64-bit
CentOS (+ SELinux mode in available versions) 5.7 64-bit
  5.10 64-bit
  6.4 64-bit
  6.5 64-bit
  6.6 64-bit
  6.7 64-bit
  7.1 64-bit
  7.2 64-bit
Oracle Linux with default kernel and Unbreakable Enterprise Kernel 5.7 (UEK R2) 64-bit
  5.10 64-bit
  5.11 64-bit
  6.4 (UEK R2) 64-bit
  6.5 (UEK R2, UEK R3) 64-bit
  6.6 (UEK R3) 64-bit
  6.7 (UEK R3) 64-bit
  7.1 64-bit
  7.2 64-bit
SLES
SUSE Linux Enterprise Server (SLES) 11 with Service Pack 2 64-bit
SUSE Linux Enterprise Server (SLES) 11 with Service Pack 3 64-bit
SUSE Linux Enterprise Server (SLES) 11 with Service Pack 4 64-bit
Ubuntu/Debian
Ubuntu Precise 12.04 - Long-Term Support (LTS) 64-bit
  Trusty 14.04 - Long-Term Support (LTS) 64-bit
Debian Wheezy 7.0, 7.1, and 7.8 64-bit
 
Important: Cloudera supports RHEL 7 with the following limitations:
 
Note:
  • Cloudera Enterprise is supported on platforms with Security-Enhanced Linux (SELinux) enabled. Cloudera is not responsible for policy support nor policy enforcement. If you experience issues with SELinux, contact your OS provider.
  • CDH 5.7 DataNode hosts with EMC® DSSD™ D5™ are supported on RHEL 6.6, 7.1, and 7.2. CDH 5.6 DataNode hosts with EMC® DSSD™ D5™ are supported only on RHEL 6.6.
Component | MariaDB | MySQL | SQLite | PostgreSQL | Oracle | Derby (see Note 5)
Oozie | 5.5 | 5.1, 5.5, 5.6, 5.7 | - | 8.1, 8.3, 8.4, 9.1, 9.2, 9.3, 9.4 (see Note 3) | 11gR2, 12c | Default
Flume | - | - | - | - | - | Default (for the JDBC Channel only)
Hue | 5.5 | 5.1, 5.5, 5.6, 5.7 (see Note 6) | Default | 8.1, 8.3, 8.4, 9.1, 9.2, 9.3, 9.4 (see Note 3) | 11gR2, 12c | -
Hive/Impala | 5.5 | 5.1, 5.5, 5.6, 5.7 (see Note 1) | - | 8.1, 8.3, 8.4, 9.1, 9.2, 9.3, 9.4 (see Note 3) | 11gR2, 12c | Default
Sentry | 5.5 | 5.1, 5.5, 5.6, 5.7 (see Note 1) | - | 8.1, 8.3, 8.4, 9.1, 9.2, 9.3, 9.4 (see Note 3) | 11gR2, 12c | -
Sqoop 1 | 5.5 | See Note 4 | - | See Note 4 | See Note 4 | -
Sqoop 2 | 5.5 | - | - | - | - | Default
 
  Note:
  1. MySQL 5.5 is supported on CDH 5.1. MySQL 5.6 is supported on CDH 5.1 and higher. The InnoDB storage engine must be enabled in the MySQL server.
  2. Cloudera Manager installation fails if GTID-based replication is enabled in MySQL.
  3. PostgreSQL 9.2 is supported on CDH 5.1 and higher. PostgreSQL 9.3 is supported on CDH 5.2 and higher. PostgreSQL 9.4 is supported on CDH 5.5 and higher.
  4. For purposes of transferring data only, Sqoop 1 supports MySQL 5.0 and above, PostgreSQL 8.4 and above, Oracle 10.2 and above, Teradata 13.10 and above, and Netezza TwinFin 5.0 and above. The Sqoop metastore works only with HSQLDB (1.8.0 and higher 1.x versions; the metastore does not work with any HSQLDB 2.x versions).
  5. Derby is supported as shown in the table, but not always recommended. See the pages for individual components in the Cloudera Installation and Upgrade guide for recommendations.
  6. CDH 5 Hue requires the default MySQL version of the operating system on which it is being installed, which is usually MySQL 5.1, 5.5, or 5.6.
  Important: JDK 1.6 is not supported on any CDH 5 release (even though the libraries of CDH 5.0-CDH 5.4 are compatible). Applications using CDH libraries must run a supported version of JDK 1.7 or higher, and one that also matches the JDK version of your CDH cluster.
 
CDH 5.7.x is supported with the versions shown in the following table:
Minimum Supported Version | Recommended Version | Exceptions
1.7.0_55 | 1.7.0_67, 1.7.0_75, 1.7.0_80 | None
1.8.0_31 | 1.8.0_60 | Cloudera recommends that you not use JDK 1.8.0_40.
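As a rough illustration of the JDK table above, the following Python sketch checks a JDK version string against the minimum versions listed for CDH 5.7.x. The function names and the parsing logic are illustrative assumptions and are not part of any Cloudera tool; the sketch only understands strings of the form 1.x.0_NN.

```python
import re

# Minimum supported JDK update levels for CDH 5.7.x, taken from the table above.
MINIMUM_UPDATE = {7: 55, 8: 31}
# JDK 1.8.0_40 is the one release the table recommends against.
DISCOURAGED = {(8, 40)}


def parse_jdk_version(version):
    """Parse a string such as '1.7.0_67' into (major, update)."""
    match = re.fullmatch(r"1\.(\d+)\.0_(\d+)", version)
    if match is None:
        raise ValueError("unrecognized JDK version: " + version)
    return int(match.group(1)), int(match.group(2))


def check_jdk(version):
    """Return 'unsupported', 'discouraged', or 'supported' (hypothetical labels)."""
    major, update = parse_jdk_version(version)
    minimum = MINIMUM_UPDATE.get(major)
    if minimum is None or update < minimum:
        return "unsupported"
    if (major, update) in DISCOURAGED:
        return "discouraged"
    return "supported"
```

A check like this could run before installation to catch hosts whose JDK predates the listed minimums; it does not verify that the JDK matches the version used by the rest of the cluster.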

Hue

Hue works with the two most recent versions of the following browsers. Cookies and JavaScript must be on.

  • Chrome
  • Firefox
  • Safari (not supported on Windows)
  • Internet Explorer

Hue might display in older versions of these browsers, and even in other browsers, but you might not have access to all of its features.


 

CDH requires IPv4. IPv6 is not supported.

See also Configuring Network Names.

 


The following components are supported by the indicated versions of Transport Layer Security (TLS):

 

Table 1. Components Supported by TLS

Component Role Name Port Version
Flume   Avro Source/Sink   TLS 1.2
Flume   Flume HTTP Source/Sink   TLS 1.2
HBase Master HBase Master Web UI Port 60010 TLS 1.2
HDFS NameNode Secure NameNode Web UI Port 50470 TLS 1.2
HDFS Secondary NameNode Secure Secondary NameNode Web UI Port 50495 TLS 1.2
HDFS HttpFS REST Port 14000 TLS 1.1, TLS 1.2
Hive HiveServer2 HiveServer2 Port 10000 TLS 1.2
Hue Hue Server Hue HTTP Port 8888 TLS 1.2
Cloudera Impala Impala Daemon Impala Daemon Beeswax Port 21000 TLS 1.2
Cloudera Impala Impala Daemon Impala Daemon HiveServer2 Port 21050 TLS 1.2
Cloudera Impala Impala Daemon Impala Daemon Backend Port 22000 TLS 1.2
Cloudera Impala Impala Daemon Impala Daemon HTTP Server Port 25000 TLS 1.2
Cloudera Impala Impala StateStore StateStore Service Port 24000 TLS 1.2
Cloudera Impala Impala StateStore StateStore HTTP Server Port 25010 TLS 1.2
Cloudera Impala Impala Catalog Server Catalog Server HTTP Server Port 25020 TLS 1.2
Cloudera Impala Impala Catalog Server Catalog Server Service Port 26000 TLS 1.2
Oozie Oozie Server Oozie HTTPS Port 11443 TLS 1.1, TLS 1.2
Solr Solr Server Solr HTTP Port 8983 TLS 1.1, TLS 1.2
Solr Solr Server Solr HTTPS Port 8985 TLS 1.1, TLS 1.2
YARN ResourceManager ResourceManager Web Application HTTP Port 8090 TLS 1.2
YARN JobHistory Server MRv1 JobHistory Web Application HTTP Port 19890 TLS 1.2
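Because most of the endpoints in the table accept only TLS 1.2, a client that talks to them should be configured not to negotiate older protocol versions. The following is a minimal Python sketch of such a client-side configuration using the standard ssl module; the function name is an assumption made for illustration, and the sketch says nothing about how the CDH services themselves are configured.

```python
import ssl


def make_tls12_client_context():
    """Build a client context that refuses protocol versions older than TLS 1.2."""
    # create_default_context() enables certificate and hostname verification.
    context = ssl.create_default_context()
    # Reject TLS 1.0 and TLS 1.1 during the handshake.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context
```

The resulting context can be passed to any standard-library client that accepts one, for example an HTTPS connection to a Hue server on port 8888. Endpoints listed with "TLS 1.1, TLS 1.2" also work with this context, since TLS 1.2 is within their supported range.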

What's New in CDH 5.7.0

Operating System Support

  • Operating Systems - Support for:
    • RHEL/CentOS 6.6, 6.7, 7.1, and 7.2
    • Oracle Enterprise Linux (OEL) 7.1 and 7.2
    • SUSE Linux Enterprise Server (SLES) 11 with Service Packs 2, 3, 4
    • Debian: Wheezy 7.0, 7.1, and 7.8
      Important: Cloudera supports RHEL 7 with the following limitations:

Apache Hadoop

  • HDFS-8873 - You can rate-limit the directoryScanner. A new configuration property, dfs.datanode.directoryscan.throttle.limit.ms.per.sec, allows you to reduce the impact on disk performance of directory scanning. The default value is 1000.
  • HDFS-9260 - Garbage collection of full block reports is improved. Data structures for BlockInfo and replicas were changed to keep them sorted. This allows for faster and easier garbage collection of full block reports.
  • HADOOP-12764 - You can configure KMS maxHttpHeaderSize. The default value of KMS maxHttpHeaderSize increased from 4096 to 65536 and is now configurable in service.xml.
  • HADOOP-10651 - Service access can be restricted by IP and hostname. You can now define a whitelist and blacklist per service. The default whitelist is * and the default blacklist is empty.
  • Improve support for heterogeneous storage. Heterogeneous Storage Management (HSM) is now supported natively in Cloudera Manager.
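The whitelist and blacklist behavior described for HADOOP-10651 can be pictured with a small Python sketch. This is a simplified model under stated assumptions, not Hadoop's implementation: it handles only exact host strings and the * wildcard, it assumes the blacklist takes precedence over the whitelist, and the function and constant names are hypothetical.

```python
# Defaults as described above: whitelist is "*", blacklist is empty.
DEFAULT_WHITELIST = ["*"]
DEFAULT_BLACKLIST = []


def is_host_allowed(host, whitelist=DEFAULT_WHITELIST, blacklist=DEFAULT_BLACKLIST):
    """Return True if host matches the whitelist and is not on the blacklist.

    Only exact matches and the '*' wildcard are modeled here.
    """
    in_whitelist = "*" in whitelist or host in whitelist
    in_blacklist = host in blacklist
    return in_whitelist and not in_blacklist
```

With the defaults, every host is admitted. Adding a host to the blacklist blocks it even when the whitelist is *, and replacing * with an explicit list restricts access to the listed hosts.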

Apache HBase

See also Apache HBase Known Issues.

  • CDH 5.7.0 adds support for a snapshot owner for a table. To configure a table snapshot owner, set the OWNER attribute on the table to a valid HBase user. By default, the table owner is the user who created the table. The table owner and the table creator can restore the table from a snapshot.
  • You can set an HDFS storage policy to store write-ahead logs (WALs) on solid-state drives (SSDs) or a mix of SSDs and spinning disks. See Configuring the Storage Policy for the Write-Ahead Log (WAL).
  • Configuration for snapshot timeouts has been simplified. A single configuration option, hbase.snapshot.master.timeout.millis, controls how long the master waits for a response from the RegionServer before timing out. The default timeout value has been increased from 60,000 milliseconds to 300,000 to accommodate larger tables.
  • You can optionally balance a table's regions by size by setting the hbase.normalizer.enabled property to true. The default value is false. To configure the time interval for the HMaster to check for region balance, set the hbase.normalizer.period property to the interval, in milliseconds. The default value is 1800000, or 30 minutes.
  • You can configure parallel request cancellation for multi-get operations using the hbase.client.replica.interrupt.multiget configuration property, using an advanced configuration snippet in Cloudera Manager, or in hbase-site.xml if you do not use Cloudera Manager.
  • The hbase-spark module has been added, which provides support for using HBase data in Spark jobs using HBaseContext and JavaHBaseContext contexts. See the HBase and Spark chapter of the Apache HBase Reference Guide for details about building a Spark application with HBase support.
  • The REST API now supports creating, reading, updating, and deleting namespaces.
  • If you use the G1 garbage collector, you can disable the BoundedByteBufferPool. See Disabling the BoundedByteBufferPool.
  • The HBase web UI includes graphical tools for managing MemStore and StoreFile details for a region.
  • The HBase web UI displays the number of regions a table has.
  • Two new methods of the Scan API, setColumnFamilyTimeRange and getColumnFamilyTimeRange, allow you to limit Scan results to versions of columns within a specified timestamp range.
  • A new API, MultiRowRangeFilter, allows you to scan multiple row ranges in a single scan. See MultiRowRangeFilter.
  • A new client API, SecurityCapability, enables you to check whether the HBase server supports cell-level security.
  • When using the scan command in HBase Shell, you can use the ROWPREFIXFILTER option to include only rows matching a given prefix, in addition to any other filters you specify. For example:
    hbase> scan 't1', {ROWPREFIXFILTER => 'row2',
                             FILTER => (QualifierFilter (>=, 'binary:xyz')) AND (TimestampsFilter ( 123, 456))}
    
  • The new get_splits HBase Shell command returns the split points for a table. For example:
    hbase> get_splits 't2'
    Total number of splits = 5
    
    => ["", "10", "20", "30", "40"]
    
  • Three new commands relating to the region normalizer have been added to the HBase Shell:
    • normalizer_enabled checks whether the region normalizer is enabled.
    • normalizer_switch toggles the region normalizer on or off.
    • normalize runs the region normalizer if it is enabled.
  • When configuring a ZooKeeper quorum for HBase, you can now specify the port for each ZooKeeper host separately, instead of using the same port for each ZooKeeper host. The configuration property hbase.zookeeper.clientPort is no longer required. For example:
    <property>
      <name>hbase.zookeeper.quorum</name>
      <value>zk1.example.com:2181,zk2.example.com:20000,zk3.example.com:31111</value>
    </property>
    
  • A new configuration option, hbase.regionserver.hostname, was added, but Cloudera recommends against its use. See Compatibility Notes for CDH 5.7.
  • Two new configuration options allow you to disable loading of all coprocessors or only table-level coprocessors: hbase.coprocessor.enabled (which defaults to true) and hbase.coprocessor.user.enabled (which also defaults to true). Cloudera does not recommend disabling HBase-wide coprocessors, because security functionality is implemented using coprocessors. However, disabling table-level coprocessors may be appropriate in some cases. See Disabling Loading of Coprocessors.
  • A new configuration option, hbase.loadincremental.validate.hfile, allows you to skip validation of HFiles during a bulk load operation when set to false. The default setting is true.
  • The default PermSize for HBase processes is now set to 128 MB. This setting is effective only for JDK 7, because JDK 8 ignores the PermSize setting. To change this setting, edit the HBase Client Environment Advanced Configuration Snippet (Safety Valve) for hbase-env.sh if you use Cloudera Manager or conf/hbase-env.sh otherwise.
  • In CDH 5.6 and lower, the HBaseFsck#checkRegionConsistency() method would throw an IOException if a region repair operation timed out after hbase.hbck.assign.timeout (which defaults to 120 seconds). This exception would cause the entire hbck operation to fail. In CDH 5.7.0, if the region being repaired is not hbase:meta or another system table, the region is skipped, an error is logged, and the hbck operation continues. This new behavior is disabled by default; to enable it, set the hbase.hbck.skipped.regions.limit option to an integer greater than 0. If more than this number of regions is skipped, the hbck operation fails.
  • Two new options, -exclusive and -disableBalancer, have been added to the hbck utility. The hbck utility now runs without locks unless in fixer mode, and the balancer is only disabled in fixer mode, by default. You can disable these options to retain the old behavior, but Cloudera recommends using the new default behavior.
  • Two new MapReduce jobs, SyncTable and HashTable, allow you to synchronize two different HBase tables that are each receiving live writes. To print usage instructions, run the job with no arguments. These examples show how to run these jobs:
    $ bin/hbase org.apache.hadoop.hbase.mapreduce.SyncTable \
                                                --dryrun=true \
                                                --sourcezkcluster=zk1.example.com,zk2.example.com,zk3.example.com:2181:/hbase \
                                                hdfs://nn:9000/hashes/tableA \
                                                tableA tableA
    
    $ bin/hbase org.apache.hadoop.hbase.mapreduce.HashTable \
                                                --batchsize=32000 \
                                                --numhashfiles=50 \
                                                --starttime=1265875194289 \
                                                --endtime=1265878794289 \
                                                --families=cf2,cf3 \
                                                TestTable /hashes/testTable
    
  • The CopyTable command now allows you to override the org.apache.hadoop.hbase.mapreduce.TableOutputFormat property by prefixing the property keys with the hbase.mapred.output. prefix. For example, hbase.mapred.output.hbase.security.authentication is passed to CopyTable as hbase.security.authentication. This is useful when directing output to a peer cluster with different configuration settings.
  • Several improvements have been made to the HBase canary, including:
    • Sniffing of regions and RegionServers is now parallel to improve performance in large clusters with more than 1000 regions and more than 500 RegionServers.
    • The canary sets cachecheck to false when performing Gets and Scans to avoid influencing the BlockCache.
    • FirstKeyOnlyFilter is used during Gets and Scans to improve performance in a flat wide table.
    • A region is selected at random when sniffing a RegionServer.
    • The sink class used by the canary is now configurable.
    • A new flag, -allRegions, sniffs all regions on a RegionServer if running in RegionServer mode.
  • Distributed log replay has been disabled in CDH 5.7 and higher, due to reliability issues. Distributed log replay is unsupported.
  • A new configuration option, hbase.hfile.drop.behind.compaction, causes the OS-level filesystem cache to be dropped behind compactions. This provides significant performance improvements on large compactions. To disable this behavior, set the option to false. It defaults to true.
  • Two new configuration options, hbase.hstore.compaction.max.size and hbase.hstore.compaction.max.size.offpeak, allow you to specify a maximum compaction size during normal hours and during off-peak hours, if the off-peak feature is used. If unspecified, hbase.hstore.compaction.max.size defaults to 9,223,372,036,854,775,807 (Long.MAX_VALUE), which is essentially unbounded.
  • The HDFS replication factor can now be specified per column family by setting the DFS_REPLICATION attribute of the column family to the desired number of replicas, either at table creation or by altering an existing table schema. If set to 0, the default replication factor is used. If fewer than the desired number of replicas exist, the HDFS FSCK utility reports it.
  • If you use the ChaosMonkey tool, you can now load a custom ChaosMonkey implementation by passing the class to the -m or --monkey option of the ChaosMonkey tool, in the same way that you would normally pass SLOW or CALM.
  • A new configuration option, hbase.use.dynamic.jars, allows you to disable the dynamic classloader if set to false.
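To make the per-host port form of hbase.zookeeper.quorum described earlier in this list more concrete, the following Python sketch splits a quorum string into host and port pairs. It is an illustration only and not HBase code: the function name is hypothetical, and falling back to port 2181 for entries without an explicit port is an assumption made for the example.

```python
DEFAULT_CLIENT_PORT = 2181  # assumed fallback when an entry has no explicit port


def parse_quorum(quorum, default_port=DEFAULT_CLIENT_PORT):
    """Split 'host1:port1,host2,...' into a list of (host, port) tuples."""
    servers = []
    for entry in quorum.split(","):
        entry = entry.strip()
        if not entry:
            continue  # ignore empty entries such as a trailing comma
        host, separator, port = entry.partition(":")
        servers.append((host, int(port) if separator else default_port))
    return servers


# The quorum value from the example earlier in this list.
servers = parse_quorum(
    "zk1.example.com:2181,zk2.example.com:20000,zk3.example.com:31111"
)
```

Each ZooKeeper host keeps its own port, which is why a single shared client port setting is no longer required.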

Apache Hive

  • HIVE-9298 - Supports reading alternate timestamp formats. The SerDe property, timestamp.formats, is added to allow you to pass in a comma-delimited list of alternate timestamp formats. For example, the following ALTER TABLE statement adds two formats (in this case, using Joda date-time pattern syntax): yyyy-MM-dd'T'HH:mm:ss, and milliseconds since the Unix epoch, represented by the special-case pattern millis.
    ALTER TABLE timestamp_formats SET SERDEPROPERTIES ("timestamp.formats"="yyyy-MM-dd'T'HH:mm:ss,millis");
    
  • HIVE-7292 - Hive on Spark (HoS) General Availability. This release supports running Hive jobs using Spark as an additional execution backend. This improves performance for Hive queries, especially those involving multiple reduce stages. See Running Hive on Spark.
  • HIVE-12338 - Add Web UI to HiveServer2. Exposes a Web UI on the HiveServer2 service to surface process metrics (JVM stats, open session, and similar) and per-query runtime information (query plan, performance logs). The Web UI is available on port 10002 by default. See HiveServer2 Web UI.
  • HIVE-10115 - Handle LDAP and Kerberos authentication on the same HiveServer2 interface. In a Kerberized cluster, when an alternate authentication mechanism is enabled on HiveServer2, the service also accepts Kerberos authentication. Before this enhancement, when LDAP authentication to HiveServer2 was enabled, the service blocked acceptance of other authentication mechanisms.
  • Hive Metastore / HiveServer2 metadata scalability improvements. Fixes to improve scalability and performance include support for DirectSQL by default for metastore operations and incremental memory usage improvements.
  • HIVE-12271 - Add additional catalog and query execution metrics for Hive Metastore / HiveServer2 and integrate with Cloudera Manager. Additional Hive metrics include catalog information (number of databases, tables, and partitions in Hive Metastore) and query execution stats (number of worker threads used, job submission times, and statistics for planning and compilation times). All metrics are available in Cloudera Manager.
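The idea behind the timestamp.formats property from HIVE-9298, trying a list of alternate formats in order with millis as a special case, can be sketched in Python. This is an analogy and not Hive's parser: Python's strptime patterns stand in for the Joda patterns (so %Y-%m-%dT%H:%M:%S plays the role of yyyy-MM-dd'T'HH:mm:ss), the function name is hypothetical, and treating every value as UTC is an assumption of the sketch.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_timestamp(text, formats):
    """Try each format in order; 'millis' means milliseconds since the Unix epoch."""
    for fmt in formats:
        if fmt == "millis":
            try:
                millis = int(text)
            except ValueError:
                continue  # not an integer, try the next format
            return EPOCH + timedelta(milliseconds=millis)
        try:
            # Assumption: values carry no zone information and are read as UTC.
            return datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    raise ValueError("no configured format matches: " + text)


# Python equivalents of the two formats in the ALTER TABLE example.
FORMATS = ["%Y-%m-%dT%H:%M:%S", "millis"]
```

With this list, both an ISO-style string and a plain integer of epoch milliseconds parse to the same kind of value, which mirrors how the SerDe property lets one column accept several input encodings.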

Hue

  • Hive metastore service improvements make browsing Hive data faster and easier.
  • SQL editor improvements:
    • SQL assist scales to thousands of tables and databases.
    • Improved row and column headers.
    • Button to delete the query history.
    • Button to format SQL.
  • Security improvements:
  • Oozie application improvements:
    • Display graph of external workflow.
    • Option to dry run jobs on submission.
    • User timezone recognition.
    • Automatic email on failure.
    • Ability to execute individual actions.

Apache Impala (incubating)

MapReduce

  • The mapred job -history command provides an efficient way to fetch history for MapReduce jobs.

Apache Oozie

  • OOZIE-2411 - The Oozie email action now supports BCC:, as well as To: and CC:.

Cloudera Search

  • Improvements to loadSolr. For additional information about these improvements, see the Morphlines Reference Guide. Improvements include:
    • loadSolr retries SolrJ requests if exceptions, such as connection resets, occur. This is useful in cases such as temporary Solr server overloads or transient connectivity failures. This behavior is enabled by default, but can be disabled by setting the Java system property org.kitesdk.morphline.solr.LoadSolrBuilder.disableRetryPolicyByDefault=true.
    • loadSolr includes an optional rate-limiting parameter that is used to set the maximum number of morphline records to ingest per second. The default value is no limit.
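The retry behavior described for loadSolr follows a common pattern: repeat a request when a transient connection error occurs, and give up after a bounded number of attempts. The Python sketch below shows that general pattern only. It is not the Morphlines implementation, and the function name, the attempt count, and the delay are hypothetical values chosen for illustration.

```python
import time


def send_with_retry(send, max_attempts=3, delay_seconds=0.0,
                    retry_on=(ConnectionError,)):
    """Call send() until it succeeds or max_attempts is exhausted.

    Only exceptions listed in retry_on are retried; anything else propagates
    immediately. The last retried exception is re-raised on final failure.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except retry_on as error:
            last_error = error
            if attempt < max_attempts:
                time.sleep(delay_seconds)  # pause before the next attempt
    raise last_error
```

A connection reset (ConnectionResetError is a subclass of ConnectionError) is retried, so a briefly overloaded server does not fail the whole ingest. Errors outside retry_on surface at once rather than being masked.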

Apache Spark

  • Spark is rebased on Apache Spark 1.6.0.
  • SPARK-10000 - Spark 1.6.0 includes a new unified memory manager. The new memory manager is turned off by default (unlike Apache Spark 1.6.0), to make it easier for users to migrate existing workloads, but it is supported.
  • SPARK-9999 - Spark 1.6.0 introduces a new Dataset API. However, this API is experimental, likely to change, and unsupported.
  • SPARK-6028 and SPARK-6230 - Encryption support for Spark RPC
  • SPARK-2750 - HTTPS support for History Server and web UI
  • Added support for Spark SQL (DataFrames) in PySpark.
  • Added support for the following MLlib features:
    • spark.ml
    • ML pipeline APIs
  • The hbase-spark module has been added, which provides support for using HBase data in Spark jobs using HBaseContext and JavaHBaseContext contexts. See the HBase and Spark chapter of the Apache HBase Reference Guide for details about building a Spark application with HBase support.

YARN

  • Preemption guarantees that important tasks are not starved for resources, while allowing the CDH cluster to be used for experimental and research tasks. In FairScheduler, you can now disable preemption for a specific queue to ensure that its resources are not taken for other tasks.
