Long-term component architecture
As the main curator of open standards in Hadoop, Cloudera has a track record of bringing new open source solutions into its platform (such as Apache Spark, Apache HBase, and Apache Parquet) that are eventually adopted by the community at large. Because these components are standards, you can build long-term architecture on them with confidence.
With the exception of DSSD support, Cloudera Enterprise 5.6.0 is identical to CDH 5.5.2/Cloudera Manager 5.5.3. If you do not need DSSD support and are already using the latest 5.5.x release, you do not need to upgrade.
- System Requirements
- What's New
- Supported Operating Systems
- Supported Databases
- Supported JDK Versions
- Supported Internet Protocol
Supported Operating Systems
Supported Databases
|Component||MySQL||SQLite||PostgreSQL||Oracle||Derby - see Note 5|
|Oozie||5.5, 5.6||-||8.4, 9.2, 9.3 (see Note 2)||…||…|
|Flume||-||-||-||-||Default (for the JDBC Channel only)|
|…||… (see Note 1)||Default||8.4, 9.2, 9.3 (see Note 2)||…||…|
|…||… (see Note 1)||-||8.4, 9.2, 9.3 (see Note 2)||…||…|
|…||… (see Note 1)||-||8.4, 9.2, 9.3 (see Note 2)||…||…|
|Sqoop 1||See Note 3||-||See Note 3||See Note 3||-|
|Sqoop 2||See Note 4||-||See Note 4||See Note 4||Default|
- Note 1: MySQL 5.5 is supported on CDH 5.1. MySQL 5.6 is supported on CDH 5.1 and later. The InnoDB storage engine must be enabled in the MySQL server.
- Note 2: PostgreSQL 9.2 is supported on CDH 5.1 and later. PostgreSQL 9.3 is supported on CDH 5.2 and later.
- Note 3: For the purposes of transferring data only, Sqoop 1 supports MySQL 5.0 and above, PostgreSQL 8.4 and above, Oracle 10.2 and above, Teradata 13.10 and above, and Netezza TwinFin 5.0 and above. The Sqoop metastore works only with HSQLDB (1.8.0 and higher 1.x versions; the metastore does not work with any HSQLDB 2.x versions).
- Note 4: Sqoop 2 can transfer data to and from MySQL 5.0 and above, PostgreSQL 8.4 and above, Oracle 10.2 and above, and Microsoft SQL Server 2012 and above. The Sqoop 2 repository database is supported only on Derby and PostgreSQL.
- Note 5: Derby is supported as shown in the table, but not always recommended. See the pages for individual components in the Cloudera Installation and Upgrade guide for recommendations.
Supported JDK Versions
|Minimum Supported Version||Recommended Version||Notes|
|1.7.0_55||1.7.0_67 or 1.7.0_75||None|
Supported Internet Protocol
What's New in CDH 5.4.0
Upgrading to CDH 5.4.0 and later from any earlier release requires an HDFS metadata upgrade.
- If you are using Cloudera Manager to upgrade CDH, see Upgrading CDH and Managed Services Using Cloudera Manager.
- If you are running an earlier CDH 5 release and have an Enterprise License, you can perform a rolling upgrade: see Performing a Rolling Upgrade on a CDH 5 Cluster.
- If you are not using Cloudera Manager, see Upgrading Unmanaged CDH Using the Command Line.
Be careful to follow all of the upgrade steps as instructed.
For the latest Impala features, see New Features in Impala Version 2.2.0 / CDH 5.4.0.
Operating System Support
CDH 5.4.0 adds support for RHEL and CentOS 6.6. See CDH 5 Requirements and Supported Versions.
The following summarizes new security capabilities in CDH 5.4.0:
- Secure Hue impersonation support for the Hue HBase application.
- Redaction of sensitive data from logs, centrally managed by Cloudera Manager, which prevents the WHERE clause in queries from leaking sensitive data into logs and management UIs.
- Cloudera Manager support for custom Kerberos principals.
- Kerberos support for Sqoop 2.
- Kerberos and TLS/SSL support for Flume Thrift source and sink.
- Navigator SAML support (requires Cloudera Manager).
- Navigator Key Trustee can now be installed and monitored by Cloudera Manager.
- Search can be configured to use SSL.
- Search supports protecting Solr and Lily HBase Indexer metadata using ZooKeeper ACLs in a Kerberos-enabled environment.
New HBase-related features:
- HBaseTypes.cells() was added to support serializing HBase Cell objects.
- All of the HFileUtils methods now support PCollection<C extends Cell>, which includes both PCollection<KeyValue> and PCollection<Cell>, in their method signatures.
- HFileTarget, HBaseTarget, and HBaseSourceTarget all support any subclass of Cell as an output type. HFileSource and HBaseSourceTarget still return KeyValue as the input type for backward compatibility with existing Crunch pipelines.
Developers can use Cell-based APIs in the same way as KeyValue-based APIs if they are not ready to update their code, but will probably have to change code inside DoFns because HBase 0.99 and later APIs deprecated or removed a number of methods from the HBase 0.96 API.
CDH 5.4.0 adds SSL and Kerberos support for the Thrift source and sink, and implements DatasetSink 2.0.
- CDH 5.4.0 implements HDFS 2.6.0.
- CDH 5.4.0 HDFS provides hot-swap capability for DataNode disk drives. You can add or replace HDFS data volumes without shutting down the DataNode host (HDFS-1362); see Performing Disk Hot Swap for DataNodes.
- CDH 5.4.0 introduces cluster-wide redaction of sensitive data in logs and SQL queries. See Sensitive Data Redaction.
- CDH 5.4.0 adds support for Heterogeneous Storage Policies.
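The log and query redaction mentioned above can be pictured as a set of search-and-replace rules applied to each line before it is written. The following is only a minimal sketch of that idea: the rule format, the two patterns, and the `redact` function are hypothetical illustrations, not the rule schema Cloudera Manager actually uses.

```python
import re

# Hypothetical rules: (compiled search pattern, replacement text).
# These patterns are illustrative assumptions, not Cloudera's shipped rules.
RULES = [
    # 16-digit card-like numbers, optionally grouped by spaces or hyphens
    (re.compile(r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b"), "XXXX-XXXX-XXXX-XXXX"),
    # US Social Security number layout
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "XXX-XX-XXXX"),
]


def redact(line, rules=RULES):
    """Apply every rule in order and return the redacted line."""
    for pattern, replacement in rules:
        line = pattern.sub(replacement, line)
    return line


if __name__ == "__main__":
    print(redact("SELECT name FROM customers WHERE ssn = '123-45-6789'"))
```

Applying the rules centrally, rather than in each component, is what keeps a literal in a WHERE clause from reaching any log or management UI unredacted.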
CDH 5.4.0 implements MAPREDUCE-5785, which simplifies MapReduce job configuration. Instead of having to set both the heap size (mapreduce.map.java.opts or mapreduce.reduce.java.opts) and the container size (mapreduce.map.memory.mb or mapreduce.reduce.memory.mb), you can now choose to set only one of them; the other is inferred from mapreduce.job.heap.memory-mb.ratio. If you do not specify either of them, the container size defaults to 1 GB and the heap size is inferred.
For jobs that do not set the heap size, the JVM size increases from 200 MB to a default 820 MB. This is adequate for most jobs, but streaming tasks might need more memory because the Java process causes total usage to exceed the container size. This typically occurs only for those tasks relying on aggressive garbage collection to keep the heap under 200 MB.
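The inference described above can be worked through with a small calculation. This sketch assumes a ratio of 0.8 for mapreduce.job.heap.memory-mb.ratio and assumes results are rounded up to a whole megabyte; both are assumptions chosen because they reproduce the 820 MB figure quoted above, and the function name `infer_memory` is hypothetical.

```python
import math

DEFAULT_HEAP_RATIO = 0.8     # assumed value of mapreduce.job.heap.memory-mb.ratio
DEFAULT_CONTAINER_MB = 1024  # the 1 GB container default stated in the text


def infer_memory(container_mb=None, heap_mb=None, ratio=DEFAULT_HEAP_RATIO):
    """Return (container_mb, heap_mb), inferring whichever value is missing."""
    if container_mb is None and heap_mb is None:
        container_mb = DEFAULT_CONTAINER_MB
    if heap_mb is None:
        # Heap is a fraction of the container (rounding up is an assumption).
        heap_mb = math.ceil(container_mb * ratio)
    elif container_mb is None:
        # Container must be large enough that heap fits within the ratio.
        container_mb = math.ceil(heap_mb / ratio)
    return container_mb, heap_mb


if __name__ == "__main__":
    print(infer_memory())                   # neither value set
    print(infer_memory(container_mb=2048))  # only the container size set
```

With neither value set, the sketch yields a 1024 MB container and an 820 MB heap, which matches the default described above.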
- YARN-2990 improves application launch time by 6 seconds when using FairScheduler (with the default Cloudera Manager settings shown in YARN (MR2 Included) Properties in CDH 5.4.0).
CDH 5.4.0 implements HBase 1.0. For detailed information and instructions on how to use the new capabilities, see New Features and Changes for HBase in CDH 5.
MultiWAL Support for HBase
CDH 5.4.0 introduces MultiWAL support for HBase region servers, allowing you to increase throughput when a region writes the write-ahead log (WAL). See Configuring MultiWAL Support.
doAs Impersonation for HBase
CDH 5.4.0 introduces doAs impersonation for the HBase Thrift server. doAs impersonation allows a client to authenticate to HBase as any user, and re-authenticate at any time, instead of as a static user only. See Configure doAs Impersonation for the HBase Thrift Gateway.
Read Replicas for HBase
CDH 5.4.0 introduces read replicas, along with a new timeline consistency model. This feature allows you to balance consistency and availability on a per-read basis, and provides a measure of high availability for reads if a RegionServer becomes unavailable. See HBase Read Replicas.
Storing Medium Objects (MOBs) in HBase
CDH 5.4.0 HBase MOB allows you to store objects up to 10 MB (medium objects, or MOBs) directly in HBase while maintaining read and write performance. See Storing Medium Objects (MOBs) in HBase.
CDH 5.4.0 implements Hive 1.1.0. New capabilities include:
- A test-only version of Hive on Spark with the following limitations:
- Parquet does not currently support vectorization; it simply ignores the setting of hive.vectorized.execution.enabled.
- Hive on Spark does not yet support dynamic partition pruning.
- Hive on Spark does not yet support HBase. If you want to interact with HBase, Cloudera recommends that you use Hive on MapReduce.
Important: Hive on Spark is included in CDH 5.4.0 but is currently neither supported nor recommended for production use. If you are interested in this feature, try it out in a test environment until we address the issues and limitations needed for production-readiness. To deploy and test Hive on Spark in a test environment, use Cloudera Manager (see Configuring Hive on Spark).
- Support for changing JAR files without scheduled maintenance.
To implement this capability, proceed as follows:
- Set hive.reloadable.aux.jars.path in /etc/hive/conf/hive-site.xml to the directory that contains the JAR files.
- Execute the reload; statement on HiveServer2 clients such as Beeline and the Hive JDBC.
- Beeline support for retrieving and printing query logs.
Some features in the upstream release are not yet supported for production use in CDH; these include:
- HIVE-7935 - Support dynamic service discovery for HiveServer2
- HIVE-6455 - Scalable dynamic partitioning and bucketing optimization
- HIVE-5317 - Implement insert, update, and delete in Hive with full ACID support
- HIVE-7068 - Integrate AccumuloStorageHandler
- HIVE-7090 - Support session-level temporary tables in Hive
- HIVE-7341 - Support for Table replication across HCatalog instances
- HIVE-4752 - Add support for HiveServer2 to use Thrift over HTTP
CDH 5.4.0 adds the following:
- New Oozie editor
- Performance improvements
- New Search facets
- HBase impersonation
Kite in CDH has been rebased on the 1.0 release upstream. This breaks backward compatibility with existing APIs. The APIs are documented at http://kitesdk.org/docs/1.0.0/apidocs/index.html.
Notable changes are:
- Dataset writers that implement flush and sync now extend the Flushable and Syncable interfaces, and writers that do not support these operations no longer have misleading flush and sync methods.
- DatasetReaderException, DatasetWriterException, and DatasetRepositoryException have been removed and replaced with more specific exceptions, such as IncompatibleSchemaException. Exception classes now indicate what went wrong instead of what threw the exception.
- The partition API is no longer exposed; use the view API instead.
- kite-data-hcatalog is now kite-data-hive.
From 1.0 on, Kite will be strict about breaking compatibility and will use semantic versioning to signal which compatibility guarantees you can expect from a release (for example, incompatible changes require increasing the major version number). For more information, see the Hello, Kite SDK 1.0 blog post.
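As a rough illustration of what the semantic versioning guarantee means in practice, the sketch below checks whether moving from one version to another stays within the same major version. The function names and the strict three-part MAJOR.MINOR.PATCH format are assumptions for illustration; this is not part of the Kite API.

```python
def parse_version(text):
    """Split a 'MAJOR.MINOR.PATCH' string into a tuple of integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def is_compatible_upgrade(current, candidate):
    """True if candidate is the same major version and not older than current."""
    cur = parse_version(current)
    cand = parse_version(candidate)
    return cand[0] == cur[0] and cand >= cur


if __name__ == "__main__":
    print(is_compatible_upgrade("1.0.0", "1.1.0"))
    print(is_compatible_upgrade("1.2.0", "2.0.0"))
```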
- Added Spark action which lets you run Spark applications from Oozie workflows. See the Oozie documentation for more details.
- The Hive2 action now collects and reports Hadoop Job IDs for MapReduce jobs launched by Hive Server 2.
- The launcher job now uses YARN uber mode for all but the Shell action; this reduces the overhead (time and resources) of running these Oozie actions.
Apache Parquet (incubating)
- Parquet memory manager now changes the row group size if the current size is expected to cause out-of-memory (OOM) errors because too many files are open. This causes a WARN message to be printed in the logs. A new setting, parquet.memory.pool.ratio, controls the percentage of the JVM's heap memory Parquet attempts to use.
- To improve job startup time, footers are no longer read by default for MapReduce jobs (PARQUET-139).
To revert to the old behavior (ParquetFileReader reads in all the files to obtain the footers), set parquet.task.side.metadata to false in the job configuration.
- The Parquet Avro object model can now read lists and maps written by Hive, Avro, and Thrift (similar capabilities were added to Hive in CDH 5.3). This compatibility fix does not change behavior. The extra record layer wrapping the list elements when Avro reads lists written by Hive can now be removed; to do this, set the expected Avro schema or set parquet.avro.add-list-element-records to false.
- Avro's map representation now writes null values correctly.
- The Parquet Thrift object model can now read data written by other object models (such as Hive, Impala, or Parquet-Avro), given a Thrift class for the data; compile a Thrift definition into an object, and supply it when creating the job.
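The memory manager behavior in the first item of the list above can be sketched as proportional scaling: when the row group sizes requested by all open writers exceed the memory pool, each one is shrunk by the same factor. This is a simplified model under stated assumptions: the function name is hypothetical, and the 0.95 default used here for parquet.memory.pool.ratio is an assumption rather than a value given in this document.

```python
def scale_row_groups(requested_bytes, heap_bytes, pool_ratio=0.95):
    """Shrink row group sizes proportionally so their total fits in the pool.

    requested_bytes: row group size requested by each open writer
    heap_bytes:      JVM heap size
    pool_ratio:      fraction of the heap Parquet may use (assumed default)
    """
    pool = heap_bytes * pool_ratio
    total = sum(requested_bytes)
    if total <= pool:
        # Everything fits; leave the requested sizes alone.
        return list(requested_bytes)
    scale = pool / total
    return [int(size * scale) for size in requested_bytes]


if __name__ == "__main__":
    mib = 1024 * 1024
    # Ten writers each asking for 128 MiB against a 1 GiB heap.
    print(scale_row_groups([128 * mib] * 10, 1024 * mib))
```

Lowering the ratio leaves more heap for the rest of the task at the cost of smaller row groups when many files are open.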
Solr metadata stored in ZooKeeper can now be protected by ZooKeeper ACLs. In a Kerberos-enabled environment, Solr metadata stored in ZooKeeper is owned by the solr user and cannot be modified by other users.
- The Solr principal name can be configured in Cloudera Manager. The default name is solr, although other names can be specified.
- Collection configuration information stored under the /solr/configs znode is not affected by this change. As a result, collection configuration behavior is unchanged.
Administrators who modify Solr ZooKeeper metadata through operations like solrctl init or solrctl cluster --put-solrxml must now supply solrctl with a JAAS configuration using the --jaas configuration parameter. The JAAS configuration must specify the principal, typically solr, that the solr process uses. See Solrctl Reference for more information.
End users, who typically do not need to modify Solr metadata, are unaffected by this change.
Lily HBase Indexer metadata stored in ZooKeeper can now be protected by ZooKeeper ACLs. In a Kerberos-enabled environment, Lily HBase Indexer metadata stored in ZooKeeper is owned by the Solr user and cannot be modified by other users.
End users, who typically do not manage the Lily HBase Indexer, are unaffected by this change.
- The Lily HBase Indexer supports restricting access using Sentry. For more information, see Sentry integration.
- Services included with Search for CDH 5.4.0, including Solr, Key-Value Store Indexer, and Flume, now support SSL.
- The Spark Indexer and the Lily HBase Batch Indexer support delegation tokens for mapper-only jobs. For more information, see Spark Indexing Reference (CDH 5.2 or later only) and HBaseMapReduceIndexerTool.
- Search for CDH 5.4.0 implements SOLR-5746, which improves solr.xml file parsing. Error checking for duplicated options or unknown option names was added. These checks can help identify mistakes made during manual edits of the solr.xml file. User-modified solr.xml files may cause errors on startup due to these parsing improvements.
By default, CloudSolrServer now uses multiple threads to add documents.
Note: Due to multithreading, if document addition is interrupted by an exception, some documents, in addition to the one being added when the failure occurred, may be added.
To get the old, single-threaded behavior, set parallel updates to false on the CloudSolrServer instance.
Related JIRA: SOLR-4816.
Updates are routed directly to the correct shard leader, eliminating document routing at the server. This allows for near-linear indexing throughput scalability. Document routing requires that the SolrJ client know each document’s unique identifier. The unique identifiers allow the client to route the update directly to the correct shard. For additional information, see Shards and Indexing Data in SolrCloud.
Related JIRA: SOLR-4816.
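The client-side routing described above can be sketched as follows: the client hashes each document's unique identifier, maps the hash to a shard, and batches documents per shard so each batch can go straight to that shard's leader. This is an illustration of the idea only. The `zlib.crc32` hash and the modulo mapping are stand-ins chosen for simplicity, not the hash function or hash-range assignment that SolrCloud uses, and the function names are hypothetical.

```python
import zlib


def shard_for(doc_id, num_shards):
    """Map a document ID to a shard index (stand-in hash, not Solr's)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def group_by_shard(doc_ids, num_shards):
    """Batch document IDs by target shard so each batch goes to one leader."""
    batches = {}
    for doc_id in doc_ids:
        batches.setdefault(shard_for(doc_id, num_shards), []).append(doc_id)
    return batches


if __name__ == "__main__":
    ids = ["doc-%d" % i for i in range(10)]
    for shard, batch in sorted(group_by_shard(ids, 3).items()):
        print(shard, batch)
```

Because the shard is computed from the identifier alone, a document without a known unique identifier cannot be routed by the client, which is why the identifier requirement exists.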
- The loadSolr morphline command supports nested documents. For more information, see Morphlines Reference Guide.
- Navigator can be used to audit Cloudera Search activity. For more information on the Solr operations that can be audited, see Audit Events and Audit Reports.
Search for CDH 5.4 supports logging queries before they are executed. This allows you to identify queries that could increase resource consumption, and to improve schemas or filters to meet your performance requirements. To enable this feature, set the SolrCore and SolrCore.Request log level to DEBUG.
Related JIRA: SOLR-6919
UniqFieldsUpdateProcessorFactory, which Solr Server implements, has been improved to support all of the FieldMutatingUpdateProcessorFactory selector options. The <lst named="fields"> init param option is deprecated. Replace this option with <arr name="fieldName">.
If the <lst named="fields"> init param option is used, Solr logs a warning. Related JIRA: SOLR-4249.
Configuration information was previously available using FieldMutatingUpdateProcessorFactory (oneOrMany or getBooleanArg). Those methods are now deprecated. The methods have been moved to NamedList and renamed to removeConfigArgs and removeBooleanArg, respectively.
If the oneOrMany or getBooleanArg methods of FieldMutatingUpdateProcessorFactory are used, Solr logs a warning. Related JIRA: SOLR-5264.
CDH 5.4.0 Spark is rebased on Apache Spark 1.3.0 and provides the following new capabilities:
- Spark Streaming WAL (write-ahead log) on HDFS, preventing any data loss on driver failure
- Spark external shuffle service
- Improvements in automatically setting CDH classpaths for Avro, Parquet, Flume, and Hive
- Improvements in the collection of task metrics
- Kafka connector for Spark Streaming to avoid the need for the HDFS WAL
The following is not yet supported in a production environment because of its immaturity:
- Spark SQL (which now includes DataFrames)
- Sqoop 2:
- CDH 5.4.0 implements Sqoop 2 version 1.99.5.
- Sqoop 2 supports Kerberos as of CDH 5.4.0.
- Sqoop 2 supports PostgreSQL as the repository database.