Long-term component architecture
As the main curator of open standards in Hadoop, Cloudera has a track record of bringing new open source solutions into its platform (such as Apache Spark™, Apache HBase, and Apache Parquet) that are eventually adopted by the community at large. Because these components are standards, you can build long-term architecture on them with confidence.
- System Requirements
- What's New
- Supported Operating Systems
- Supported Databases
- Supported JDK Versions
- Supported Browsers
- Supported Internet Protocol
- Supported Transport Layer Security Versions
Supported Databases
Please see Cloudera Manager Supported Databases for a full list of supported databases for each version of Cloudera Manager.
Cloudera Manager and CDH come packaged with an embedded PostgreSQL database, but it is recommended that you configure your cluster with custom external databases, especially in production.
In most cases (but not all), Cloudera supports versions of MariaDB, MySQL and PostgreSQL that are native to each supported Linux distribution.
After installing a database, upgrade to the latest patch and apply appropriate updates. Available updates may be specific to the operating system on which it is installed.
- Use UTF8 encoding for all custom databases.
- Cloudera Manager installation fails if GTID-based replication is enabled in MySQL.
- Hue requires the default MySQL/MariaDB version (if used) of the operating system on which it is installed. See Hue Databases.
- Both the Community and Enterprise versions of MySQL are supported, as well as MySQL configured by the AWS RDS service.
Important: When you restart processes, the configuration for each of the services is redeployed using information saved in the Cloudera Manager database. If this information is not available, your cluster does not start or function correctly. You must schedule and maintain regular backups of the Cloudera Manager database to recover the cluster in the event of the loss of this database.
Supported JDK Versions
Unless specifically excluded, support for a minor JDK release begins from the Cloudera major release in which support for the major JDK release was added. For example, 8u102 was released in time for C5.9 but is actually supported from C5.3 because that is when support for JDK 1.8 was added. Cloudera excludes or removes support for select Java updates when security is jeopardized.
Running CDH nodes within the same cluster on different JDK releases is not supported. The JDK release must match on every node in the cluster, down to the patch level.
- All nodes in your cluster must run the same Oracle JDK version.
- All services must be deployed on the same Oracle JDK version.
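The same-version rule above can be checked mechanically once you have collected the JDK version string from each host. The following is a minimal sketch, not a Cloudera tool: the host names and the way the inventory is gathered are hypothetical, and the function only compares the strings it is given.

```python
from collections import defaultdict


def find_jdk_mismatches(host_versions):
    """Group hosts by the JDK version string they report.

    Returns an empty dict when every host reports the same version,
    otherwise a dict mapping each version to the sorted hosts running it.
    """
    by_version = defaultdict(list)
    for host, version in host_versions.items():
        by_version[version.strip()].append(host)
    if len(by_version) <= 1:
        return {}
    return {version: sorted(hosts) for version, hosts in by_version.items()}


# Hypothetical inventory; in practice these strings would come from
# running `java -version` on each node.
inventory = {
    "node1.example.com": "1.8.0_121",
    "node2.example.com": "1.8.0_121",
    "node3.example.com": "1.8.0_102",
}
mismatches = find_jdk_mismatches(inventory)
for version, hosts in sorted(mismatches.items()):
    print(version, hosts)
```

An empty result means the cluster satisfies the patch-level match; any other result lists the hosts to bring in line.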
All JDK 7 updates, from the minimum required version, are supported in CM/CDH 5.0 and higher unless specifically excluded. Updates above the minimum that are not listed are supported but not tested.
The Cloudera Manager repository is packaged with Oracle JDK 1.7.0_67 (for example) and can be automatically installed during a new installation or an upgrade.
JDK 7 updates that are supported and tested
|JDK 7||Supported in all C5.x|
|1.7u80||Recommended / Latest version supported|
All JDK 8 updates, from the minimum required version, are supported in CM/CDH 5.3 and higher unless specifically excluded. Updates above the minimum that are not listed are supported but not tested.
Warning: JDK 8u40, 8u45, and 8u60 are excluded from support due to a security risk: HTTP authentication can fail for web-based UI components such as HDFS, YARN, SOLR, and Oozie.
Important: JDK 8u75 is supported but has a Known Issue: Oozie Web Console returns 500 error when Oozie server runs on JDK 8u75 or higher.
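As a rough illustration, the exclusions and the known issue stated above can be encoded as a small lookup. This is a sketch under stated assumptions: the function names are hypothetical, the accepted version spellings (`1.8.0_121`, `8u121`, `1.8u121`) are a convenience, and the check does not verify the minimum required update, which is not given here.

```python
import re

# Taken from the warning and known issue stated in this section.
EXCLUDED_JDK8_UPDATES = {40, 45, 60}
OOZIE_ISSUE_FROM_UPDATE = 75


def parse_jdk_version(version):
    """Parse '1.8.0_121', '8u121' or '1.8u121' into (major, update)."""
    match = (re.fullmatch(r"1\.(\d+)\.0_(\d+)", version)
             or re.fullmatch(r"(?:1\.)?(\d+)u(\d+)", version))
    if match is None:
        raise ValueError("unrecognized JDK version: %r" % version)
    return int(match.group(1)), int(match.group(2))


def jdk8_notes(version):
    """Return the notes from this section that apply to a JDK 8 update."""
    major, update = parse_jdk_version(version)
    if major != 8:
        raise ValueError("this sketch only covers JDK 8")
    notes = []
    if update in EXCLUDED_JDK8_UPDATES:
        notes.append("excluded from support (HTTP authentication risk)")
    if update >= OOZIE_ISSUE_FROM_UPDATE:
        notes.append("known issue: Oozie Web Console returns 500 error")
    return notes


for candidate in ("8u45", "1.8.0_74", "1.8u121"):
    print(candidate, jdk8_notes(candidate))
```

An empty list only means that none of the notes above apply; it does not by itself confirm that the update is supported.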
JDK 8 updates that are supported and tested
|JDK 8||Supported in C5.3 and Higher|
|1.8u121||Recommended / Latest version supported|
Supported Browsers
- Chrome: Version history
- Firefox: Version history
- Internet Explorer: Version history
- Safari (Mac only): Version history
Hue can display in older, and other, browsers, but you might not have access to all of its features.
Important: To see all icons in the Hue Web UI, users with IE and HTTPS must add a Load Balancer.
Supported Internet Protocol
CDH requires IPv4. IPv6 is not supported.
See also Configuring Network Names.
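A quick pre-flight check for the IPv4 requirement might look like the following sketch. The host names and addresses are hypothetical, and the function only classifies address literals that you supply; it does not perform DNS lookups.

```python
import ipaddress


def non_ipv4_addresses(host_addresses):
    """Return {host: address} for entries that are not valid IPv4 literals."""
    rejected = {}
    for host, address in host_addresses.items():
        try:
            is_ipv4 = ipaddress.ip_address(address).version == 4
        except ValueError:
            # Not a parseable IP address at all.
            is_ipv4 = False
        if not is_ipv4:
            rejected[host] = address
    return rejected


# Hypothetical cluster inventory.
cluster = {
    "node1.example.com": "10.0.0.11",
    "node2.example.com": "fe80::1",
    "node3.example.com": "not-an-address",
}
print(non_ipv4_addresses(cluster))
```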
Multihoming CDH or Cloudera Manager is not supported outside specifically certified Cloudera partner appliances. Cloudera finds that current Hadoop architectures combined with modern network infrastructures and security practices remove the need for multihoming. Multihoming, however, is beneficial internally in appliance form factors to take advantage of high-bandwidth InfiniBand interconnects.
Although some subareas of the product may work with unsupported custom multihoming configurations, there are known issues with multihoming. In addition, unknown issues may arise because multihoming is not covered by our test matrix outside the Cloudera-certified partner appliances.
Supported Transport Layer Security Versions
The following components are supported by the indicated versions of Transport Layer Security (TLS):
Components Supported by TLS
|Component||Role||Name||Port||Version|
|Cloudera Manager||Cloudera Manager Server|| ||7182||TLS 1.2|
|Cloudera Manager||Cloudera Manager Server|| ||7183||TLS 1.2|
|Flume||Avro Source/Sink|| || ||TLS 1.2|
|Flume||Flume HTTP Source/Sink|| || ||TLS 1.2|
|HBase||Master||HBase Master Web UI Port||60010||TLS 1.2|
|HDFS||NameNode||Secure NameNode Web UI Port||50470||TLS 1.2|
|HDFS||Secondary NameNode||Secure Secondary NameNode Web UI Port||50495||TLS 1.2|
|HDFS||HttpFS||REST Port||14000||TLS 1.1, TLS 1.2|
|Hive||HiveServer2||HiveServer2 Port||10000||TLS 1.2|
|Hue||Hue Server||Hue HTTP Port||8888||TLS 1.2|
|Impala||Impala Daemon||Impala Daemon Beeswax Port||21000||TLS 1.2|
|Impala||Impala Daemon||Impala Daemon HiveServer2 Port||21050||TLS 1.2|
|Impala||Impala Daemon||Impala Daemon Backend Port||22000||TLS 1.2|
|Impala||Impala StateStore||StateStore Service Port||24000||TLS 1.2|
|Impala||Impala Daemon||Impala Daemon HTTP Server Port||25000||TLS 1.2|
|Impala||Impala StateStore||StateStore HTTP Server Port||25010||TLS 1.2|
|Impala||Impala Catalog Server||Catalog Server HTTP Server Port||25020||TLS 1.2|
|Impala||Impala Catalog Server||Catalog Server Service Port||26000||TLS 1.2|
|Oozie||Oozie Server||Oozie HTTPS Port||11443||TLS 1.1, TLS 1.2|
|Solr||Solr Server||Solr HTTP Port||8983||TLS 1.1, TLS 1.2|
|Solr||Solr Server||Solr HTTPS Port||8985||TLS 1.1, TLS 1.2|
|Spark||History Server|| ||18080||TLS 1.2|
|YARN||ResourceManager||ResourceManager Web Application HTTP Port||8090||TLS 1.2|
|YARN||JobHistory Server||MRv1 JobHistory Web Application HTTP Port||19890||TLS 1.2|
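To confirm that a given endpoint from the table actually negotiates TLS 1.2, a client can pin its context to that single version and attempt a handshake. The sketch below uses only the Python standard library (3.7 or later); the host name in the usage comment is hypothetical, and disabling certificate verification is suitable only for a protocol probe against a test cluster.

```python
import socket
import ssl


def tls12_only_context(verify=True):
    """Build a client context that will only negotiate TLS 1.2."""
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    context.maximum_version = ssl.TLSVersion.TLSv1_2
    if not verify:
        # check_hostname must be disabled before verify_mode is relaxed.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context


def negotiates_tls12(host, port, timeout=5.0, verify=True):
    """Return True if a TLS 1.2 handshake with host:port succeeds."""
    context = tls12_only_context(verify)
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=host) as tls:
                return tls.version() == "TLSv1.2"
    except OSError:
        # Covers refused connections, timeouts, and ssl.SSLError.
        return False


# Example (hypothetical host): negotiates_tls12("cm.example.com", 7183)
```

A False result does not distinguish between an unreachable port and a failed handshake; inspect the exception if you need that detail.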
What's New in CDH 5.12.x
Apache Hive / Hive-on-Spark
Support for Microsoft Azure Data Lake Store (ADLS) as a secondary filesystem for both Hive on MapReduce2 (YARN) and Hive-on-Spark. You can now use both Hive on MapReduce2 and Hive-on-Spark to read and write data stored on ADLS.
The Hive schematool is integrated with Cloudera Manager where you can use it to upgrade or validate the Hive metastore schema.
See Using the Hive Schema Tool for details.
HIVE-1575: Added support for JSON arrays at the root level by the get_json_object function. For example:
SELECT get_json_object('[1,2,3]', '$')...
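To make the idea of a root-level array concrete, here is a rough Python analogy, not Hive's implementation. It evaluates a tiny, assumed subset of path syntax (`$`, `$.key`, `$[i]`) and returns Python objects, whereas Hive's function returns strings; the function name is hypothetical.

```python
import json
import re


def get_json_object_sketch(json_text, path):
    """Evaluate '$', '$.key', '$[i]' and chains of those against JSON text.

    Returns None for invalid JSON or a path that does not resolve.
    Unrecognized path fragments are ignored in this sketch.
    """
    try:
        node = json.loads(json_text)
    except ValueError:
        return None
    if not path.startswith("$"):
        return None
    for key, index in re.findall(r"\.(\w+)|\[(\d+)\]", path[1:]):
        try:
            node = node[key] if key else node[int(index)]
        except (KeyError, IndexError, TypeError):
            return None
    return node


# The document root is itself an array rather than an object.
print(get_json_object_sketch("[1,2,3]", "$"))
print(get_json_object_sketch("[1,2,3]", "$[1]"))
print(get_json_object_sketch('[{"a": 7}]', "$[0].a"))
```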
Hue 4 is out and jam-packed with great new features.
New Layout in Hue 4!
- Apps are consolidated under the blue button; set your favorite as the default landing page
- Top search bar lets you search for saved queries and other data
- Left and right assist panels let you search and filter schema objects
- Cursor position determines which of multiple queries to run
- New Pig editor, Job Designer, and Job Browser
- Access the old Hue 3 layout under the user drop-down, or remove "hue" from the URL.
Load Balancer Added by Default
- During a new installation of CDH/Hue, one Load Balancer is automatically promoted to ensure optimal performance. It can reduce the Hue server load by up to 90%. In existing clusters, administrators are prompted to add a load balancer role and users are then guided on how to enable it. See the Cloudera Blog on Automatic HA.
Test LDAP Configuration
- Verify your LDAP configuration on the fly with this new feature in Cloudera Manager under Hue > Actions > Test LDAP Configuration. See Authenticate Hue with LDAP.
Navigator Optimizer Integrated (Phase 1)
- With Navigator Optimizer enabled in Hue, popular tables, columns, joins, and filters are displayed in the autocompleter. Risky statements, such as missing filters on partitioned tables, trigger an alert.
Navigator Search & Tag Enabled by Default
- With Navigator enabled in Hue, you can search and tag metadata. This feature is now enabled by default (with Cloudera Navigator installed). See How to Enable and Use Navigator in Hue.
Other Cool Features
- You can create partitioned tables from files
- Impala metadata is refreshed automatically
- SQL autocompleter handles more advanced corner cases
- Remote Load balancer works with SSL
- Query history is paginated!
Apache Impala (incubating)
The following are some of the most significant new Impala features in this release:
Impala can now read and write data stored on the Microsoft Azure Data Lake Store (ADLS).
New built-in functions:
A new string function, replace(), which is faster than regexp_replace() for simple string substitutions. See Impala String Functions for details.
A new conditional function, nvl2(), which offers more flexibility than the nvl() function. It lets you return one value for NOT NULL arguments, and a different value for NULL arguments. See Impala Conditional Functions for details.
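The semantics of the two new functions can be mirrored in a few lines of Python. This is an analogy only, with Python's `None` standing in for SQL NULL; it says nothing about Impala's internals or performance.

```python
import re


def nvl(value, if_null):
    """Return value unless it is NULL (None), else the fallback."""
    return value if value is not None else if_null


def nvl2(value, if_not_null, if_null):
    """Return one result for NOT NULL input and another for NULL input."""
    return if_not_null if value is not None else if_null


# Literal substitution versus the regular-expression route.
text = "2017-07-13"
literal = text.replace("-", "/")    # plain substring replacement
pattern = re.sub(r"-", "/", text)   # same result through a regex engine

print(literal, pattern)
print(nvl(None, "missing"), nvl2(None, "present", "missing"))
print(nvl("x", "missing"), nvl2("x", "present", "missing"))
```

Note how nvl() can only substitute a fallback for NULL, while nvl2() also lets you transform the NOT NULL case.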
New syntax, REFRESH FUNCTIONS db_name, lets Impala recognize newly added functions, such as UDFs created through Hive. Impala scans the metadata for a specified database to locate the new functions, which is faster and more convenient than doing a full INVALIDATE METADATA operation.
Startup flags for the impalad daemon, is_executor and is_coordinator, let you divide the work on a large, busy cluster between a small number of hosts acting as query coordinators, and a larger number of hosts acting as query executors. By default, each host can act in both roles, potentially introducing bottlenecks during heavily concurrent workloads. See Controlling which Hosts are Coordinators and Executors for details.
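One way to plan such a split is to generate the flag pair for each host from a list of dedicated coordinators. The sketch below is illustrative: the host names are hypothetical, the `--name=value` spelling of the flags is an assumption about how you pass them to impalad, and the helper simply makes every non-coordinator host executor-only.

```python
def impalad_role_flags(hosts, coordinators):
    """Map each host to an is_coordinator / is_executor flag string.

    Hosts listed in `coordinators` become coordinator-only; all other
    hosts become executor-only.
    """
    coordinator_set = set(coordinators)
    unknown = coordinator_set - set(hosts)
    if unknown:
        raise ValueError("coordinators not in host list: %s" % sorted(unknown))
    flags = {}
    for host in hosts:
        is_coordinator = host in coordinator_set
        flags[host] = "--is_coordinator={} --is_executor={}".format(
            str(is_coordinator).lower(), str(not is_coordinator).lower())
    return flags


# Hypothetical four-node cluster with one dedicated coordinator.
hosts = ["impala1", "impala2", "impala3", "impala4"]
plan = impalad_role_flags(hosts, coordinators=["impala1"])
for host in hosts:
    print(host, plan[host])
```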
A new query option, DEFAULT_JOIN_DISTRIBUTION_MODE, lets you change the default assumption about how join queries should handle tables with no statistics. This can help to avoid out-of-memory conditions for join queries, without manual tuning to add the /* +SHUFFLE */ hint for queries on large tables with missing statistics.
The SORT BY clause lets you create Parquet files with more efficient compression and smaller ranges of values for specified columns, allowing Impala to apply optimizations to skip reading data from Parquet files that do not contain any values that match equality and range operators in the WHERE clause. See CREATE TABLE Statement for details.
The max_audit_event_log_files setting lets you perform log rotation for the audit event log files, similar to the rotation for regular Impala log files.
The ALTER TABLE statement can specify more attributes for a Kudu table with the ADD COLUMNS clause. Now you can specify [NOT] NULL, ENCODING, COMPRESSION, DEFAULT, and BLOCK_SIZE. See ALTER TABLE Statement for details.
The TIMESTAMP type is now available for Kudu tables.
Note: See Handling Date, Time, or Timestamp Data with Kudu for information about the tradeoffs between performance and convenience when using this data type. For high-performance applications, you might continue to use the BIGINT type to represent date/time values.
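The effect of sorting on file skipping can be seen in a toy model: split a column into fixed-size "files", record each file's minimum and maximum, and count how many files a range predicate must read. This is a simplified illustration of min/max pruning with made-up data, not Impala's actual Parquet reader.

```python
def file_ranges(values, rows_per_file):
    """Split values into consecutive files and return each file's (min, max)."""
    ranges = []
    for start in range(0, len(values), rows_per_file):
        chunk = values[start:start + rows_per_file]
        ranges.append((min(chunk), max(chunk)))
    return ranges


def files_to_read(ranges, low, high):
    """Count files whose [min, max] range overlaps the predicate [low, high]."""
    return sum(1 for file_min, file_max in ranges
               if file_max >= low and file_min <= high)


# A scrambled permutation of 0..99 (37 and 100 are coprime).
values = [(i * 37) % 100 for i in range(100)]

unsorted_ranges = file_ranges(values, rows_per_file=10)
sorted_ranges = file_ranges(sorted(values), rows_per_file=10)

# Predicate: WHERE x BETWEEN 42 AND 44
print("unsorted files read:", files_to_read(unsorted_ranges, 42, 44))
print("sorted files read:", files_to_read(sorted_ranges, 42, 44))
```

With sorted input each file covers a narrow, non-overlapping range, so most files can be ruled out from their statistics alone.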
The INSERT and CREATE TABLE AS SELECT statements are more efficient when writing to Kudu tables. Formerly, the overhead for the write operations could result in timeouts when writing large numbers of rows in a single operation.
- Apache HBase now has ADLS support and recommendations for Azure deployment.
- Outside of cloud, HBase now has support for long-lived Spark applications via token renewal.
Spark can read and write data on the Azure Data Lake Store (ADLS) cloud service. See Accessing Data Stored in Azure Data Lake Store (ADLS) through Spark for details.