The recommended tool for installing Cloudera Enterprise
This download installs Cloudera Enterprise or Cloudera Express.
Cloudera Enterprise requires a license; however, when installing Cloudera Express you will have the option to unlock Cloudera Enterprise features for a free 60-day trial.
Once the trial has concluded, the Cloudera Enterprise features will be disabled until you obtain and upload a license.
- System Requirements
- What's New
- Supported Operating Systems
- Supported JDK Versions
- Supported Browsers
- Supported Databases
- Supported CDH and Managed Service Versions
- Supported Transport Layer Security Versions
- Resource Requirements
- Networking and Security Requirements
Supported Operating Systems
Supported JDK Versions
The version of Oracle JDK supported by Cloudera Manager depends on the version of CDH being managed. For more information see CDH and Cloudera Manager Supported JDK Versions.
The Cloudera Manager repository is packaged with Oracle JDK 1.7.0_67 and can be automatically installed during a new installation or an upgrade. If you prefer to install the JDK yourself, follow the instructions in Java Development Kit Installation.
Supported Browsers
The Cloudera Manager Admin Console, which you use to install, configure, manage, and monitor services, supports the latest version of the following browsers:
- Mozilla Firefox
- Google Chrome
- Internet Explorer
Supported Databases
See Cloudera Manager Supported Databases for a full list of supported databases for each version of Cloudera Manager.
Cloudera Manager and CDH come packaged with an embedded PostgreSQL database, but it is recommended that you configure your cluster with custom external databases, especially in production.
In most cases (but not all), Cloudera supports versions of MariaDB, MySQL and PostgreSQL that are native to each supported Linux distribution.
After installing a database, upgrade to the latest patch and apply appropriate updates. Available updates may be specific to the operating system on which it is installed.
- Use UTF8 encoding for all custom databases.
- Cloudera Manager installation fails if GTID-based replication is enabled in MySQL.
- Hue requires the default MySQL/MariaDB version (if used) of the operating system on which it is installed. See Hue Databases.
- Both the Community and Enterprise versions of MySQL are supported, as well as MySQL configured by the AWS RDS service.
Important: When you restart processes, the configuration for each of the services is redeployed using information saved in the Cloudera Manager database. If this information is not available, your cluster does not start or function correctly. You must schedule and maintain regular backups of the Cloudera Manager database to recover the cluster in the event of the loss of this database.
Supported CDH and Managed Service Versions
The following versions of CDH and managed services are supported:
Warning: Cloudera Manager 5 does not support CDH 3 and you cannot upgrade Cloudera Manager 4 to Cloudera Manager 5 if you have a cluster running CDH 3. Therefore, to upgrade CDH 3 clusters to CDH 4 using Cloudera Manager, you must use Cloudera Manager 4.
- CDH 4 and CDH 5. The latest released versions of CDH 4 and CDH 5 are strongly recommended. For information on CDH 4 requirements, see CDH 4 Requirements and Supported Versions. For information on CDH 5 requirements, see CDH 5 Requirements and Supported Versions.
- Cloudera Impala - Included with CDH 5. With CDH 4, Cloudera Impala 1.2.1 is supported on CDH 4.1.0 or higher. For more information on Impala requirements with CDH 4, see Impala Requirements.
- Cloudera Search - Included with CDH 5. With CDH 4, Cloudera Search 1.2.0 is supported on CDH 4.6.0. For more information on Cloudera Search requirements with CDH 4, see Cloudera Search Requirements.
- Apache Spark - 0.9.0 or higher with CDH 4.4.0 or higher.
- Apache Accumulo - 1.4.3 with CDH 4.3.0, 1.4.4 with CDH 4.5.0, and 1.6.0 with CDH 4.6.0.
For more information, see the Product Compatibility Matrix.
Supported Transport Layer Security Versions
Resource Requirements
Cloudera Manager requires the following resources:
- Disk Space
- Cloudera Manager Server
- 5 GB on the partition hosting /var.
- 500 MB on the partition hosting /usr.
- For parcels, the space required depends on the number of parcels you download to the Cloudera Manager Server and distribute to Agent hosts. You can download multiple parcels of the same product, of different versions and different builds. If you are managing multiple clusters, only one parcel of a product/version/build/distribution is downloaded on the Cloudera Manager Server—not one per cluster. In the local parcel repository on the Cloudera Manager Server, the approximate sizes of the various parcels are as follows:
- CDH 5 (which includes Impala and Search) - 1.5 GB per parcel (packed), 2 GB per parcel (unpacked)
- Impala - 200 MB per parcel
- Cloudera Search - 400 MB per parcel
- Cloudera Management Service - The Host Monitor and Service Monitor databases are stored on the partition hosting /var. Ensure that you have at least 20 GB available on this partition.
- Agents - On Agent hosts, each unpacked parcel requires about three times the space of the downloaded parcel on the Cloudera Manager Server. By default, unpacked parcels are located in /opt/cloudera/parcels.
- RAM - 4 GB is recommended for most cases and is required when using Oracle databases. 2 GB might be sufficient for non-Oracle deployments with fewer than 100 hosts. However, to run the Cloudera Manager Server on a machine with 2 GB of RAM, you must tune down its maximum heap size (by modifying -Xmx in /etc/default/cloudera-scm-server). Otherwise the kernel might kill the Server for consuming too much RAM.
- Python - Cloudera Manager requires Python 2.4 or higher and is compatible with every later Python 2.x release, but it does not support Python 3.0 or higher. Hue in CDH 5 and package installs of CDH 5 require Python 2.6 or 2.7. All supported operating systems include Python 2.4 or higher.
- Perl - Cloudera Manager requires Perl.
- python-psycopg2 package - Cloudera Manager 5.8 and higher has a dependency on the package python-psycopg2. Any machine that runs the Cloudera Manager agent requires the package. This package is not available in standard SLES 11 and SLES 12 repositories. You need to add the repository for this package or install it manually before you install or upgrade Cloudera Manager. Add the repository from one of the following URLs:
- SLES 11 SP4: http://download.opensuse.org/repositories/devel:/languages:/python/SLE_11_SP4/devel:languages:python.repo
- SLES 12 SP2: http://download.opensuse.org/repositories/devel:/languages:/python/SLE_12_SP2/devel:languages:python.repo
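The disk-space figures above can be turned into a quick preflight check. The following is a minimal sketch under stated assumptions, not a Cloudera tool: the names `shortfalls`, `agent_parcel_space`, and `measure` are hypothetical, 1 GB is taken to mean 1024^3 bytes, and the two /var figures (5 GB for the Server, 20 GB for the Management Service) are summed on the assumption that both run on the same host.

```python
import shutil

GB = 1024 ** 3  # assumption: binary gigabytes
MB = 1024 ** 2

# Minimum free space per mount point, taken from the requirements above.
# Assumption: the Cloudera Manager Server and the Cloudera Management
# Service share one host, so their /var figures (5 GB + 20 GB) are summed.
REQUIRED_FREE = {"/var": 5 * GB + 20 * GB, "/usr": 500 * MB}


def shortfalls(free_bytes, required=REQUIRED_FREE):
    """Return {mount: missing_bytes} for every mount below its threshold."""
    return {
        mount: need - free_bytes.get(mount, 0)
        for mount, need in required.items()
        if free_bytes.get(mount, 0) < need
    }


def agent_parcel_space(downloaded_bytes):
    """Estimate Agent-side space: about three times the downloaded parcel."""
    return 3 * downloaded_bytes


def measure(mounts=REQUIRED_FREE):
    """Read the current free space for each mount with shutil.disk_usage."""
    return {mount: shutil.disk_usage(mount).free for mount in mounts}
```

On a candidate host, `shortfalls(measure())` returns an empty dictionary when both partitions have enough free space. Parcel storage on the Server is not included because it depends on how many parcels you download.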
Networking and Security Requirements
The hosts in a Cloudera Manager deployment must satisfy the following networking and security requirements:
- Networking Protocols Support
CDH requires IPv4. IPv6 is not supported and must be disabled.
See also Configuring Network Names.
- Multihoming Support
Multihoming CDH or Cloudera Manager is not supported outside specifically certified Cloudera partner appliances. Cloudera finds that current Hadoop architectures combined with modern network infrastructures and security practices remove the need for multihoming. Multihoming, however, is beneficial internally in appliance form factors to take advantage of high-bandwidth InfiniBand interconnects.
Although some subareas of the product may work with unsupported custom multihoming configurations, there are known issues with multihoming. In addition, unknown issues may arise because multihoming is not covered by our test matrix outside the Cloudera-certified partner appliances.
- Cluster hosts must have a working network name resolution system and a correctly formatted /etc/hosts file. All cluster hosts must have properly configured forward and reverse host resolution through DNS. The /etc/hosts files must:
- Contain consistent information about hostnames and IP addresses across all hosts
- Not contain uppercase hostnames
- Not contain duplicate IP addresses
Cluster hosts must not use aliases, either in /etc/hosts or in configuring DNS. A properly formatted /etc/hosts file should be similar to the following example:
127.0.0.1 localhost.localdomain localhost
192.168.1.1 cluster-01.example.com cluster-01
192.168.1.2 cluster-02.example.com cluster-02
192.168.1.3 cluster-03.example.com cluster-03
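Two of the /etc/hosts rules above (no uppercase hostnames, no duplicate IP addresses) are easy to check mechanically. The following is a minimal sketch, not a Cloudera utility: `check_hosts` is a hypothetical helper that works on the text of a single file, so it does not verify consistency across hosts, aliases, or DNS resolution.

```python
def check_hosts(text):
    """Return a list of problems found in /etc/hosts-style text."""
    problems = []
    seen_ips = set()
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        ip, *names = line.split()
        if ip in seen_ips:
            problems.append(f"line {lineno}: duplicate IP address {ip}")
        seen_ips.add(ip)
        for name in names:
            if name != name.lower():
                problems.append(f"line {lineno}: uppercase hostname {name}")
    return problems
```

To check a live host, pass the file contents, for example `check_hosts(open("/etc/hosts").read())`. An empty list means neither rule was violated in that file.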
- In most cases, the Cloudera Manager Server must have SSH access to the cluster hosts when you run the installation or upgrade wizard. You must log in using a root account or an account that has password-less sudo permission. For authentication during the installation and upgrade procedures, you must either enter the password or upload a public and private key pair for the root or sudo user account. If you want to use a public and private key pair, the public key must be installed on the cluster hosts before you use Cloudera Manager.
Cloudera Manager uses SSH only during the initial install or upgrade. Once the cluster is set up, you can disable root SSH access or change the root password. Cloudera Manager does not save SSH credentials, and all credential information is discarded when the installation is complete.
- If single user mode is not enabled, the Cloudera Manager Agent runs as root so that it can make sure the required directories are created and that processes and files are owned by the appropriate user (for example, the hdfs and mapred users).
- No blocking is done by Security-Enhanced Linux (SELinux). Note: Cloudera Enterprise is supported on platforms with Security-Enhanced Linux (SELinux) enabled. However, Cloudera does not support use of SELinux with Cloudera Navigator. Cloudera is not responsible for policy support nor policy enforcement. If you experience issues with SELinux, contact your OS provider.
- No blocking by iptables or firewalls; port 7180 must be open because it is used to access Cloudera Manager after installation. Cloudera Manager communicates using specific ports, which must be open.
- For RHEL and CentOS, the /etc/sysconfig/network file on each host must contain the hostname you have just set (or verified) for that host.
- Cloudera Manager and CDH use several user accounts and groups to complete their tasks. The set of user accounts and groups varies according to the components you choose to install. Do not delete these accounts or groups and do not modify their permissions and rights. Ensure that no existing systems prevent these accounts and groups from functioning. For example, if you have scripts that delete user accounts not in a whitelist, add these accounts to the list of permitted accounts. Cloudera Manager, CDH, and managed services create and use the following accounts and groups:
Users and Groups
|Component (Version)||Unix User ID||Groups||Notes|
|Cloudera Manager (all versions)||cloudera-scm||cloudera-scm||Cloudera Manager processes such as the Cloudera Manager Server and the monitoring roles run as this user. The Cloudera Manager keytab file must be named cmf.keytab because that name is hard-coded in Cloudera Manager. Note: Applicable to clusters managed by Cloudera Manager only.|
|Apache Accumulo (Accumulo 1.4.3 and higher)||accumulo||accumulo||Accumulo processes run as this user.|
|Apache Avro||No special users.|
|Apache Flume (CDH 4, CDH 5)||flume||flume||The sink that writes to HDFS as this user must have write privileges.|
|Apache HBase (CDH 4, CDH 5)||hbase||hbase||The Master and the RegionServer processes run as this user.|
|HDFS (CDH 4, CDH 5)||hdfs||hdfs, hadoop||The NameNode and DataNodes run as this user, and the HDFS root directory as well as the directories used for edit logs should be owned by it.|
|Apache Hive (CDH 4, CDH 5)||hive||hive||The HiveServer2 process and the Hive Metastore processes run as this user. A user must be defined for Hive access to its Metastore DB (for example, MySQL or Postgres) but it can be any identifier and does not correspond to a Unix uid. This is javax.jdo.option.ConnectionUserName in hive-site.xml.|
|Apache HCatalog (CDH 4.2 and higher, CDH 5)||hive||hive||The WebHCat service (for REST access to Hive functionality) runs as the hive user.|
|HttpFS (CDH 4, CDH 5)||httpfs||httpfs||The HttpFS service runs as this user. See HttpFS Security Configuration for instructions on how to generate the merged httpfs-http.keytab file.|
|Hue (CDH 4, CDH 5)||hue||hue||Hue services run as this user.|
|Hue Load Balancer (Cloudera Manager 5.5 and higher)||apache||apache||The Hue Load balancer has a dependency on the apache2 package that uses the apache user name. Cloudera Manager does not run processes using this user ID.|
|Cloudera Impala (CDH 4.1 and higher, CDH 5)||impala||impala, hive||Impala services run as this user.|
|Apache Kafka (Cloudera Distribution of Kafka 1.2.0)||kafka||kafka||Kafka services run as this user.|
|Java KeyStore KMS (CDH 5.2.1 and higher)||kms||kms||The Java KeyStore KMS service runs as this user.|
|Key Trustee KMS (CDH 5.3 and higher)||kms||kms||The Key Trustee KMS service runs as this user.|
|Key Trustee Server (CDH 5.4 and higher)||keytrustee||keytrustee||The Key Trustee Server service runs as this user.|
|Kudu||kudu||kudu||Kudu services run as this user.|
|Llama (CDH 5)||llama||llama||Llama runs as this user.|
|Apache Mahout||No special users.|
|MapReduce (CDH 4, CDH 5)||mapred||mapred, hadoop||Without Kerberos, the JobTracker and tasks run as this user. The LinuxTaskController binary is owned by this user for Kerberos.|
|Apache Oozie (CDH 4, CDH 5)||oozie||oozie||The Oozie service runs as this user.|
|Parquet||No special users.|
|Apache Pig||No special users.|
|Cloudera Search (CDH 4.3 and higher, CDH 5)||solr||solr||The Solr processes run as this user.|
|Apache Spark (CDH 5)||spark||spark||The Spark History Server process runs as this user.|
|Apache Sentry (CDH 5.1 and higher)||sentry||sentry||The Sentry service runs as this user.|
|Apache Sqoop (CDH 4, CDH 5)||sqoop||sqoop||This user is only for the Sqoop1 Metastore, a configuration option that is not recommended.|
|Apache Sqoop2 (CDH 4.2 and higher, CDH 5)||sqoop2||sqoop, sqoop2||The Sqoop2 service runs as this user.|
|Apache Whirr||No special users.|
|YARN (CDH 4, CDH 5)||yarn||yarn, hadoop||Without Kerberos, all YARN services and applications run as this user. The LinuxContainerExecutor binary is owned by this user for Kerberos.|
|Apache ZooKeeper (CDH 4, CDH 5)||zookeeper||zookeeper||The ZooKeeper processes run as this user. It is not configurable.|
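If you run cleanup scripts that delete accounts missing from a list of permitted accounts, it can help to compare that list against the table above before installing. The following is a minimal sketch: `missing_from_allowlist` is a hypothetical helper, and the account names are copied from the Unix User ID column. Only the accounts for the components you actually install are relevant to your cluster.

```python
# Account names copied from the "Unix User ID" column of the table above.
CLOUDERA_ACCOUNTS = frozenset({
    "cloudera-scm", "accumulo", "flume", "hbase", "hdfs", "hive", "httpfs",
    "hue", "apache", "impala", "kafka", "kms", "keytrustee", "kudu", "llama",
    "mapred", "oozie", "solr", "spark", "sentry", "sqoop", "sqoop2", "yarn",
    "zookeeper",
})


def missing_from_allowlist(allowlist):
    """Return, sorted, the table accounts that the allowlist does not cover."""
    return sorted(CLOUDERA_ACCOUNTS - set(allowlist))
```

Any name returned by `missing_from_allowlist` is an account that such a script could delete, so add it to the permitted list if the corresponding component is in use.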
What's New
- Amazon S3
- Amazon S3 Consistency with Metadata Caching (S3Guard)
Data written to Amazon S3 buckets is subject to the "eventual consistency" guarantee provided by S3, which means that data written to S3 may not be immediately available for queries and listing operations. This can cause failures in multi-step ETL workflows, where data from a previous step is not available to the next step. To mitigate these consistency issues you can now configure metadata caching for data stored in Amazon S3 using S3Guard. Some workloads that access S3 may also see modest performance improvements with metadata caching. S3Guard requires that you provision a DynamoDB database from Amazon Web Services and configure S3Guard using the Cloudera Manager Admin Console or command-line tools. See Configuring and Managing S3Guard.
- Operating System Support
- SLES 12 SP2 Support
SLES 12 SP2 is supported in Cloudera Manager and CDH 5.11 and higher.
- Mixed Operating system support for gateway hosts running Cloudera Data Science Workbench
A Gateway host that is dedicated to running Cloudera Data Science Workbench can use RHEL/CentOS 7.2 even if the remaining hosts in your cluster are running any of the other supported operating systems. All hosts must run the same version of the Oracle JDK.
- Backup and Disaster Recovery
- Refreshing Impala metadata during replication.
You can now configure Hive/Impala replication jobs to run the INVALIDATE METADATA Impala statement in the destination cluster automatically at the end of the replication process, allowing newly replicated data to be immediately queried by Impala. See Refreshing Impala Metadata.
- Hive Replication to Amazon S3 now supported for regions that support only Signature Version 4 signing protocol
Replications from Hive or HBase to Amazon S3 are now supported for S3 regions that only support Amazon's Signature Version 4 signing protocol. You must add the fs.s3a.endpoint property to the Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml property and set its value to the Amazon S3 region. For example:
You can access this property in Cloudera Manager at Home > Configuration > Advanced Configuration Snippets.
- Peak Memory Usage Filter now tracked per container for YARN applications
Peak container memory usage is now tracked for YARN applications, and a new filter attribute, Used Memory Max, has been added for monitoring YARN applications.
- Improved Kerberos-Encryption-Type Handling by Cloudera Manager
Cloudera Manager validates the Kerberos encryption type as it is being entered into the Cloudera Manager Admin Console, and displays an error message if the type is not a valid MIT or Microsoft Active Directory (Kerberos) encryption type. Administrators can disable the feature when necessary—for example, if new encryption types added to Kerberos are ahead of the encryption types supported by Cloudera Manager (invalid encryption types fail, regardless of warning message display).
- Enabling SPNEGO authentication for Hue
Enabling the Hue Authentication Backend property (for SPNEGO) now automatically adds all necessary environments and Kerberos credentials. Previously, you needed to follow this procedure: Enabling SPNEGO as an Authentication Backend for Hue.
- New and Changed Configuration
- Auto-configuration of HBase when Kerberos is enabled
When Kerberos is enabled for the cluster, the value of the HBase configuration parameter HBase Thrift Authentication is automatically set to auth-conf.
- New HDFS NameNode configuration property for deleting the trash
A new HDFS NameNode property, Filesystem Trash Checkpoint Interval (fs.trash.checkpoint.interval), has been introduced with a default value of 1 hour. This property causes the NameNode to respect and enforce more accurately the HDFS trash deletion interval set with the Filesystem Trash Interval property (fs.trash.interval).
Without this property, the checkpoint interval matched the deletion interval, so many files in the HDFS trash were deleted only after twice the desired trash deletion interval had passed. If you want the older implicit behavior of retaining trash files for a longer time, consider raising the value of the Filesystem Trash Interval property. Changing this property causes all HDFS NameNode role instances to be marked stale, and therefore requires that you restart the HDFS NameNode role instances and their dependent services.
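The effect of the checkpoint interval on retention can be illustrated with a little arithmetic. The following is a simplified model, not the HDFS implementation: it assumes that the trash emptier runs once per checkpoint interval, that each run rolls newly deleted files into a checkpoint, and that a checkpoint is removed at the first run at which its age has reached the trash interval. The function name `worst_case_retention` and the 1440-minute trash interval used in the usage note are illustrative assumptions.

```python
import math


def worst_case_retention(trash_interval, checkpoint_interval):
    """Longest time (in minutes) a deleted file stays in trash in this model."""
    # A file deleted just after an emptier run waits one full checkpoint
    # interval before it is rolled into a checkpoint.
    wait_for_checkpoint = checkpoint_interval
    # The checkpoint is removed at the first later run at which its age
    # has reached the trash interval.
    runs_needed = math.ceil(trash_interval / checkpoint_interval)
    return wait_for_checkpoint + runs_needed * checkpoint_interval
```

With a trash interval of 1440 minutes, `worst_case_retention(1440, 1440)` gives 2880 minutes, twice the configured interval, which is the old behavior described above. With the new 60-minute checkpoint default, `worst_case_retention(1440, 60)` gives 1500 minutes.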
- New Auto Logout Timeout property for Hue
A new configuration property has been added for the Hue service. The Auto Logout Timeout property controls how long the Hue browser can remain idle before automatically logging out the user. Set the property to -1 to disable automatic logout. To configure the property, go to the Hue service, select the Configuration tab and search for the property.
- New performance tuning properties for Key Management Server (KMS)
The following new properties have been added for tuning the performance of the KMS service:
- KMS Accept Count
- KMS Handler Protocol
- KMS Acceptor Thread Count
- New API endpoint for refreshing parcel information
A new REST API endpoint has been added to refresh parcel information from both local and remote repositories. The endpoint URL is:
- New Metrics and Health Tests for Service Monitor and Host Monitor metric collection
A new metric, mgmt_aggregation_run_duration, has been added to the Service Monitor and Host Monitor metrics to indicate how much time it takes to store the metrics collected in the last minute. This metric can be used to determine whether more heap or non-heap memory is needed for these roles.
New health tests, Host Monitor Metrics Aggregation Run Duration Test and Service Monitor Metrics Aggregation Run Duration Test, have also been added to detect potential resource configuration issues with the Service Monitor and Host Monitor.
- New validation for YARN NodeManager log directory
Cloudera Manager now validates whether all YARN NodeManagers are storing logs in the same distributed file system directory so that no logs are missing from Job History Server. If NodeManagers have different configuration values, there will be a configuration error after upgrading Cloudera Manager to 5.11.
- New Dynamic Resource Pool option
Configuration of Nested User Pools (except existingSecondaryGroup) now includes a Create pool if it does not exist checkbox to indicate whether to create a sub-pool.
- Change to default fencing method for HDFS High Availability
The default fencing method used for HDFS HA is shell(true). Previously, the default was shell(./cloudera_manager_agent_fencer.py). If the cluster is configured with the previous default value, the upgrade process updates the fencing method to use shell(true). This change requires you to restart the HDFS service and any dependent services. If the cluster uses a custom fencing method, no change occurs.
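The upgrade rule for the fencing method can be summarized in a few lines. The following is a sketch of the rule as described above, not Cloudera Manager's upgrade code; `fencing_after_upgrade` is a hypothetical name.

```python
OLD_DEFAULT = "shell(./cloudera_manager_agent_fencer.py)"
NEW_DEFAULT = "shell(true)"


def fencing_after_upgrade(current):
    """Return the fencing method in effect after the upgrade.

    The previous default is replaced by the new default; any other
    (custom) value is left unchanged.
    """
    return NEW_DEFAULT if current == OLD_DEFAULT else current
```

A cluster whose value changes from the old default to `shell(true)` is the case that requires restarting the HDFS service and its dependent services.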