Long-term component architecture

As the main curator of open standards in Hadoop, Cloudera has a track record of bringing new open source solutions into its platform (such as Apache Spark, Apache HBase, and Apache Parquet) that are eventually adopted by the community at large. Because these components are standards, you can build long-term architecture on them with confidence.

Thank you for choosing CDH. Your download instructions are below.

Installation

This section introduces options for installing Cloudera Manager, CDH, and managed services. You can install:

  • Cloudera Manager, CDH, and managed services in a Cloudera Manager deployment. This is the recommended method for installing CDH and managed services.
  • CDH 5 into an unmanaged deployment.

Cloudera Manager Deployment

A Cloudera Manager deployment consists of the following software components:

  • Oracle JDK
  • Cloudera Manager Server and Agent packages
  • Supporting database software
  • CDH and managed service software
This section describes the three main installation paths for creating a new Cloudera Manager deployment and the criteria for choosing an installation path. If your cluster already has an installation of a previous version of Cloudera Manager, follow the instructions in Upgrading Cloudera Manager.

The Cloudera Manager installation paths share some common phases, but the variant aspects of each path support different user and cluster host requirements:

  • Demonstration and proof of concept deployments - There are two installation options:
    • Installation Path A - Automated Installation by Cloudera Manager - Cloudera Manager automates the installation of the Oracle JDK, Cloudera Manager Server, embedded PostgreSQL database, Cloudera Manager Agent, CDH, and managed service software on cluster hosts, and configures databases for the Cloudera Manager Server and Hive Metastore and, optionally, for Cloudera Management Service roles. This path is recommended for demonstration and proof of concept deployments, but not for production deployments, because it is not intended to scale and may require database migration as your cluster grows. To use this method, server and cluster hosts must satisfy the following requirements:
      • Provide the ability to log in to the Cloudera Manager Server host using a root account or an account that has password-less sudo permission.
      • Allow the Cloudera Manager Server host to have uniform SSH access on the same port to all hosts. See Networking and Security Requirements for further information.
      • All hosts must have access to standard package repositories and either archive.cloudera.com or a local repository with the necessary installation files.
    • Installation Path B - Manual Installation Using Cloudera Manager Packages - You install the Oracle JDK, Cloudera Manager Server, and embedded PostgreSQL database packages on the Cloudera Manager Server host. You have two options for installing Oracle JDK, Cloudera Manager Agent, CDH, and managed service software on cluster hosts: install it manually yourself or use Cloudera Manager to automate installation. However, for Cloudera Manager to automate installation of Cloudera Manager Agent packages or CDH and managed service software, cluster hosts must satisfy the following requirements:
      • Allow the Cloudera Manager Server host to have uniform SSH access on the same port to all hosts. See Networking and Security Requirements for further information.
      • All hosts must have access to standard package repositories and either archive.cloudera.com or a local repository with the necessary installation files.
  • Production deployments - require you to first manually install and configure a production database for the Cloudera Manager Server and Hive Metastore. There are two installation options:
    • Installation Path B - Manual Installation Using Cloudera Manager Packages - You install the Oracle JDK and Cloudera Manager Server packages on the Cloudera Manager Server host. You have two options for installing Oracle JDK, Cloudera Manager Agent, CDH, and managed service software on cluster hosts: install it manually yourself or use Cloudera Manager to automate installation. However, for Cloudera Manager to automate installation of Cloudera Manager Agent packages or CDH and managed service software, cluster hosts must satisfy the following requirements:
      • Allow the Cloudera Manager Server host to have uniform SSH access on the same port to all hosts. See Networking and Security Requirements for further information.
      • All hosts must have access to standard package repositories and either archive.cloudera.com or a local repository with the necessary installation files.
    • Installation Path C - Manual Installation Using Cloudera Manager Tarballs - You install the Oracle JDK, Cloudera Manager Server, and Cloudera Manager Agent software as tarballs and use Cloudera Manager to automate installation of CDH and managed service software as parcels.
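
The automated installation options above require that the Cloudera Manager Server host can reach every cluster host over SSH on the same port. The following Python sketch shows one way to pre-check that requirement before starting. The function names and the default port 22 are assumptions made for illustration and are not part of Cloudera Manager; the check covers TCP reachability only, not authentication or sudo rights.

```python
import socket


def ssh_port_open(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def unreachable_hosts(hosts, port=22, timeout=3.0):
    """Return the hosts that do not accept TCP connections on the given port."""
    return [h for h in hosts if not ssh_port_open(h, port, timeout)]
```

An empty list from unreachable_hosts suggests that the "same port on all hosts" requirement is met at the network level; any host it returns needs attention before you use an automated path.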

 

Unmanaged Deployment

In an unmanaged deployment, you are responsible for managing all phases of the life cycle of CDH and managed service components on each host: installation, configuration, and service life cycle operations such as start and stop. This section describes alternatives for installing CDH 5 software in an unmanaged deployment.

  • Command-line methods:
    • Download and install the CDH 5 "1-click Install" package
    • Add the CDH 5 repository
    • Build your own CDH 5 repository
    If you use one of these command-line methods, the first (downloading and installing the "1-click Install" package) is recommended in most cases because it is simpler than building or adding a repository. See Installing the Latest CDH 5 Release for detailed instructions for each of these options.
  • Tarball: You can download a tarball from CDH downloads. Keep the following points in mind:
    • Installing CDH 5 from a tarball installs YARN.
    • In CDH 5, there is no separate tarball for MRv1. Instead, the MRv1 binaries, examples, etc., are delivered in the Hadoop tarball. The scripts for running MRv1 are in the bin-mapreduce1 directory in the tarball, and the MRv1 examples are in the examples-mapreduce1 directory.

CDH 5 provides packages for Red Hat-compatible, SLES, Ubuntu, and Debian systems as described below.

Operating System | Version | Packages

Red Hat Enterprise Linux (RHEL)-compatible
Red Hat Enterprise Linux | 5.7 | 64-bit
Red Hat Enterprise Linux | 6.2 | 64-bit
Red Hat Enterprise Linux | 6.4 | 64-bit
Red Hat Enterprise Linux | 6.4 in SELinux mode | 64-bit
Red Hat Enterprise Linux | 6.5 | 64-bit
CentOS | 5.7 | 64-bit
CentOS | 6.2 | 64-bit
CentOS | 6.4 | 64-bit
CentOS | 6.4 in SELinux mode | 64-bit
CentOS | 6.5 | 64-bit
Oracle Linux with default kernel and Unbreakable Enterprise Kernel | 5.6 (UEK R2) | 64-bit
Oracle Linux with default kernel and Unbreakable Enterprise Kernel | 6.4 (UEK R2) | 64-bit
Oracle Linux with default kernel and Unbreakable Enterprise Kernel | 6.5 (UEK R2, UEK R3) | 64-bit

SLES
SUSE Linux Enterprise Server (SLES) | 11 with Service Pack 2 or later | 64-bit

Ubuntu/Debian
Ubuntu | Precise (12.04) - Long-Term Support (LTS) | 64-bit
Ubuntu | Trusty (14.04) - Long-Term Support (LTS) | 64-bit
Debian | Wheezy (7.0, 7.1) | 64-bit

Note:

  • CDH 5 provides only 64-bit packages.
  • Cloudera has received reports that our RPMs work well on Fedora, but we have not tested this.
  • If you are using an operating system that is not supported by Cloudera packages, you can also download source tarballs from Downloads.

 

Component | MySQL | SQLite | PostgreSQL | Oracle | Derby (see Note 5)
Oozie | 5.5, 5.6 | - | 8.4, 9.1, 9.2, 9.3 (see Note 2) | 11gR2 | Default
Flume | - | - | - | - | Default (for the JDBC Channel only)
Hue | 5.5, 5.6 (see Note 1) | Default | 8.4, 9.1, 9.2, 9.3 (see Note 2) | 11gR2 | -
Hive/Impala | 5.5, 5.6 (see Note 1) | - | 8.4, 9.1, 9.2, 9.3 (see Note 2) | 11gR2 | Default
Sentry | 5.5, 5.6 (see Note 1) | - | 8.4, 9.1, 9.2, 9.3 (see Note 2) | 11gR2 | -
Sqoop 1 | See Note 3 | - | See Note 3 | See Note 3 | -
Sqoop 2 | See Note 4 | - | See Note 4 | See Note 4 | Default

Note:

  1. MySQL 5.5 is supported on CDH 5.1. MySQL 5.6 is supported on CDH 5.1 and later.
  2. PostgreSQL 9.2 is supported on CDH 5.1 and later. PostgreSQL 9.3 is supported on CDH 5.2 and later.
  3. For the purposes of transferring data only, Sqoop 1 supports MySQL 5.0 and above, PostgreSQL 8.4 and above, Oracle 10.2 and above, Teradata 13.10 and above, and Netezza TwinFin 5.0 and above. The Sqoop metastore works only with HSQLDB (1.8.0 and higher 1.x versions; the metastore does not work with any HSQLDB 2.x versions).
  4. Sqoop 2 can transfer data to and from MySQL 5.0 and above, PostgreSQL 8.4 and above, Oracle 10.2 and above, and Microsoft SQL Server 2012 and above. The Sqoop 2 repository database is supported only on Derby.
  5. Derby is supported as shown in the table, but not always recommended. See the pages for individual components in the Cloudera Installation and Upgrade guide for recommendations.

 

 

 

 


CDH 5 is supported with the versions shown in the table that follows.

Table 1. Supported JDK Versions

Latest Certified Version | Minimum Supported Version | Exceptions
1.7.0_67 | 1.7.0_67 | None
1.8.0_11 | 1.8.0_11 | None
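
A host's JDK can be compared against Table 1 with a few lines of Python. The sketch below is illustrative only: the function names are hypothetical, and it assumes that any later update of the same release line (1.7.0 or 1.8.0) also qualifies, which the table itself does not state.

```python
import re

# Minimum supported JDK updates per release line, copied from Table 1.
MINIMUM_SUPPORTED = {(1, 7): (1, 7, 0, 67), (1, 8): (1, 8, 0, 11)}


def parse_jdk_version(text):
    """Parse a string such as '1.7.0_67' into the tuple (1, 7, 0, 67)."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)_(\d+)", text.strip())
    if match is None:
        raise ValueError("unrecognized JDK version: %r" % text)
    return tuple(int(part) for part in match.groups())


def is_supported_jdk(text):
    """Return True if the version meets the Table 1 minimum for its release line."""
    version = parse_jdk_version(text)
    minimum = MINIMUM_SUPPORTED.get(version[:2])
    # Release lines that are absent from the table (for example 1.6) are rejected.
    return minimum is not None and version >= minimum
```

The version string is the one reported by `java -version`, for example 1.7.0_67.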


CDH requires IPv4. IPv6 is not supported.

See also Configuring Network Names.


What's New in CDH 5.2.0

Operating System Support

CDH 5.2.0 adds support for Ubuntu Trusty (version 14.04). See Supported Operating Systems.

Important:

Installing CDH by adding a repository entails an additional step on Ubuntu Trusty, to ensure that you get the CDH version of ZooKeeper, rather than the version that is bundled with Trusty. See Steps to Install CDH 5 Manually.

 

Apache Avro

CDH 5.2 implements Avro version 1.7.6, with backports from 1.7.7. Important changes include:

  • AVRO-1398: Increase default sync interval from 16k to 64k. There is a very small chance this could cause an incompatibility in some cases, but you can control the interval by setting avro.mapred.sync.interval in the MapReduce job configuration. For example, set it to 16000 to get the old behavior.
  • AVRO-1355: Record schema should reject duplicate field names. This change rejects schemas with duplicate field names. This could affect some applications, but if schemas have duplicate field names then they are unlikely to work properly in any case. The workaround is to make sure a record's field names are unique within the record.
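
Because schemas with duplicate field names are now rejected (AVRO-1355), it can help to scan existing schema files before upgrading. The following Python sketch uses only the standard json module rather than the Avro library; the function name is hypothetical, and it inspects only the top-level record, not nested records.

```python
import json
from collections import Counter


def duplicate_field_names(schema_json):
    """Return the sorted field names that occur more than once in a record schema."""
    schema = json.loads(schema_json)
    if not isinstance(schema, dict) or schema.get("type") != "record":
        # Only record schemas have a "fields" list to check.
        return []
    counts = Counter(field["name"] for field in schema.get("fields", []))
    return sorted(name for name, n in counts.items() if n > 1)
```

An empty result means the top-level record's field names are unique; any name it returns must be renamed or removed for the schema to be accepted.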

 

Apache Hadoop

HDFS

CDH 5.2 provides the following new capabilities:

  • HDFS Data at Rest Encryption

    Important: The HDFS Data at Rest Encryption feature included in CDH 5.2.0 has several known limitations. Therefore, Cloudera does not currently support this feature and it is not recommended for production use. If you're interested in trying the feature out in a test environment, contact your account team.

    HDFS now implements transparent, end-to-end encryption of data read from and written to HDFS by creating encryption zones. An encryption zone is a directory in HDFS with all of its contents, that is, every file and subdirectory in it, encrypted. For more details, see HDFS Data At Rest Encryption.
  • Extended attributes: HDFS XAttrs allow extended attributes to be stored per file (https://issues.apache.org/jira/browse/HDFS-2006).
  • Authentication improvements when using an HTTP proxy server.
  • A new Hadoop Metrics sink that allows writing directly to Graphite.
  • Specification for Hadoop Compatible Filesystem effort.
  • OfflineImageViewer to browse an fsimage via the WebHDFS API.
  • Supportability improvements and bug fixes to the NFS gateway.
  • Modernized web UIs (HTML5 and JavaScript) for HDFS daemons.

 

MapReduce

CDH 5.2 provides an optimized implementation of the mapper side of the MapReduce shuffle. The optimized implementation may require tuning different from the original implementation, and so it is considered experimental and is not enabled by default.

You can select this new implementation on a per-job basis by setting the job configuration value mapreduce.job.map.output.collector.class to org.apache.hadoop.mapred.nativetask.NativeMapOutputCollectorDelegator, or use Cloudera Manager to enable it.

Some jobs which use custom writable types or comparators may not be able to take advantage of the optimized implementation.
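
As an illustration of the per-job setting, the Python sketch below assembles a `hadoop jar` command line that passes the property with the generic `-D` option. It assumes the job's driver parses generic options (for example through ToolRunner); the helper name, jar file, and class name are hypothetical.

```python
COLLECTOR_KEY = "mapreduce.job.map.output.collector.class"
NATIVE_COLLECTOR = (
    "org.apache.hadoop.mapred.nativetask.NativeMapOutputCollectorDelegator"
)


def hadoop_jar_command(jar, main_class, job_args, native_collector=False):
    """Build an argv list for `hadoop jar`, optionally selecting the native collector."""
    command = ["hadoop", "jar", jar, main_class]
    if native_collector:
        # Generic options such as -D must come before the job's own arguments.
        command += ["-D", "%s=%s" % (COLLECTOR_KEY, NATIVE_COLLECTOR)]
    return command + list(job_args)
```

For example, hadoop_jar_command("wordcount.jar", "WordCount", ["/in", "/out"], native_collector=True) returns a list that can be passed to subprocess.run. Leaving native_collector at its default keeps the original collector, which matches the feature being disabled by default.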

 

YARN

CDH 5.2 provides the following new capabilities and improvements:

  • New features and improvements in the Fair Scheduler:
    • New features:
      • Fair Scheduler now allows setting the fairSharePreemptionThreshold per queue (leaf and non-leaf). This threshold is a decimal value between 0 and 1; if a queue's usage is under (preemption-threshold * fairshare) for a configured duration, resources from other queues are preempted to satisfy this queue's request. Set this value in fair-scheduler.xml. The default value is 0.5.
      • Fair Scheduler now allows setting the fairSharePreemptionTimeout per queue (leaf and non-leaf). For a starved queue, this timeout determines when to trigger preemption from other queues. Set this value in fair-scheduler.xml.
      • Fair Scheduler now shows the Steady Fair Share in the Web UI. The Steady Fair Share is the share of the cluster resources a particular queue or pool would get if all existing queues had running applications.
    • Improvements:
      • Fair Scheduler uses Instantaneous Fair Share (fairshare that considers only active queues) for scheduling decisions to improve the time to achieve steady state (fairshare).
      • The default for maxAMShare is now 0.5, meaning that only half the cluster's resources can be taken up by Application Masters. You can change this value in fair-scheduler.xml.
  • YARN's REST APIs support submitting and killing applications.
  • YARN's timeline store is integrated with Kerberos.

 

Apache Crunch

CDH 5.2 provides the following new capabilities:

  • Improvements in Scrunch, including:
    • New join API that matches the one in Crunch
    • New aggregation API, including support for Algebird-based aggregations
    • Built-in serialization support for all tuple types as well as case classes.
  • A new module, crunch-hive, for reading and writing Optimized Row Columnar (ORC) Files with Crunch.
 

Apache Flume

CDH 5.2 provides the following new capabilities:

  • Kafka Integration: Flume can now accept data from Kafka via the KafkaSource (FLUME-2250) and push to Kafka using the KafkaSink (FLUME-2251).
  • Kite Sink can now write to Hive and HBase datasets (FLUME-2463).
  • Flume agents can now be configured via ZooKeeper (experimental, FLUME-1491)
  • Embedded Agents now support Interceptors (FLUME-2426)
  • syslog Sources now support configuring which fields should be kept (FLUME-2438)
  • File Channel replay is now much faster (FLUME-2450)
  • New regular-expression search-and-replace interceptor (FLUME-2431)
  • Backup checkpoints can be optionally compressed (FLUME-2401)

 

Hue

CDH 5.2 provides the following new capabilities:

  • New application for editing Sentry roles and Privileges on databases and tables
  • Search App:
    • Heatmap, Tree, Leaflet widgets
    • Micro-analysis of fields
    • Exclusion facets
  • Oozie Dashboard: bulk actions, faster display
  • File Browser: drag-and-drop upload, history, ACL editing
  • Hive and Impala: LDAP pass-through, query expiration, SSL (Hive), new graphs
  • Job Browser: YARN kill application button
 

Apache HBase

CDH 5.2 implements HBase 0.98.6, which represents a minor upgrade to HBase. This upgrade introduces new features and moves some features which were previously marked as experimental to fully supported status. For detailed information and instructions on how to use the new capabilities, see New Features and Changes for HBase in CDH 5.

 

Apache Hive

CDH 5.2 introduces the following important changes in Hive.

  • CDH 5.2 implements Hive 0.13, providing the following new capabilities:
    • Sub-queries in the WHERE clause
    • Common table expressions (CTE)
    • Parquet supports timestamp
    • HiveServer2 can be configured with a hiverc file that is automatically run when users connect
    • Permanent UDFs
    • HiveServer2 session and operation timeouts
    • Beeline accepts a -i option to initialize with a SQL file
    • New join syntax (implicit joins)
  • As of CDH 5.2.0, you can create Avro-backed tables simply by using STORED AS AVRO in a DDL statement. The AvroSerDe takes care of creating the appropriate Avro schema from the Hive table schema, making it much easier to use Avro with Hive.
  • Hive supports additional datatypes, as follows:
    • Hive can read char and varchar datatypes written by Hive, and char and varchar datatypes written by Impala.
    • Impala can read char and varchar datatypes written by Hive and Impala.
    These new types have been enabled by expanding the supported DDL, so they are backward compatible. You can add varchar(n) columns by creating new tables with that type, or changing a string column in existing tables to varchar.

    Note:

    char(n) columns are not stored in a fixed-length representation, and do not improve performance (as they do in some other databases). Cloudera recommends that in most cases you use text or varchar instead.

     

  • DESCRIBE DATABASE returns additional fields: owner_name and owner_type. The command will continue to behave as expected if you identify the field you're interested in by its (string) name, but could produce unexpected results if you use a numeric index to identify the field(s).
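
The DESCRIBE DATABASE change is easiest to see with a small example. In the Python sketch below, the two column lists are hypothetical layouts chosen for illustration (check the actual result schema of your Hive version), and field_by_name is a hypothetical helper; the point is only that a lookup by name keeps working when columns are inserted, while a fixed numeric index does not.

```python
# Hypothetical column layouts before and after the change, for illustration only.
OLD_COLUMNS = ["db_name", "comment", "location", "parameters"]
NEW_COLUMNS = ["db_name", "comment", "location",
               "owner_name", "owner_type", "parameters"]


def field_by_name(columns, row, name):
    """Look up a value by column name, independent of column position."""
    return row[columns.index(name)]


old_row = ["sales", "", "hdfs:///warehouse/sales.db", "{}"]
new_row = ["sales", "", "hdfs:///warehouse/sales.db", "etl", "USER", "{}"]

# The name-based lookup returns the same field for both layouts.
params_old = field_by_name(OLD_COLUMNS, old_row, "parameters")
params_new = field_by_name(NEW_COLUMNS, new_row, "parameters")

# A hard-coded index silently returns a different field after the change.
by_index_old = old_row[3]
by_index_new = new_row[3]
```

Here params_old and params_new are both the parameters value, whereas by_index_new holds the owner name instead.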
 

Kite

Kite is an open source set of libraries, references, tutorials, and code samples for building data-oriented systems and applications. For more information about Kite, see the Kite SDK Development Guide.

Kite has been rebased to version 0.15.0 in CDH 5.2.0, from the base version 0.10.0 in CDH 5.1. kite-morphlines modules are backward-compatible, but this change breaks backward-compatibility for the kite-data API.

 

Kite Data

The Kite data API has had substantial updates since the version included in CDH 5.1.

Changes from 0.15.0

The Kite version in CDH 5.2 is based on 0.15.0, but includes some newer changes. Specifically, it includes support for dataset namespaces, which can be used to set the database in the Hive Metastore.

The introduction of namespaces changed the file system repository layout; now there is an additional namespace directory for datasets stored in HDFS (repository/namespace/dataset/). There are no compatibility problems when you use Dataset URIs, but all datasets created with the DatasetRepository API will be located in a namespace directory. This new directory level is not expected in Kite 0.15.0 or 0.16.0 and will prevent the dataset from being loaded. The work-around is to switch to using Dataset URIs (see below) that include the namespace component. Existing datasets will work without modification.

Except as noted above, Kite 0.15.0 in CDH 5.2 is fully backward-compatible. It can load datasets written with any previous Kite version.
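
The repository/namespace/dataset/ layout described above can be expressed as a small path helper. This Python sketch is illustrative only; the function name and the example paths are hypothetical and are not part of the Kite API.

```python
import posixpath


def dataset_directory(repository, namespace, dataset):
    """Return the repository/namespace/dataset/ directory for a dataset."""
    for part in (namespace, dataset):
        # Each component must be a single, non-empty path segment.
        if not part or "/" in part:
            raise ValueError("invalid path component: %r" % part)
    return posixpath.join(repository, namespace, dataset) + "/"
```

For a repository rooted at /data/repo, a dataset named events in the namespace default would be expected under /data/repo/default/events/, one directory level deeper than in the layout without namespaces.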

 

Dataset URIs

Datasets are identified with a single URI, rather than a repository URI and dataset name. The dataset URI contains all the information Kite needs to determine which implementation (Hive, HBase, or HDFS) to use for the dataset, and includes both the dataset's name and a namespace.

The Kite API has been updated so that developers call methods in the Datasets utility class as they would use DatasetRepository methods. The Datasets methods are recommended, and the DatasetRepository API is deprecated.
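
To make the idea of a single dataset URI concrete, the Python sketch below composes URIs for the Hive and HDFS implementations. The URI shapes used here (dataset:hive:namespace/name and dataset:hdfs:/path/namespace/name) are assumptions based on the patterns in the Kite documentation and should be verified against your Kite version; the function names are hypothetical.

```python
def hive_dataset_uri(namespace, name):
    """Compose a Hive dataset URI from a namespace and a dataset name."""
    return "dataset:hive:%s/%s" % (namespace, name)


def hdfs_dataset_uri(repository_path, namespace, name):
    """Compose an HDFS dataset URI below an absolute repository path."""
    if not repository_path.startswith("/"):
        raise ValueError("repository_path must be absolute")
    return "dataset:hdfs:%s/%s/%s" % (repository_path.rstrip("/"), namespace, name)
```

In both cases the namespace and the dataset name are part of the URI itself, so no separate repository URI is needed to locate the dataset.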

 

Views

The Kite data API now allows you to select a view of the dataset by setting constraints. These constraints are used by Kite to automatically prune unnecessary partitions and filter records.

 

MapReduce input and output formats

The kite-data-mapreduce module has been added. It provides both DatasetKeyInputFormat and DatasetKeyOutputFormat that allow you to run MapReduce jobs over datasets or views. Spark is also supported by the input and output formats.

 

Dataset CLI tool

Kite now includes a command-line utility that can run common maintenance tasks, like creating a dataset, migrating a dataset's schema, copying from one dataset to another, and importing CSV data. It also has helpers that can create Avro schemas from data files and other Kite-related configuration.

 

Flume DatasetSink

The Flume DatasetSink has been updated for the kite-data API changes. It supports all previous configurations without modification.

In addition, the DatasetSink now supports dataset URIs with the configuration option kite.dataset.uri.
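
A Flume agent configuration that uses kite.dataset.uri might be generated as in the Python sketch below. The agent, sink, and channel names and the dataset URI are hypothetical, and the sink type class name is taken from the Flume User Guide rather than from this page; treat the output as a starting point, not a complete sink configuration.

```python
def dataset_sink_properties(agent, sink, channel, dataset_uri):
    """Render Flume properties lines for a DatasetSink addressed by dataset URI."""
    prefix = "%s.sinks.%s" % (agent, sink)
    lines = [
        "%s.type = org.apache.flume.sink.kite.DatasetSink" % prefix,
        "%s.channel = %s" % (prefix, channel),
        "%s.kite.dataset.uri = %s" % (prefix, dataset_uri),
    ]
    return "\n".join(lines)
```

Printing dataset_sink_properties("agent1", "kite1", "ch1", "dataset:hive:default/events") produces three lines that can be merged into the agent's properties file.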

 

Apache Oozie

CDH 5.2 introduces the following important changes:

  • A new Hive 2 Action allows Oozie to run HiveServer2 scripts. Using the Hive Action with HiveServer2 is now deprecated; you should switch to the new Hive 2 Action as soon as possible.
  • The MapReduce action can now also be configured by Java code

    This gives users the flexibility of using their own driver Java code for configuring the MR job, while also getting the advantages of the MapReduce action (instead of using the Java action). See the documentation for more info.

  • The PurgeService is now able to remove completed child jobs from long running coordinator jobs

  • ALL can now be set for oozie.service.LiteWorkflowStoreService.user.retry.error.code.ext to make Oozie retry actions automatically for every type of error

  • All Oozie servers in an Oozie HA group now synchronize on the same randomly generated rolling secret for signing auth tokens

  • You can now upgrade from CDH 4.x to CDH 5.2 and later with jobs in RUNNING and SUSPENDED states. (An upgrade from CDH 4.x to a CDH 5.x release earlier than CDH 5.2.0 would still require that no jobs be in either of those states).
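
As an illustration of the retry setting mentioned above, the Python sketch below renders a Hadoop-style property element with the value ALL. It assumes the property is placed in oozie-site.xml (or set through Cloudera Manager); the helper name is hypothetical.

```python
import xml.etree.ElementTree as ET

RETRY_KEY = "oozie.service.LiteWorkflowStoreService.user.retry.error.code.ext"


def site_property(name, value):
    """Render a <property> element with <name> and <value> children as a string."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return ET.tostring(prop, encoding="unicode")
```

site_property(RETRY_KEY, "ALL") returns the element as a single-line string. Retrying on every error type can mask persistent failures, so it is worth combining with sensible retry limits on the actions themselves.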

 

Apache Parquet (incubating)

CDH 5.2 Parquet is rebased on Parquet 1.5 and Parquet-format 2.1.0.

 

Cloudera Search

New Features:

  • Cloudera Search adds support for Spark indexing using the CrunchIndexerTool. For more information, see Spark Indexing Reference (CDH 5.2 or later only).
  • Cloudera Search adds support for the SolrStorageHandler for use with Hive. For more information, see SolrStorageHandler Reference (CDH 5.2 or later only).
  • Cloudera Search adds fault tolerance for single-shard deployments. This fault tolerance is enabled with a new -a option in solrctl, which configures shards to automatically be re-added on an existing, healthy node if the node hosting the shard becomes unavailable.
  • Cloudera Search includes a version of Kite 0.10.0 that incorporates morphlines-related backports of all fixes and features in Kite 0.17.0.
  • Search adds support for multi-threaded faceting on fields. This enables parallelizing operations, allowing them to run more quickly on highly concurrent hardware. This is especially helpful in cases where faceting operations apply to large datasets over many fields. For more information, see Tuning the Solr Server.
  • Search adds support for distributed pivot faceting, enabling faceting on multi-shard collections.

 

Apache Sentry (incubating)

CDH 5.2 introduces the following changes to Sentry.

Sentry Service:

  • If you are using the database-backed Sentry service, upgrading from CDH 5.1 to CDH 5.2 will require a schema upgrade. For instructions, see Installing and Upgrading Sentry.
  • Hive SQL Syntax:
    • GRANT and REVOKE statements have been expanded to include WITH GRANT OPTION, thus allowing you to delegate granting and revoking privileges.
    • The SHOW GRANT ROLE command has been updated to allow non-admin users to list grants for roles that are currently assigned to them.
    • The SHOW ROLE GRANT GROUP <groupName> command has been updated to allow non-admin users that are part of the group specified by <groupName> to list all roles assigned to this group.

      For more details on these changes, see the updated Hive SQL Syntax.
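
The statements discussed above can be composed as in the following Python sketch. It only builds statement strings; the helper names and the example role, group, and table names are hypothetical, and the identifier check is a deliberately strict assumption that allows plain or database-qualified names only.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?")


def _checked(identifier):
    """Reject anything that is not a plain or database-qualified identifier."""
    if _IDENTIFIER.fullmatch(identifier) is None:
        raise ValueError("unsafe identifier: %r" % identifier)
    return identifier


def grant_with_option(privilege, table, role):
    """Grant a table privilege to a role and let the role grant it onward."""
    if privilege.upper() not in ("SELECT", "INSERT", "ALL"):
        raise ValueError("unsupported privilege: %r" % privilege)
    return "GRANT %s ON TABLE %s TO ROLE %s WITH GRANT OPTION;" % (
        privilege.upper(), _checked(table), _checked(role))


def show_grant_role(role):
    """List the grants held by a role."""
    return "SHOW GRANT ROLE %s;" % _checked(role)


def show_role_grant_group(group):
    """List the roles assigned to a group."""
    return "SHOW ROLE GRANT GROUP %s;" % _checked(group)
```

A role that receives a privilege through grant_with_option("select", "sales.orders", "analyst") can in turn grant or revoke that privilege, which is the delegation that WITH GRANT OPTION provides.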

       

Apache Spark

CDH 5.2 Spark is rebased on Apache Spark/Streaming 1.1 and provides the following new capabilities:

  • Stability and performance improvements.
  • New sort-based shuffle implementation (disabled by default).
  • Better performance monitoring through the Spark UI.
  • Support for arbitrary Hadoop InputFormats in PySpark.
  • Improved YARN support with several bug fixes.

 

Apache Sqoop

CDH 5.2 Sqoop 1 is rebased on Sqoop 1.4.5 and includes the following changes:

  • Mainframe connector added.
  • Parquet support added.

 

There are no changes for Sqoop 2.

 

 

 

 

