Data Replication

Cloudera Manager provides rich functionality for replicating data (stored in HDFS or accessed through Hive) across data centers. When critical data is stored in HDFS, Cloudera Manager provides the capabilities needed to ensure that the data is available at all times, even in the event of a complete data center shutdown.

For recommendations on using data replication and Sentry authorization, see Configuring Sentry to Enable BDR Replication.

In Cloudera Manager 5, replication is supported between CDH 5 and CDH 4 clusters. Support for HDFS and Hive replication is as follows.

Supported Replication Scenarios

  • HDFS and Hive
    • Cloudera Manager 4 with CDH 4 to Cloudera Manager 5 with CDH 4.
    • Cloudera Manager 5 with CDH 4 to Cloudera Manager 4.7.3 or later with CDH 4.
    • Cloudera Manager 5 with CDH 4 to Cloudera Manager 5 with CDH 4.
    • Cloudera Manager 5 with CDH 5 to Cloudera Manager 5 with CDH 5.
    • Cloudera Manager 4 or 5 with CDH 4.4 or later to Cloudera Manager 5 with CDH 5.
    • Cloudera Manager 5 with CDH 5 to Cloudera Manager 5 with CDH 4.4 or later.
    • (HDFS only) Within one Cloudera Manager instance, from one directory to another directory within the same cluster or to a different cluster. Both clusters must be running CDH 4.8 or higher.
  • SSL
    • Between CDH 5.0 with SSL and CDH 5.0 with SSL.
    • Between CDH 5.0 with SSL and CDH 5.0 without SSL.
    • From a CDH 5.1 source cluster with SSL and YARN.

Unsupported Replication Scenarios

  • HDFS and Hive
    • Cloudera Manager 5 with CDH 5 as the source, and Cloudera Manager 4 with CDH 4 as the target.
    • Between Cloudera Enterprise and any Cloudera Manager free edition: Cloudera Express, Cloudera Standard, or Cloudera Manager Free Edition.
    • Between CDH 5 and CDH 4 (in either direction) where the replicated data includes a directory that contains a large number of files or subdirectories (several hundred thousand entries), causing out-of-memory errors. This is because of limitations in the WebHDFS API. The workaround is to increase the heap size as follows:
      1. On the target Cloudera Manager instance, go to the HDFS service page.
      2. Click the Configuration tab.
      3. Expand the Service-Wide category.
      4. Click Advanced > HDFS Replication Advanced Configuration Snippet.
      5. Increase the heap size by adding a key-value pair, for example, HADOOP_CLIENT_OPTS=-Xmx1g. In this example, 1g sets the heap size to 1 GB. Adjust this value according to the number of files and directories being replicated; the same change can also be made programmatically, as in the sketch below.
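      A minimal sketch of making the same change through the Cloudera Manager Python API (cm_api) follows. The cluster name, service name, and the configuration key used for the HDFS Replication Advanced Configuration Snippet are assumptions; verify them against your deployment before use.

          # Sketch only: raise the replication client heap via the Cloudera Manager
          # Python API (cm_api) instead of the Configuration tab.
          from cm_api.api_client import ApiResource

          api = ApiResource("cm-host.example.com", username="admin", password="admin")
          cluster = api.get_cluster("Cluster 1")   # assumed cluster name
          hdfs = cluster.get_service("hdfs")       # assumed HDFS service name

          # "hdfs_replication_env_safety_valve" is an assumed key for the HDFS
          # Replication Advanced Configuration Snippet; check the actual key name
          # in your Cloudera Manager version. The value sets the client heap to 1 GB.
          hdfs.update_config({
              "hdfs_replication_env_safety_valve": "HADOOP_CLIENT_OPTS=-Xmx1g"
          })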
    • Replication involving HDFS data from CDH 5 HA to CDH 4 clusters, or from CDH 4 HA to CDH 5 clusters, fails if a NameNode failover occurs during replication. This is because of limitations in the CDH WebHDFS API.
  • HDFS
    • Between a source cluster that has encryption over-the-wire enabled and a target cluster running CDH 4.0. This is because the CDH 4 client is used for replication in this case, and it does not support over-the-wire encryption.
    • From CDH 5 to CDH 4 where there are URL-encoding characters such as % in file and directory names. This is because of a bug in the CDH 4 WebHDFS API.
    • HDFS replication from CDH 5 to CDH 4 between different Kerberos realms does not work with older JDK versions. This is because of a JDK SPNEGO issue; for more information, see JDK-6670362. Use JDK 7, or upgrade to JDK 6u34 or later on the CDH 4 cluster, to work around this issue.
    • Replication for HDFS paths with encryption-at-rest enabled is not currently supported.
  • Hive
    • With data replication, between a source cluster that has encryption enabled and a target cluster running CDH 4. This is because the CDH 4 client used for replication does not support encryption.
    • Without data replication, between a source cluster running CDH 4 and a target cluster that has encryption enabled.
    • Between CDH 4.2 or later and CDH 4, if the Hive schema contains views.
    • With the same cluster as both source and destination.
    • Replication from CDH 4 to CDH 5 HA can fail if a NameNode failover happens during replication.
    • Hive replication from CDH 5 to CDH 4 between different Kerberos realms with older JDK versions, if data replication is enabled (since this involves HDFS replication). This is because of a JDK SPNEGO issue; for more information, see JDK-6670362. Use JDK 7, or upgrade to JDK 6u34 or later on the CDH 4 cluster, to work around this issue.
    • Hive replication from CDH 4 to CDH 5 between different Kerberos realms with older JDK versions (even without data replication enabled). This is because of a JDK SPNEGO issue; for more information, see JDK-6670362. Use JDK 7, or upgrade to JDK 6u34 or later on the CDH 4 cluster, to work around this issue.
    • Replication for Hive data from HDFS paths with encryption-at-rest enabled is not currently supported.
    • Cloudera Manager 5.2 supports replication of Impala user-defined functions (UDFs) only when running CDH 5.2 or later. In clusters running Cloudera Manager 5.2 with a CDH version earlier than 5.2 that include Impala UDFs, Hive replication succeeds, but replication of the Impala UDFs is skipped.
  • SSL
    • From a CDH 4.x source cluster with SSL.
    • From a CDH 5.0 source cluster with SSL and YARN (because of a YARN bug).
    • Between CDH 5.0 with SSL and CDH 4.x.
  • Kerberos
    • From a source cluster configured to use Kerberos authentication to a target cluster that is not configured to use Kerberos authentication.
    • From a source cluster not configured to use Kerberos authentication to a target cluster that is configured to use Kerberos authentication.