Data Replication

Cloudera Manager enables you to replicate data across data centers for disaster recovery. Replications can include data stored in HDFS, data stored in Hive tables, Hive metastore data, and Impala metadata (catalog server metadata) associated with Impala tables registered in the Hive metastore. When critical data is stored in HDFS, Cloudera Manager helps ensure that the data is available at all times, even in the case of a complete shutdown of a data center.

You can also replicate HDFS data, as well as Hive data and metadata, to and from Amazon S3.

For an overview of data replication, view this video about Backing Up Data Using Cloudera Manager.

You can also use the HBase shell to replicate HBase data. (Cloudera Manager does not manage HBase replications.)

For recommendations on using data replication and Sentry authorization, see Configuring Sentry to Enable BDR Replication.


Cloudera License Requirements for Replication

Both the source and destination clusters must have a Cloudera Enterprise license.

Requirements for Replicating Highly Available Clusters

You must use unique nameservice names for HDFS clusters that are highly available.
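For example, the nameservice name is set with the `dfs.nameservices` property in hdfs-site.xml (in Cloudera Manager, this is derived from the HDFS nameservice configuration). A sketch of what this requirement means in practice; the names `nameservice-prod` and `nameservice-dr` are placeholders, not values from this document:

```
<!-- hdfs-site.xml on the source (highly available) cluster -->
<property>
  <name>dfs.nameservices</name>
  <value>nameservice-prod</value>
</property>

<!-- hdfs-site.xml on the destination cluster: the name must differ
     from the source cluster's nameservice name -->
<property>
  <name>dfs.nameservices</name>
  <value>nameservice-dr</value>
</property>
```

If both clusters used the same nameservice name, replication jobs could not unambiguously resolve which cluster a nameservice URI refers to.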

Supported and Unsupported Replication Scenarios

Supported Replication Scenarios

In Cloudera Manager 5, replication is supported between CDH 5 or CDH 4 clusters. The following tables describe support for HDFS and Hive/Impala replication. The versions listed are the lowest versions of Cloudera Manager and CDH required to perform the replication. For example, replication with TLS/SSL enabled on Hadoop services on the source and destination clusters requires Cloudera Manager and CDH version 5.0 or higher on both the source and destination.

The table below lists the supported replication scenarios:

| Service | Source CM Version | Source CDH Version | Source Comment | Destination CM Version | Destination CDH Version | Destination Comment |
| --- | --- | --- | --- | --- | --- | --- |
| HDFS, Hive | 4 | 4 | | 5 | 4 | |
| HDFS, Hive | 4 | 4.4 | | 5 | 5 | |
| HDFS, Hive | 5 | 5 | TLS/SSL enabled on Hadoop services | 5 | 5 | TLS/SSL enabled on Hadoop services |
| HDFS, Hive | 5 | 5 | TLS/SSL enabled on Hadoop services | 5 | 5 | TLS/SSL not enabled on Hadoop services |
| HDFS, Hive | 5 | 5.1 | TLS/SSL enabled on Hadoop services and YARN | 5 | 4 or 5 | |
| HDFS, Hive | 5 | 4 | | 4.7.3 or higher | 4 | |
| HDFS, Hive | 5 | 4 | | 5 | 4 | |
| HDFS, Hive | 5 | 5 | | 5 | 5 | |
| HDFS, Hive | 5 | 5 | | 5 | 4.4 or higher | |
| HDFS, Hive | 5.7 | 5.7, with Isilon storage | See Supported Replication Scenarios for Clusters using Isilon Storage. | 5.7 | 5.7, with Isilon storage | See Supported Replication Scenarios for Clusters using Isilon Storage. |
| HDFS, Hive | 5.7 | 5.7 | | 5.7 | 5.7, with Isilon storage | |
| HDFS, Hive | 5.7 | 5.7, with Isilon storage | | 5.7 | 5.7 | |
| HDFS, Hive | 5.8 | 5.8, with or without Isilon storage | | 5.8 | 5.8, with or without Isilon storage | |

Unsupported Replication Scenarios

| Service | Source CM Version | Source CDH Version | Source Comment | Destination CM Version | Destination CDH Version | Destination Comment |
| --- | --- | --- | --- | --- | --- | --- |
| Any | 4 or 5 | 4 or 5 | Kerberos enabled | 4 or 5 | 4 or 5 | Kerberos not enabled |
| Any | 4 or 5 | 4 or 5 | Kerberos not enabled | 4 or 5 | 4 or 5 | Kerberos enabled |
| HDFS, Hive | 4 or 5 | 4 | The replicated data includes a directory containing a large number of files or subdirectories (several hundred thousand entries), causing out-of-memory errors. To work around this issue, follow the procedure below. | 4 or 5 | 5 | |
| Hive | 4 or 5 | 4 | Replicate HDFS Files is disabled | 4 or 5 | 4 or 5 | Over-the-wire encryption is enabled |
| Hive | 4 or 5 | 4 | Replication can fail if the NameNode fails over during replication | 4 or 5 | 5, with high availability enabled | Replication can fail if the NameNode fails over during replication |
| Hive | 4 or 5 | 4 | The clusters use different Kerberos realms | 4 or 5 | 5 | An older JDK is deployed. (Upgrade the CDH 4 cluster to use JDK 7 or JDK 6u34 to work around this issue.) |
| Any | 4 or 5 | 4 | TLS/SSL enabled on Hadoop services | 4 or 5 | 4 or 5 | |
| Hive | 4 or 5 | 4.2 or higher | The Hive schema contains views | 4 or 5 | 4 | |
| HDFS | 4 or 5 | 4, with high availability enabled | Replications fail if NameNode failover occurs during replication | 4 or 5 | 5, without high availability | Replications fail if NameNode failover occurs during replication |
| HDFS | 4 or 5 | 4 or 5 | Over-the-wire encryption is enabled | 4 or 5 | 4 | |
| HDFS | 4 or 5 | 5 | File or directory names contain URL-encoded characters such as % | 4 or 5 | 4 | |
| Hive | 4 or 5 | 4 or 5 | Over-the-wire encryption is enabled and Replicate HDFS Files is enabled | 4 or 5 | 4 | |
| Hive | 4 or 5 | 4 or 5 | From one cluster to the same cluster | 4 or 5 | 4 or 5 | From one cluster to the same cluster |
| HDFS, Hive | 4 or 5 | 5 | The replicated data includes a directory containing a large number of files or subdirectories (several hundred thousand entries), causing out-of-memory errors. To work around this issue, follow the procedure below. | 4 or 5 | 4 | |
| HDFS | 4 or 5 | 5 | The clusters use different Kerberos realms | 4 or 5 | 4 | An older JDK is deployed. (Upgrade the CDH 4 cluster to use JDK 7 or JDK 6u34 to work around this issue.) |
| Hive | 4 or 5 | 5 | Replicate HDFS Files is enabled and the clusters use different Kerberos realms | 4 or 5 | 4 | An older JDK is deployed. (Upgrade the CDH 4 cluster to use JDK 7 or JDK 6u34 to work around this issue.) |
| Any | 4 or 5 | 5 | TLS/SSL enabled on Hadoop services and YARN | 4 or 5 | 4 or 5 | |
| Any | 4 or 5 | 5 | TLS/SSL enabled on Hadoop services | 4 or 5 | 4 | |
| HDFS | 4 or 5 | 5, with high availability enabled | Replications fail if NameNode failover occurs during replication | 4 or 5 | 4, without high availability | Replications fail if NameNode failover occurs during replication |
| HDFS, Hive | 5 | 5 | | 4 | 4 | |
| Hive | 5.2 | 5.2 or lower | Replication of Impala UDFs is skipped | 4 or 5 | 4 or 5 | |
Workaround for replicated data that includes a directory containing several hundred thousand files or subdirectories:
  1. On the destination Cloudera Manager instance, go to the HDFS service page.
  2. Click the Configuration tab.
  3. Select Scope > HDFS service name (Service-Wide) and Category > Advanced.
  4. Locate the HDFS Replication Environment Advanced Configuration Snippet (Safety Valve) for hadoop-env.sh property.
  5. Increase the heap size by adding a key-value pair, for example HADOOP_CLIENT_OPTS=-Xmx1g, which sets the heap size to 1 GB. Adjust this value according to the number of files and directories being replicated.
  6. Click Save Changes to commit the changes.
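The safety valve field from step 5 takes shell-style key-value lines that are appended to hadoop-env.sh. A sketch of its contents, assuming a 4 GB heap is appropriate for the replication size (the 4g figure is illustrative, not a value from this document):

```
# HDFS Replication Environment Advanced Configuration Snippet
# (Safety Valve) for hadoop-env.sh
# Sets the JVM heap for the replication client; scale the -Xmx value
# up with the number of files and directories being replicated.
HADOOP_CLIENT_OPTS=-Xmx4g
```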

HDFS and Hive/Impala Replication To and From Amazon S3 or Microsoft ADLS

Minimum Required Role: User Administrator (also provided by Full Administrator)

To configure Amazon S3 or Microsoft ADLS as a source or destination for HDFS or Hive/Impala replication, you configure AWS Credentials or Azure Credentials. See How to Configure AWS Credentials or Configuring ADLS Access Using Cloudera Manager.

After adding the AWS or Azure credentials, you can click the Replication Schedules link to define a replication schedule. See HDFS Replication or Hive/Impala Replication for details about creating replication schedules. You can also click Close and create the replication schedules later. Select the AWS or Azure Credentials account in the Source or Destination drop-down lists when creating the schedules.
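When specifying cloud paths in a replication schedule, Amazon S3 locations use the s3a:// scheme and ADLS (Gen1) locations use the adl:// scheme. A sketch of the path formats; the bucket, account, and directory names below are placeholders:

```
s3a://example-bucket/backups/warehouse
adl://example-account.azuredatalakestore.net/backups/warehouse
```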

Supported Replication Scenarios for Clusters using Isilon Storage

Note the following when scheduling replication jobs for clusters that use Isilon storage:
  • In CDH 5.8 and higher, replication is supported for clusters using Kerberos and Isilon storage on the source cluster, the destination cluster, or both. See Configuring Replication with Kerberos and Isilon. Replication between clusters using Isilon storage and Kerberos is not supported in CDH 5.7.
  • Make sure that the hdfs user is a superuser in the Isilon system. If you specify alternate users with the Run As option when creating replication schedules, those users must also be superusers.
  • Cloudera recommends that you use the Isilon root user for replication jobs. (Specify root in the Run As field when creating replication schedules.)
  • Select the Skip checksum checks property when creating replication schedules.
  • Clusters that use Isilon storage do not support snapshots. Snapshots are used to ensure data consistency during replications in scenarios where the source files are being modified. Therefore, when replicating from an Isilon cluster, Cloudera recommends that you do not replicate Hive tables or HDFS files that could be modified before the replication completes.

See Using CDH with Isilon Storage.

BDR Log Retention

By default, Cloudera Manager retains BDR logs for 90 days. You can change the number of days for which Cloudera Manager retains logs, or disable log retention completely:
  1. In the Cloudera Manager Admin Console, search for the following property: Backup and Disaster Log Retention.
  2. Enter the number of days you want to retain logs for. To disable log retention, enter -1.

You can also set the Backup and Disaster Log Retention property from the HDFS service's Configuration page:

  1. Go to Cloudera Manager > HDFS > Configuration and locate the Backup and Disaster Log Retention property.
  2. Enter the required value.