HDFS Replication

Minimum Required Role: BDR Administrator (also provided by Full Administrator)

HDFS replication enables you to copy (replicate) your HDFS data from one HDFS service to another, synchronizing the data set on the destination service with the data set on the source service, based on a specified replication schedule. You can also replicate HDFS data to and from Amazon S3 or Microsoft ADLS. The destination service must be managed by the Cloudera Manager Server where the replication is being set up, and the source service can be managed by that same server or by a peer Cloudera Manager Server. You can also replicate HDFS data within a cluster by specifying different source and destination directories.

Remote BDR Replication automatically copies HDFS metadata to the destination cluster as it copies files. HDFS metadata need only be backed up locally. For information about how to backup HDFS metadata locally, see Backing Up and Restoring NameNode Metadata.

Source Data
Network Latency and Replication
Performance and Scalability Limitations
Replication with Sentry Enabled
Guidelines for Snapshot Diff-based Replication
Replicating from Insecure to Secure Clusters
Configuring Replication of HDFS Data
Limiting Replication Hosts
Viewing Replication Schedules
Viewing Replication History
HDFS Replication To and From Cloud Storage

Source Data

While a replication runs, ensure that the source directory is not modified. A file added during replication does not get replicated. If you delete a file during replication, the replication fails.

Additionally, ensure that all files in the directory are closed. Replication fails if source files are open. If you cannot ensure that all source files are closed, you can configure the replication to continue despite errors. Uncheck the Abort on Error option for the HDFS replication. For more information, see Configuring Replication of HDFS Data

After the replication completes, you can view the log for the replication to identify opened files. Ensure these files are closed before the next replication occurs.

Network Latency and Replication

High latency among clusters can cause replication jobs to run more slowly, but does not cause them to fail. For best performance, latency between the source cluster NameNode and the destination cluster NameNode should be less than 80 milliseconds. (You can test latency using the Linux ping command.) Cloudera has successfully tested replications with latency of up to 360 milliseconds. As latency increases, replication performance degrades.

Performance and Scalability Limitations

HDFS replication has the following limitations:

Maximum number of files for a single replication job: 100 million.
Maximum number of files for a replication schedule that runs more frequently than once in 8 hours: 10 million.
The throughput of the replication job depends on the absolute read and write throughput of the source and destination clusters.
Regular rebalancing of your HDFS clusters is required for efficient operation of replications. See HDFS Balancers.

Replication with Sentry Enabled

If the cluster has Sentry enabled and you are using BDR to replicate files or tables and their permissions, configuration changes to HDFS are required.

The configuration changes are required due to how HDFS manages ACLs. When a user reads ACLs, HDFS provides the ACLs configured in the External Authorization Provider, which is Sentry. If Sentry is not available or it does not manage authorization of the particular resource, such as the file or directory, then HDFS falls back to its own internal ACLs. But when ACLs are written to HDFS, HDFS always writes these internal ACLs even when Sentry is configured. This causes HDFS metadata to be polluted with Sentry ACLs. It can also cause a replication failure in replication when Sentry ACLs are not compatible with HDFS ACLs.

To prevent issues with HDFS and Sentry ACLs, complete the following steps:

Create a user account that is only used for BDR jobs since Sentry ACLs will be bypassed for this user.
For example, create a user named bdr-only-user.
Configure HDFS on the source cluster:
1. In the Cloudera Manager Admin Console, select Clusters > <HDFS service>.
2. Select Configuration and search for the following property: NameNode Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml.
3. Add the following property:
  Name: Use the following property name: dfs.namenode.inode.attributes.provider.bypass.users
  
  Value: Provide the following information: <username>, <username>@<RealmName>
  
  Replace <username> with the user you created in step 1 and <RealmName> with the name of the Kerberos realm.
  For example, the user bdr-only-user on the realm elephant requires the following value:
```
bdr-only-user, bdr-only-user@ElephantRealm
```
  Description: This field is optional.
4. Restart the NameNode.
Repeat step 2 on the destination cluster.
When you create a replication schedule, specify the user you created in step 1 in the Run As Username and Run on Peer as Username (if available) fields.
Note: The Run As Username field is used to launch MapReduce job for copying data. Run on Peer as Username field is used to run copy listing on source, if different than Run as Username.

Guidelines for Snapshot Diff-based Replication

By default, BDR uses snapshot differences ("diff") to improve performance by comparing HDFS snapshots and only replicating the files that are changed in the source directory. To use this feature, follow these guidelines:

The source and target clusters must be managed by Cloudera Manager 5.15.0 or higher. If the destination is Amazon S3 or Microsoft ADLS, the source cluster must be Managed by Cloudera Manager 5.15.0 or higher. Snapshot diff-based restore to S3 or ADLS is not supported
The source and target clusters run CDH 5.15.0 or higher, 5.14.2 or higher, or 5.13.3 or higher.
Verify that HDFS snapshots are immutable on the source and destination clusters.
In the Cloudera Manager Admin Console, go to Clusters > <HDFS service> > Configuration and search for Enable Immutable Snapshots.
Do not use snapshot diff for globbed paths. It is not optimized for globbed paths.
Set the snapshot root directory as low in the hierarchy as possible.
To use the Snapshot diff feature, the user who is configured to run the job, needs to be either a super user or the owner of the snapshottable root, because the run-as-user must have the permission to list the snapshots.
Decide if you want BDR to abort on a snapshot diff failure or continue the replication. If you choose to configure BDR to continue the replication when it encounters an error, BDR performs a complete replication. Note that continuing the replication can result in a longer duration since a complete replication is performed.
BDR performs a complete replication when one or more of the following change: Delete Policy, Preserve Policy, Target Path, or Exclusion Path.
Paths from both source and destination clusters in the replication schedule must be under a snapshottable root or should be snapshottable for the schedule to run using snapshot diff.
Maintain a maximum of one million changes in a snapshot diff for an optimum performance of the snapshot delete operation.
The time taken by a NameNode to delete a snapshot is proportional to the number of changes between the current snapshot and the previous snapshot. The changes include addition, deletion, and updation of files. If a snapshot contains more than a million changes, the snapshot delete operation might prevent the NameNode from processing other requests, which may result in premature failover thereby destabilising the cluster.

Replicating from Insecure to Secure Clusters

You can use BDR to replicate data from an insecure cluster, one that does not use Kerberos authentication, to a secure cluster, a cluster that uses Kerberos. Note that the reverse is not true. BDR does not support replicating from a secure cluster to an insecure cluster.

To perform the replication, the destination cluster must be managed by Cloudera Manager 6.1.0 or higher. The source cluster must run Cloudera Manager 5.14.0 or higher in order to be able to replicate to Cloudera Manager 6. For more information about supported replication scenarios, see Supported Replication Scenarios.

To enable replication from an insecure cluster to a secure cluster, you need a user that exists on all the hosts on both the source cluster and destination cluster. Specify this user in the Run As Username field when you create a replication schedule.

The following steps describe how to add a user:

On a host in the source or destination cluster, add a user with the following command:
```
sudo -u hdfs hdfs dfs -mkdir -p /user/<username>
```
For example, the following command creates a user named milton:
```
sudo -u hdfs hdfs dfs -mkdir -p /user/milton
```
Set the permissions for the user directory with the following command:
```
sudo -u hdfs hdfs dfs -chown <username> /user/username
```
For example, the following command makes milton the owner of the milton directory:
```
sudo -u hdfs hdfs dfs -chown milton /user/milton
```
Create the supergroup group for the user you created in step 1 with the following command:
```
groupadd supergroup
```
Add the user you created in step 1 to the group you created:
```
usermod -G supergroup <username>
```
For example, add milton to the group named supergroup:
```
usermod -G supergroup milton
```
Repeat this process for all hosts in the source and destination clusters so that the user and group exists on all of them.

After you complete this process, specify the user you created in the Run As Username field when you create a replication schedule.

Configuring Replication of HDFS Data

Verify that your cluster conforms to one of the Supported Replication Scenarios.
If you are using different Kerberos principals for the source and destination clusters, add the destination principal as a proxy user on the source cluster. For example, if you are using the hdfssrc principal on the source cluster and the hdfsdest principal on the destination cluster, add the following properties to the HDFS service Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml property on the source cluster:
```
<property>
    <name>hadoop.proxyuser.hdfsdest.groups</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.hdfsdest.hosts</name>
    <value>*</value>
</property>
```
Deploy the client configuration and restart all services on the source cluster.
If the source cluster is managed by a different Cloudera Manager server than the destination cluster, configure a peer relationship. If the source or destination is S3 or ADLS, you must configure AWS credentials or configure ADLS access.
Do one of the following:
1. Select Backup > Replication Schedules
2. Click Create Schedule > HDFS Replication.
or
1. Select Clusters > <HDFS service>.
2. Select Quick Links > Replication.
3. Click Create Schedule > HDFS Replication.
The Create HDFS Replication dialog box displays, and opens displaying the General tab. Click the Peer or AWS Credentials link if your replication job requires them and you need to create these entities.
Select the General tab to configure the following:
1. Click the Name field and add a unique name for the replication schedule.
2. Click the Source field and select the source HDFS service. You can select HDFS services managed by a peer Cloudera Manager Server, local HDFS services (managed by the Cloudera Manager Server for the Admin Console you are logged into), or you can select AWS Credentials or Azure Credentials.
3. Enter the Source Path to the directory (or file) you want to replicate.
  For replication from Amazon S3, enter the path using the following form:
```
s3a://bucket name/path
```
  For replication from ADLS Gen 1, enter the path using the following form:
```
adl://<accountname>.azuredatalakestore.net/<path>
```
  For replication from ADLS Gen 2, enter the path using the following form:
```
abfs[s]://<file_system>@<account_name>.dfs.core.windows.net/<path>/
```
  You can also use a glob path to specify more than one path for replication.
4. Click the Destination field and select the destination HDFS service from the HDFS services managed by the Cloudera Manager Server for the Admin Console you are logged into, or select AWS Credentials.
5. Enter the Destination Path where the source files should be saved.
6. For replication to S3, enter the path using the following form:
```
s3a://bucket name/path
```
  For replication to ADLS Gen 1, enter the path using the following form:
```
adl://<accountname>.azuredatalakestore.net/<path>
```
  For replication to ADLS Gen 2, enter the path using the following form:
```
abfs[s]://<file_system>@<account_name>.dfs.core.windows.net/<path>/
```
7. Select a Schedule:
  - Immediate - Run the schedule Immediately.
  - Once - Run the schedule one time in the future. Set the date and time.
  - Recurring - Run the schedule periodically in the future. Set the date, time, and interval between runs.
8. Enter the user to run the replication job in the Run As Username field. By default this is hdfs. If you want to run the job as a different user, enter the user name here. If you are using Kerberos, you must provide a user name here, and it must be one with an ID greater than 1000. (You can also configure the minimum user ID number with the min.user.id property in the YARN or MapReduce service.) Verify that the user running the job has a home directory, /user/username, owned by username:supergroup in HDFS. This user must have permissions to read from the source directory and write to the destination directory.
  Note the following:
  - The User must not be present in the list of banned users specified with the Banned System Users property in the YARN configuration (Go to the YARN service, select Configuration tab and search for the property). For security purposes, the hdfs user is banned by default from running YARN containers.
  - The requirement for a user ID that is greater than 1000 can be overridden by adding the user to the "white list" of users that is specified with the Allowed System Users property. (Go to the YARN service, select Configuration tab and search for the property.)
Select the Resources tab to configure the following:
- Scheduler Pool – (Optional) Enter the name of a resource pool in the field. The value you enter is used by the MapReduce Service you specified when Cloudera Manager executes the MapReduce job for the replication. The job specifies the value using one of these properties:
  - MapReduce – Fair scheduler: mapred.fairscheduler.pool
  - MapReduce – Capacity scheduler: queue.name
  - YARN – mapreduce.job.queuename
- Maximum Map Slots - Limits for the number of map slots per mapper. The default value is 20.
- Maximum Bandwidth - Limits for the bandwidth per mapper. The default is 100 MB.
- Replication Strategy - Whether file replication tasks should be distributed among the mappers statically or dynamically. (The default is Dynamic.) Static replication distributes file replication tasks among the mappers up front to achieve a uniform distribution based on the file sizes. Dynamic replication distributes file replication tasks in small sets to the mappers, and as each mapper completes its tasks, it dynamically acquires and processes the next unallocated set of tasks.
Select the Advanced Options tab, to configure the following:
- Add Exclusion click the link to exclude one or more paths from the replication.
  The Regular Expression-Based Path Exclusion field displays, where you can enter a regular expression-based path. When you add an exclusion, include the snapshotted relative path for the regex. For example, to exclude the /user/bdr directory, use the following regular expression, which includes the snapshots for the bdr directory:
```
.*/user/\.snapshot/.+/bdr.*
```
  To exclude top-level directories from replication in a globbed source path, you can specify the relative path for the regex without including .snapshot in the path. For example, to exclude the bdr directory from replication, use the following regular expression:
```
.*/user+/bdr.*
```
  You can add more than one regular expression to exclude.
- MapReduce Service - The MapReduce or YARN service to use.
- Log path - An alternate path for the logs.
- Description - A description of the replication schedule.
- Error Handling You can select the following:
  - Skip Checksum Checks - Whether to skip checksum checks on the copied files. If checked, checksums are not validated. Checksums are checked by default.
    Important:
    You must skip checksum checks to prevent replication failure due to non-matching checksums in the following cases:
    
    Replications from an encrypted zone on the source cluster to an encrypted zone on a destination cluster.
    
    Replications from an encryption zone on the source cluster to an unencrypted zone on the destination cluster.
    
    Replications from an unencrypted zone on the source cluster to an encrypted zone on the destination cluster.
    Checksums are used for two purposes:
    
    To skip replication of files that have already been copied. If Skip Checksum Checks is selected, the replication job skips copying a file if the file lengths and modification times are identical between the source and destination clusters. Otherwise, the job copies the file from the source to the destination.
    
    To redundantly verify the integrity of data. However, checksums are not required to guarantee accurate transfers between clusters. HDFS data transfers are protected by checksums during transfer and storage hardware also uses checksums to ensure that data is accurately stored. These two mechanisms work together to validate the integrity of the copied data.
  - Skip Listing Checksum Checks - Whether to skip checksum check when comparing two files to determine whether they are same or not. If skipped, the file size and last modified time are used to determine if files are the same or not. Skipping the check improves performance during the mapper phase. Note that if you select the Skip Checksum Checks option, this check is also skipped.
  - Abort on Error - Whether to abort the job on an error. If selected, files copied up to that point remain on the destination, but no additional files are copied. Abort on Error is off by default.
  - Abort on Snapshot Diff Failures - If a snapshot diff fails during replication, BDR uses a complete copy to replicate data. If you select this option, the BDR aborts the replication when it encounters an error instead.
- Preserve - Whether to preserve the block size, replication count, permissions (including ACLs), and extended attributes (XAttrs) as they exist on the source file system, or to use the settings as configured on the destination file system. By default source system settings are preserved. When Permission is checked, and both the source and destination clusters support ACLs, replication preserves ACLs. Otherwise, ACLs are not replicated. When Extended attributes is checked, and both the source and destination clusters support extended attributes, replication preserves them. (This option only displays when both source and destination clusters support extended attributes.)
  If you select one or more of the Preserve options and you are replicating to S3 or ADLS, the values all of these items are saved in meta data files on S3 or ADLS. When you replicate from S3 or ADls to HDFS, you can select which of these options you want to preserve.
  
  Note: To preserve permissions to HDFS, you must be running as a superuser on the destination cluster. Use the "Run As Username" option to ensure that is the case.
  
  See Replication of Encrypted Data and HDFS Transparent Encryption.
- Delete Policy - Whether files that were deleted on the source should also be deleted from the destination directory. This policy also determines the handling of files in the destination location that are unrelated to the source. Options include:
  - Keep Deleted Files - Retains the destination files even when they no longer exist at the source. (This is the default.).
  - Delete to Trash - If the HDFS trash is enabled, files are moved to the trash folder. (Not supported when replicating to S3 or ADLS.)
  - Delete Permanently - Uses the least amount of space; use with caution.
- Alerts - Whether to generate alerts for various state changes in the replication workflow. You can alert on failure, on start, on success, or when the replication workflow is aborted.
Click Save Schedule.
The replication task now appears as a row in the Replications Schedule table. (It can take up to 15 seconds for the task to appear.)
If you selected Immediate in the Schedule field, the replication job begins running when you click Save Schedule.

To specify additional replication tasks, select Create > HDFS Replication.

Limiting Replication Hosts

You can limit which hosts can run replication processes by specifying a whitelist of hosts. For example, you may not want a host with the Gateway role to run a replication job since the process is resource intensive.

To configure the hosts used for HDFS replication:

Click Clusters > <HDFS service> > Configuration.
Type HDFS Replication in the search box.
Locate the HDFS Replication Environment Advanced Configuration Snippet (Safety Valve) property.
Add the HOST_WHITELIST property. Enter a comma-separated list of DataNode hostnames to use for HDFS replication. For example:
```
HOST_WHITELIST=host-1.mycompany.com,host-2.mycompany.com
```
Enter a Reason for change, and then click Save Changes to commit the changes.

Viewing Replication Schedules

The Replications Schedules page displays a row of information about each scheduled replication job. Each row also displays recent messages regarding the last time the Replication job ran.

Replication Schedules Table

Only one job corresponding to a replication schedule can occur at a time; if another job associated with that same replication schedule starts before the previous one has finished, the second one is canceled.

You can limit the replication jobs that are displayed by selecting filters on the left. If you do not see an expected schedule, adjust or clear the filters. Use the search box to search the list of schedules for path, database, or table names.

The Replication Schedules columns are described in the following table.

Replication Schedules Table
Column	Description
ID	An internally generated ID number that identifies the schedule. Provides a convenient way to identify a schedule. Click the ID column label to sort the replication schedule table by ID.
Name	The unique name you specify when you create a schedule.
Type	The type of replication scheduled, either HDFS or Hive.
Source	The source cluster for the replication.
Destination	The destination cluster for the replication.
Throughput	Average throughput per mapper/file of all the files written. Note that throughput does not include the following information: the combined throughput of all mappers and the time taken to perform a checksum on a file after the file is written.
Progress	The progress of the replication.
Last Run	The date and time when the replication last ran. Displays None if the scheduled replication has not yet been run. Click the date and time link to view the Replication History page for the replication. Displays one of the following icons: - Successful. Displays the date and time of the last run replication. - Failed. Displays the date and time of a failed replication. - None. This scheduled replication has not yet run. - Running. Displays a spinner and bar showing the progress of the replication. Click the Last Run column label to sort the Replication Schedules table by the last run date.
Next Run	The date and time when the next replication is scheduled, based on the schedule parameters specified for the schedule. Hover over the date to view additional details about the scheduled replication. Click the Next Run column label to sort the Replication Schedules table by the next run date.
Objects	Displays on the bottom line of each row, depending on the type of replication: Hive - A list of tables selected for replication. HDFS - A list of paths selected for replication. For example:
Actions	The following items are available from the Action button: Show History - Opens the Replication History page for a replication. See Viewing Replication History. Edit Configuration - Opens the Edit Replication Schedule page. Dry Run - Simulates a run of the replication task but does not actually copy any files or tables. After a Dry Run, you can select Show History, which opens the Replication History page where you can view any error messages and the number and size of files or tables that would be copied in an actual replication. Click Collect Diagnostic Data to open the Send Diagnostic Data screen, which allows you to collect replication-specific diagnostic data for the last 10 runs of the schedule: Select Send Diagnostic Data to Cloudera to automatically send the bundle to Cloudera Support. You can also enter a ticket number and comments when sending the bundle. Click Collect and Send Diagnostic Data to generate the bundle and open the Replications Diagnostics Command screen. When the command finishes, click Download Result Data to download a zip file containing the bundle. Run Now - Runs the replication task immediately. Disable \| Enable - Disables or enables the replication schedule. No further replications are scheduled for disabled replication schedules. Delete - Deletes the schedule. Deleting a replication schedule does not delete copied files or tables.

While a job is in progress, the Last Run column displays a spinner and progress bar, and each stage of the replication task is indicated in the message beneath the job's row. Click the Command Details link to view details about the execution of the command.
If the job is successful, the number of files copied is indicated. If there have been no changes to a file at the source since the previous job, then that file is not copied. As a result, after the initial job, only a subset of the files may actually be copied, and this is indicated in the success message.
If the job fails, the icon displays.
To view more information about a completed job, select Actions > Show History. See Viewing Replication History.

Enabling, Disabling, or Deleting A Replication Schedule

When you create a new replication schedule, it is automatically enabled. If you disable a replication schedule, it can be re-enabled at a later time.

To enable, disable, or delete a replication schedule:

Click Actions > Enable|Disable|Delete in the row for a replication schedule.

To enable, disable, or delete multiple replication schedules:

Select one or more replication schedules in the table by clicking the check box the in the left column of the table.
Click Actions for Selected > Enable|Disable|Delete.

Viewing Replication History

You can view historical details about replication jobs on the Replication History page.

To view the history of a replication job:

Select Backup > Replication Schedules to go to the Replication Schedules page.
Locate the row for the job.
Click Actions > Show History.

Replication History Screen (HDFS)

Replication History Screen (Hive, Failed Replication)

The Replication History page displays a table of previously run replication jobs with the following columns:

Replication History Table
Column	Description
Start Time	Time when the replication job started. Expand the display and show details of the replication. In this screen, you can: Click the View link to open the Command Details page, which displays details and messages about each step in the execution of the command. Expand the display for a Step to: View the actual command string. View the Start time and duration of the command. Click the Context link to view the service status page relevant to the command. Select one of the tabs to view the Role Log, stdout, and stderr for the command. See Viewing Running and Recent Commands. Click Collect Diagnostic Data to open the Send Diagnostic Data screen, which allows you to collect replication-specific diagnostic data for this run of the schedule: Select Send Diagnostic Data to Cloudera to automatically send the bundle to Cloudera Support. You can also enter a ticket number and comments when sending the bundle. Click Collect and Send Diagnostic Data to generate the bundle and open the Replications Diagnostics Command screen. When the command finishes, click Download Result Data to download a zip file containing the bundle. (HDFS only) Link to view details on the MapReduce Job used for the replication. See Viewing and Filtering MapReduce Activities. (Dry Run only) View the number of Replicable Files. Displays the number of files that would be replicated during an actual replication. (Dry Run only) View the number of Replicable Bytes. Displays the number of bytes that would be replicated during an actual replication. Link to download a CSV file containing a Replication Report. This file lists the databases and tables that were replicated. View the number of Errors that occurred during the replication. View the number of Impala UDFs replicated. (Displays only for Hive/Impala replications where Replicate Impala Metadata is selected.) Click the link to download a CSV file containing a Download Listing. This file lists the files and directories that were replicated. Click the link to download a CSV file containing Download Status. If a user was specified in the Run As Username field when creating the replication job, the selected user displays. View messages returned from the replication job.
Duration	Amount of time the replication job took to complete.
Outcome	Indicates success or failure of the replication job.
Files Expected	Number of files expected to be copied, based on the parameters of the replication schedule.
Files Copied	Number of files actually copied during the replication.
Tables	(Hive only) Number of tables replicated.
Files Failed	Number of files that failed to be copied during the replication.
Files Deleted	Number of files that were deleted during the replication.
Files Skipped	Number of files skipped during the replication. The replication process skips files that already exist in the destination and have not changed.

HDFS Replication To and From Cloud Storage

You can use Cloudera Manager to replicate HDFS data to and from S3 or ADLS, however you cannot replicate data from one S3 or ADLS instance to another using Cloudera Manager. You must have the appropriate credentials to access the S3 or ADLS account. Additionally, you must create buckets in S3 or data lake store in ADLS.

When you replicate data to cloud storage with BDR, BDR also backs up file metadata, including extended attributes and ACLs.

To configure HDFS replication to S3 or ADLS:

Create AWS Credentials or Azure Credentials. See How to Configure AWS Credentials or Configuring ADLS Access Using Cloudera Manager.
Create an HDFS Replication Schedule. See HDFS Replication.

Ensure that the following basic permissions are available to provide read-write access to S3 through the S3A connector:

s3:Get*
s3:Delete*
s3:Put*
s3:ListBucket
s3:ListBucketMultipartUploads
s3:AbortMultipartUpload

Designating a Replication Source

Monitoring the Performance of HDFS Replications