HDFS Replication

Minimum Required Role: BDR Administrator (also provided by Full Administrator)

HDFS replication enables you to copy (replicate) your HDFS data from one HDFS service to another, keeping the data set on the target service synchronized with the data set on the source service, based on a user-specified replication schedule. The target service must be managed by the Cloudera Manager Server where the replication is being set up; the source service can be managed either by that same server or by a peer Cloudera Manager Server.

Configuring Replication of HDFS Data

  1. Verify that your cluster conforms to the supported replication scenarios.
  2. If the source cluster is managed by a different Cloudera Manager server from the target cluster, configure a peer relationship.
  3. Do one of the following:
    • From the Backup tab, select Replications.
    • From the Clusters tab, go to the HDFS service and select the Replication tab.
    The Schedules tab of the Replications page displays.
  4. Click the Schedule HDFS Replication link.
  5. Select the source HDFS service from the HDFS services managed by the peer Cloudera Manager Server or the HDFS services managed by the Cloudera Manager Server whose Admin Console you are logged into.
  6. Enter the path to the directory (or file) you want to replicate (the source).
  7. Select the target HDFS service from the HDFS services managed by the Cloudera Manager Server whose Admin Console you are logged into.
  8. Enter the path where the target files should be placed.
  9. Select a schedule. You can have the job run immediately, run once at a scheduled time in the future, or run at regularly scheduled intervals. If you select Once or Recurring, you are presented with fields that let you set the date and time and (if appropriate) the interval between runs.
  10. If you want to modify the parameters of the job, click More Options. Here you will be able to change the following parameters:
    • MapReduce Service - The MapReduce or YARN service to use.
    • Scheduler Pool - The scheduler pool to use.
    • Run as - The user that should run the job. By default this is hdfs. If you want to run the job as a different user, enter that user name here. If you are using Kerberos, you must provide a user name here, and it must be one with an ID greater than 1000. Verify that the user running the job has a home directory, /user/<username>, owned by username:supergroup in HDFS (a sketch for checking this follows these steps).
    • Log path - An alternative path for the logs.
    • Maximum map slots and Maximum bandwidth - Limits for the number of map slots and for bandwidth per mapper. The defaults are unlimited.
    • Abort on error - Whether to abort the job on an error (the default is not to abort). If the job aborts, files copied up to that point remain on the destination, but no additional files are copied.
    • Skip Checksum Checks - Whether to skip checksum checks (the default is to perform them). If checked, checksum validation will not be performed.
    • Remove deleted files - Whether to remove deleted files from the target directory if they have been removed on the source. When this option is enabled, files deleted from the target directory are sent to trash if HDFS trash is enabled, or are deleted permanently if trash is not enabled. In addition, when this option is enabled, any files in the target location that do not exist on the source are also deleted.
    • Preserve - Whether to preserve the block size, replication count, and permissions as they exist on the source file system, or to use the settings as configured on the target file system. The default is to preserve these settings as on the source.
    • Alerts - Whether to generate alerts for various state changes in the replication workflow. You can alert on failure, on start, on success, or when the replication workflow is aborted.
  11. Click Save Schedule.
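
Before saving the schedule, you can confirm the home-directory requirement for the Run as user from any host where the hdfs command is available. The following is a minimal sketch of such a check; the user name bdruser is hypothetical, and the commands are standard HDFS shell commands run through Python rather than part of the Cloudera Manager workflow.

    # Sketch: verify that the "Run as" user has a home directory owned by
    # <username>:supergroup in HDFS. "bdruser" is a hypothetical user name.
    import subprocess

    USER = "bdruser"
    HOME = "/user/%s" % USER

    # `hdfs dfs -ls -d` prints a single line describing the directory itself, e.g.:
    # drwxr-xr-x   - bdruser supergroup          0 2016-06-01 00:00 /user/bdruser
    result = subprocess.run(["hdfs", "dfs", "-ls", "-d", HOME],
                            capture_output=True, text=True)
    if result.returncode != 0:
        print("%s does not exist; create it and set ownership, for example:" % HOME)
        print("  hdfs dfs -mkdir %s" % HOME)
        print("  hdfs dfs -chown %s:supergroup %s" % (USER, HOME))
    else:
        fields = result.stdout.strip().splitlines()[-1].split()
        owner, group = fields[2], fields[3]
        if (owner, group) != (USER, "supergroup"):
            print("%s is owned by %s:%s; expected %s:supergroup"
                  % (HOME, owner, group, USER))
        else:
            print("%s is owned by %s:supergroup as required." % (HOME, USER))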

To specify additional replication tasks, select Create > HDFS Replication.
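
Replication schedules can also be defined outside the Admin Console through the Cloudera Manager REST API. The following is a minimal sketch using Python and the requests library; the host name, API version, cluster, service, and peer names, the "items" wrapper, and the hdfsArguments field names are assumptions that should be verified against the REST API documentation for your Cloudera Manager release.

    # Sketch: create a recurring HDFS replication schedule through the Cloudera
    # Manager REST API. All names, the API version, and the argument field names
    # below are assumptions to check against your release's API documentation.
    import requests

    CM = "https://cm.example.com:7183/api/v11"   # hypothetical CM server / API version
    AUTH = ("admin", "admin")                    # use a BDR Administrator account

    schedule = {
        "startTime": "2016-06-01T00:00:00.000Z",
        "interval": 1,
        "intervalUnit": "DAY",                   # recurring: once per day
        "paused": False,
        "hdfsArguments": {
            "sourceService": {                   # source HDFS service on the peer
                "peerName": "peer1",
                "clusterName": "Cluster1",
                "serviceName": "HDFS-1",
            },
            "sourcePath": "/data/inbound",       # directory (or file) to replicate
            "destinationPath": "/data/inbound",  # where to place it on the target
            "mapreduceServiceName": "YARN-1",    # MapReduce or YARN service to use
            "userName": "bdruser",               # the "Run as" user
            "skipChecksumChecks": False,
            "removeMissingFiles": False,         # "Remove deleted files"
            "preserveBlockSize": True,
            "preserveReplicationCount": True,
            "preservePermissions": True,
            "abortOnError": False,
        },
    }

    # POST the schedule to the target HDFS service's replications resource.
    resp = requests.post(
        "%s/clusters/TargetCluster/services/HDFS-TARGET/replications" % CM,
        json={"items": [schedule]},
        auth=AUTH,
    )
    resp.raise_for_status()
    print(resp.json())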

A replication task appears in the All Replications list, with relevant information about the source and target locations, the timestamp of the last job, and the next scheduled job (if there is a recurring schedule). A scheduled job will show a calendar icon to the left of the task specification. If the task is scheduled to run once, the calendar icon will disappear after the job has run.

Only one job corresponding to a replication schedule can occur at a time; if another job associated with that same replication schedule starts before the previous one has finished, the second one is canceled.

From the Actions menu for a replication task, you can:
  • Test the replication task without actually transferring data ("Dry Run")
  • Edit the task configuration
  • Run the task (immediately)
  • Delete the task
  • Disable or enable the task (if the task is on a recurring schedule). When a task is disabled, instead of the calendar icon you will see a Stopped icon, and the job entry will appear in gray.
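
The Run and Dry Run actions can also be triggered programmatically as a run command on a saved schedule. The sketch below reuses the hypothetical names from the earlier sketch; the endpoint and the dryRun parameter are assumptions to verify against the REST API documentation for your Cloudera Manager release.

    # Sketch: start a saved replication schedule immediately through the Cloudera
    # Manager REST API, optionally as a dry run. Names, the endpoint, and the
    # dryRun parameter are assumptions to verify for your release.
    import requests

    CM = "https://cm.example.com:7183/api/v11"
    AUTH = ("admin", "admin")
    SCHEDULE_ID = 1   # hypothetical ID returned when the schedule was created

    resp = requests.post(
        "%s/clusters/TargetCluster/services/HDFS-TARGET/replications/%d/run"
        % (CM, SCHEDULE_ID),
        params={"dryRun": "true"},   # test the flow without transferring data
        auth=AUTH,
    )
    resp.raise_for_status()
    print(resp.json())               # the command that was launched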

Viewing Replication Job Status

  • While a job is in progress, the calendar icon turns into a spinner, and each stage of the replication task is indicated in the message after the replication specification.
  • If the job is successful, the number of files copied is indicated. If there have been no changes to a file at the source since the previous job, then that file will not be copied. As a result, after the initial job, only a subset of the files may actually be copied, and this will be indicated in the success message.
  • If the job fails, a failure icon displays.
  • For Dry Run jobs, no data is transferred; the Dry Run action only tests the replication flow. By default, up to 1024 replicable source files are tested. The actual number of files tested is equal to 1024 divided by the number of mappers, converted to an integer with a minimum value of 1 (for example, with 20 mappers, 51 files are tested).
  • To view more information about a completed job, click the task row in the Replications list. This displays sub-entries for each past job.
  • To view detailed information about a past job, click the entry for that job. This opens another sub-entry that shows:
    • A result message
    • The start and end time of the job.
    • A link to the command details for that replication job.
    • Details about the data that was replicated.
  • When viewing a sub-entry, you can dismiss the sub-entry by clicking anywhere in its parent entry, or by clicking the return arrow icon at the top left of the sub-entry area.
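
Past runs of a schedule can also be read programmatically rather than through the Replications list. The sketch below assumes the same hypothetical names used earlier and a history endpoint that returns one entry per past job; the endpoint and field names should be verified against the REST API documentation for your Cloudera Manager release.

    # Sketch: list recent runs of a replication schedule through the Cloudera
    # Manager REST API. Names, the endpoint, and field names are assumptions
    # to verify for your release.
    import requests

    CM = "https://cm.example.com:7183/api/v11"
    AUTH = ("admin", "admin")
    SCHEDULE_ID = 1   # hypothetical schedule ID

    resp = requests.get(
        "%s/clusters/TargetCluster/services/HDFS-TARGET/replications/%d/history"
        % (CM, SCHEDULE_ID),
        params={"limit": 5},         # most recent jobs first
        auth=AUTH,
    )
    resp.raise_for_status()
    for job in resp.json().get("items", []):
        # Each entry describes one past job: start/end time, outcome, result message.
        print(job.get("startTime"), job.get("endTime"),
              "succeeded" if job.get("success") else "failed",
              "-", job.get("resultMessage"))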