This is the documentation for Cloudera Manager 4.8.4.
Documentation for other versions is available at Cloudera Documentation.

HDFS Replication

HDFS Replication enables you to copy (replicate) your HDFS data from a remote (or local) peer Cloudera Manager server to your local Cloudera Manager server (the server whose Admin Console you are currently logged into) and keep the data sets synchronized. You can add peers through the Administration > Peers tab (see Designating A Replication Source).

You can also use the Add Replication Source link on the HDFS Replication page to go to the Peers page.

  Note: HDFS replication will not work between a source cluster that has encryption enabled and a target cluster running CDH 4.0. This is because the CDH 4.0 client is used for replication in this case, and it does not support encryption.

Once you have a peer relationship set up with a Cloudera Manager server, you can configure replication of your HDFS data.

  1. From the Services tab, go to the CDH4 HDFS service where you want to host the replicated data.
  2. Click the Replication tab at the top of the page.
  3. Select the HDFS service to be the source of the replicated data. If the peer Cloudera Manager Server has multiple CDH4 HDFS services (for example, if it is managing multiple CDH4 clusters) you will be able to select the HDFS service you want to use as the source. Note that the local CDH4 HDFS service (being managed by the Cloudera Manager server you are logged into) is also available as a replication source. If the peer whose HDFS service you want is not listed, click the Add Peer link to go to the Peers page to add a Cloudera Manager peer. When you select a replication source, the Create Replication pop-up opens.
  4. Enter the path to the directory (or file) you want to replicate (the source).
  5. Enter the path where the target files should be placed.
  6. Select a schedule: You can have it run immediately, run once at a scheduled time in the future, or at regularly scheduled intervals. If you select "Once" or "Recurring" you are presented with fields that let you set the date and time and (if appropriate) the interval between runs.
  7. If you want to modify the parameters of the MapReduce job, click More Options. Here you can change the following parameters:
    • The MapReduce service to use (if there is more than one in your cluster).
    • The scheduler pool to use.
    • The user that should run the MapReduce job. By default this is hdfs. To run the MapReduce job as a different user, enter that user name here. If you are using Kerberos, you must provide a user name here, and it must be one with an ID greater than 1000.
    • An alternative path for the logs.
    • Limits for the number of map slots and for bandwidth per mapper. The defaults are unlimited.
    • Whether to abort the job on an error (default is not to do so). If the job is aborted, files copied up to that point remain on the destination, but no additional files are copied.
    • Whether to skip checksum checks (default is to perform them). If checked, checksum validation will not be performed.
    • Whether to remove files from the target directory if they have been deleted on the source. When this option is enabled, files removed from the target directory are sent to trash if HDFS trash is enabled, or are deleted permanently if trash is not enabled. Further, with this option enabled, any files in the target location that are unrelated to the source are also deleted.
    • Whether to preserve the block size, replication count, and permissions as they exist on the source file system, or to use the settings as configured on the target file system. The default is to preserve these settings as on the source.
      Note: If you leave the setting to preserve permissions enabled, you must be running as a superuser. You can use the "Run as" option to ensure that is the case.

    • Whether to generate alerts for various state changes in the replication workflow. You can alert on failure, on start, on success, or when the replication workflow is aborted.
  8. Click Save Schedule to save the replication specs.
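
The settings described in the steps above can be summarized as a small data model. The following is only an illustrative sketch: the class ReplicationOptions and all of its field names are hypothetical and are not part of Cloudera Manager or its API. The defaults mirror the ones stated above (user hdfs, unlimited map slots and bandwidth, no abort on error, checksum checks performed, source settings preserved). The default for removing deleted files is not stated above, so the value used here is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ReplicationOptions:
    """Hypothetical model of the Create Replication settings (not a Cloudera API)."""
    source_path: str
    target_path: str
    schedule: str = "immediate"                 # "immediate", "once", or "recurring"
    run_as_user: Optional[str] = None           # None means the default user, hdfs
    run_as_uid: Optional[int] = None            # numeric ID of run_as_user, if known
    max_map_slots: Optional[int] = None         # None means unlimited (the default)
    bandwidth_per_mapper: Optional[int] = None  # None means unlimited (the default)
    abort_on_error: bool = False                # default: do not abort
    skip_checksum_checks: bool = False          # default: perform checksum checks
    remove_deleted_files: bool = False          # ASSUMED default; not stated in the text
    preserve_source_settings: bool = True       # block size, replication count, permissions
    kerberos_enabled: bool = False

    def effective_user(self) -> str:
        """Return the user the MapReduce job would run as."""
        return self.run_as_user or "hdfs"

    def problems(self) -> List[str]:
        """Collect violations of the rules described in the steps above."""
        found = []
        if self.schedule not in ("immediate", "once", "recurring"):
            found.append("schedule must be immediate, once, or recurring")
        if self.kerberos_enabled:
            if self.run_as_user is None:
                found.append("with Kerberos, a user name must be provided")
            elif self.run_as_uid is None or self.run_as_uid <= 1000:
                found.append("with Kerberos, the user ID must be known and greater than 1000")
        return found


opts = ReplicationOptions("/data/logs", "/backup/logs", schedule="recurring")
print(opts.effective_user())  # hdfs
print(opts.problems())        # []
```

A model like this can serve as a checklist before filling in the Create Replication pop-up; it does not submit anything to Cloudera Manager.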

When saved, the replication job appears in the Replication list, with relevant information about the source and target locations, and the timestamp of the last run and the next scheduled run (if there is a recurring schedule). A scheduled job will show a calendar icon to the left of the job specification. If it is scheduled to run once, the calendar icon will disappear after the job has run.

To specify additional replication tasks, click the Create Replication button that appears once you have added the first replication task.

Note that only one replication can occur at a time; if another replication job starts before the previous one has finished, the second one is canceled.
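
To make the one-at-a-time rule concrete, here is a minimal simulation. It is a sketch under stated assumptions, not how Cloudera Manager is implemented: the function resolve_runs, the job names, and the use of plain integers for start times and durations are all hypothetical.

```python
from typing import List, Tuple


def resolve_runs(runs: List[Tuple[str, int, int]]) -> List[Tuple[str, str]]:
    """Given (job name, start time, duration) tuples, report which runs
    execute and which are canceled because an earlier run is still active."""
    outcomes = []
    busy_until = None  # end time of the run currently in progress, if any
    for name, start, duration in sorted(runs, key=lambda r: r[1]):
        if busy_until is not None and start < busy_until:
            # A previous replication has not finished: this run is canceled.
            outcomes.append((name, "canceled"))
        else:
            outcomes.append((name, "ran"))
            busy_until = start + duration
    return outcomes


# Times are in minutes; the job names are made up for illustration.
schedule = [("nightly", 0, 90), ("hourly", 60, 10), ("weekly", 120, 30)]
print(resolve_runs(schedule))
# [('nightly', 'ran'), ('hourly', 'canceled'), ('weekly', 'ran')]
```

When planning recurring schedules, this suggests leaving enough time between jobs for the longest expected run to finish.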

You can test the replication task without actually transferring data using the "Dry Run" feature:

  • From the Actions menu for the replication task you want to test, click Dry Run.

From the Actions menu for a replication task, in addition to Dry Run you can also:

  • Edit the job configuration
  • Run the job (immediately)
  • Delete the job
  • Disable/Enable the job (only available if the job is on a recurring schedule). When a task is disabled, a Stopped icon replaces the calendar icon, and the job entry appears in gray.

Viewing Replication Job Status

While a run is in progress, the calendar icon turns into a spinner, and each stage of the replication task is indicated in the message after the replication specification.

  • If the replication is successful, the number of files copied is indicated. If there have been no changes to a file at the source since the previous replication, then that file will not be copied. As a result, after the initial replication run, only a subset of the files may actually be copied, and this will be indicated in the success message.
  • If the replication fails, the failure is indicated and the timestamp appears in red text.
  • To view more information about completed replication runs, click anywhere in the replication job entry row in the replication list. This displays sub-entries for each past replication run.
  • To view detailed information about a particular past run, click the entry for that replication run. This opens another sub-entry that shows:
    • A result message
    • The start and end time of the replication job.
    • A link to the command details for that replication run.
    • Details about the data that was replicated.
  • When viewing a sub-entry, you can dismiss the sub-entry by clicking anywhere in its parent entry, or by clicking the return arrow icon at the top left of the sub-entry area.
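
The reason only a subset of files may be copied after the initial run can be sketched as follows. This is an illustration only: the function plan_replication is hypothetical, and representing each file by a single fingerprint string is an assumption, since this page does not describe how changes are detected. The remove_deleted flag corresponds to the option for removing deleted files from the target directory.

```python
from typing import Dict, List, Tuple


def plan_replication(
    source: Dict[str, str],
    target: Dict[str, str],
    remove_deleted: bool = False,
) -> Tuple[List[str], List[str]]:
    """Map of path -> fingerprint for each side (the fingerprint is an assumed
    stand-in for whatever change detection is actually used).
    Returns (paths to copy, paths to delete from the target)."""
    # Copy files that are new or whose fingerprint differs on the target.
    to_copy = sorted(p for p, fp in source.items() if target.get(p) != fp)
    # With the removal option on, anything on the target that is absent
    # from the source is deleted, including unrelated files.
    to_delete = sorted(p for p in target if p not in source) if remove_deleted else []
    return to_copy, to_delete


source = {"/data/a": "v1", "/data/b": "v2"}
target = {"/data/a": "v1", "/data/b": "v1", "/data/old": "v1"}
print(plan_replication(source, target))
# (['/data/b'], [])
print(plan_replication(source, target, remove_deleted=True))
# (['/data/b'], ['/data/old'])
```

In this example only /data/b is copied, which matches a success message reporting fewer files than the source directory contains.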