This is the documentation for CDH 5.0.x. Documentation for other versions is available at Cloudera Documentation.

Using DistCp to Migrate Data between two Clusters

You can use the DistCp tool on the CDH 5 cluster to initiate the copy job that moves the data. When the two clusters run different versions of CDH, run DistCp with hftp:// as the source file system and hdfs:// as the destination file system.

Example of a source URI: hftp://namenode-location:50070/basePath

where namenode-location is the CDH 4 NameNode hostname, as defined by its configured fs.default.name, and 50070 is the NameNode's HTTP server port, as defined by its configured dfs.http.address.
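
As a minimal sketch of how the source URI is assembled from these settings, the following defines a hypothetical helper, hftp_source_uri (not part of Hadoop), that takes a host:port value in the form used by dfs.http.address plus an optional base path. The hostname and path are the illustrative values used elsewhere on this page. If the configured address binds to a wildcard such as 0.0.0.0, you would need to substitute the real NameNode hostname.

```shell
# Hypothetical helper (not part of Hadoop): build an hftp:// source URI
# from a host:port value in the form used by dfs.http.address.
hftp_source_uri() {
  http_address=$1      # host:port, for example cdh4-namenode:50070
  base_path=${2:-/}    # optional directory to copy; defaults to the root
  printf 'hftp://%s%s\n' "$http_address" "$base_path"
}

# Example with literal values:
hftp_source_uri cdh4-namenode:50070 /hbase

# On a CDH 4 node with a configured hdfs client, the address could
# instead be read from the configuration (an assumption about your setup):
#   hftp_source_uri "$(hdfs getconf -confKey dfs.http.address)" /hbase
```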

Example of a destination URI: hdfs://nameservice-id/basePath or hdfs://namenode-location

Here, nameservice-id or namenode-location refers to the CDH 5 NameNode, as defined by its configured fs.defaultFS.

The basePath in both URIs above is the directory you want to copy, if you need to copy a specific directory.
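
The destination URI can be sketched the same way. The hypothetical helper below, hdfs_dest_uri (not part of Hadoop), appends an optional base path to a value in the form used by fs.defaultFS, which already includes the hdfs:// scheme. The nameservice name is the illustrative one used on this page.

```shell
# Hypothetical helper (not part of Hadoop): append an optional base path
# to an fs.defaultFS-style value such as hdfs://CDH5-nameservice.
hdfs_dest_uri() {
  default_fs=${1%/}    # drop one trailing slash, if present
  base_path=${2:-/}    # optional directory to copy into; defaults to the root
  printf '%s%s\n' "$default_fs" "$base_path"
}

# Example with literal values:
hdfs_dest_uri hdfs://CDH5-nameservice /hbase

# On a CDH 5 node with a configured hdfs client, the value could
# instead be read from the configuration (an assumption about your setup):
#   hdfs_dest_uri "$(hdfs getconf -confKey fs.defaultFS)" /hbase
```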

The DistCp Command

To see all the options available for the DistCp tool, run the following command to display the built-in help:

$ hadoop distcp

Run the DistCp copy by issuing a command such as the following on the CDH 5 cluster:

Important: Run the following DistCp commands on the destination cluster only (in this example, the CDH 5 cluster).

$ hadoop distcp hftp://cdh4-namenode:50070/ hdfs://CDH5-nameservice/

To copy a specific path instead, such as /hbase to move HBase data, use a command such as:

$ hadoop distcp hftp://cdh4-namenode:50070/hbase hdfs://CDH5-nameservice/hbase
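
When several paths need to be copied, it may help to wrap the command in a small function. The following is a sketch only: migrate_path and the DRY_RUN variable are hypothetical names, not part of Hadoop, and the host and nameservice values are the illustrative ones used above. By default the function prints the command it would run; setting DRY_RUN=0 on the CDH 5 cluster would run it.

```shell
# Hypothetical wrapper (not part of Hadoop) around the commands shown above.
# $1 = CDH 4 NameNode host:port, $2 = CDH 5 nameservice ID, $3 = path to copy
migrate_path() {
  src="hftp://$1$3"
  dst="hdfs://$2$3"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    # Dry run (the default): only print the command.
    echo "hadoop distcp $src $dst"
  else
    # Real run: must be issued on the destination (CDH 5) cluster.
    hadoop distcp "$src" "$dst"
  fi
}

# Print the command for the HBase example:
migrate_path cdh4-namenode:50070 CDH5-nameservice /hbase
```

Keeping the dry run as the default means the source and destination URIs can be reviewed before any MapReduce job is submitted.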

DistCp then submits a regular MapReduce job that performs a file-by-file copy.

Page generated September 3, 2015.