Copying Data Between Two Clusters Using distcp

You can use the distcp tool on the destination cluster to initiate the copy job that moves the data. To copy between two clusters running different versions of CDH, run the distcp tool with hftp:// as the source file system and hdfs:// as the destination file system. HFTP is a read-only protocol served by the source NameNode's HTTP server, while the destination is written over HDFS. The default port for HFTP is 50070, and the default port for HDFS is 8020. The Amazon S3 block and native filesystems are also supported, through the s3://, s3n://, and s3a:// protocols.

Example of a source URI: hftp://namenode-location:50070/basePath

where namenode-location refers to the CDH 4 NameNode hostname as defined by its configured fs.default.name and 50070 is the NameNode's HTTP server port, as defined by the configured dfs.http.address.

Example of a destination URI: hdfs://nameservice-id/basePath or hdfs://namenode-location

This refers to the CDH 5 NameNode as defined by its configured fs.defaultFS.

In both of the above URIs, the optional basePath refers to the specific directory you want to copy.

Example of an Amazon S3 Block Filesystem URI: s3://accessKeyid:secretkey@bucket/file

Example of an Amazon S3 Native Filesystem URI: s3n://accessKeyid:secretkey@bucket/file
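Embedding the access key and secret key directly in the URI exposes them in shell history and job logs. As a sketch (the bucket name, paths, and key values are placeholders), credentials can instead be passed to the s3a connector as configuration properties on the command line; the leading echo makes this a dry run you can inspect before executing:

```shell
# Dry run (echo): pass S3A credentials as configuration properties
# instead of embedding them in the URI.
# ACCESS_KEY, SECRET_KEY, and example-bucket are placeholders.
echo hadoop distcp \
  -Dfs.s3a.access.key=ACCESS_KEY \
  -Dfs.s3a.secret.key=SECRET_KEY \
  s3a://example-bucket/basePath hdfs://nameservice-id/basePath
```

Remove the echo to run the copy for real.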

The distcp Command

You can use distcp to copy files between compatible clusters in either direction, issuing the command from either the source or the destination cluster.

For example, when upgrading from a CDH 5.7 cluster to a CDH 5.9 cluster, run distcp from the CDH 5.9 (destination) cluster in this manner:

$ hadoop distcp hftp://cdh57-namenode:50070/ hdfs://CDH59-nameservice/
$ hadoop distcp s3a://bucket/ hdfs://CDH59-nameservice/

You can also copy a specific path, such as /hbase to move HBase data. For example:

$ hadoop distcp hftp://cdh57-namenode:50070/hbase hdfs://CDH59-nameservice/hbase
$ hadoop distcp s3a://bucket/file hdfs://CDH59-nameservice/bucket/file
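distcp also accepts options that control how the copy runs: for instance, -update skips files that already exist at the destination with matching size and checksum, and -m caps the number of simultaneous map tasks. A dry run (via echo), reusing the placeholder hostnames from the examples above:

```shell
# Dry run (echo): incremental copy that skips unchanged files (-update)
# and limits the job to 20 concurrent map tasks (-m 20).
# Hostnames and paths are the same placeholders used above.
echo hadoop distcp -update -m 20 \
  hftp://cdh57-namenode:50070/hbase hdfs://CDH59-nameservice/hbase
```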

Protocol Support for distcp

The following table lists support for using different protocols with the distcp command on different versions of CDH. In the table, secure means that the cluster is configured to use Kerberos. Copying between a secure cluster and an insecure cluster is only supported from CDH 5.1.3 onward, due to the inclusion of HDFS-6776.

| Source | Destination | Source protocol and configuration | Destination protocol and configuration | Where to issue distcp command | Fallback configuration required | Status |
|--------|-------------|-----------------------------------|----------------------------------------|-------------------------------|---------------------------------|--------|
| CDH 4 | CDH 4 | hdfs or webhdfs, insecure | hdfs or webhdfs, insecure | Source or destination | | ok |
| CDH 4 | CDH 4 | hftp, insecure | hdfs or webhdfs, insecure | Destination | | ok |
| CDH 4 | CDH 4 | hdfs or webhdfs, secure | hdfs or webhdfs, secure | Source or destination | | ok |
| CDH 4 | CDH 4 | hftp, secure | hdfs or webhdfs, secure | Destination | | ok |
| CDH 4 | CDH 5 | webhdfs, insecure | webhdfs or hdfs, insecure | Destination | | ok |
| CDH 4 | CDH 5 (5.1.3 and newer) | webhdfs, insecure | webhdfs, secure | Destination | Yes | ok |
| CDH 4 | CDH 5 | webhdfs or hftp, insecure | webhdfs or hdfs, insecure | Destination | | ok |
| CDH 4 | CDH 5 | webhdfs or hftp, secure | webhdfs or hdfs, secure | Destination | | ok |
| CDH 5 | CDH 4 | webhdfs, insecure | webhdfs, insecure | Source or destination | | ok |
| CDH 5 | CDH 4 | webhdfs, insecure | hdfs, insecure | Destination | | ok |
| CDH 5 | CDH 4 | hdfs, insecure | webhdfs, insecure | Source | | ok |
| CDH 5 | CDH 4 | hftp, insecure | hdfs or webhdfs, insecure | Destination | | ok |
| CDH 5 | CDH 4 | webhdfs, secure | webhdfs, secure | Source or destination | | ok |
| CDH 5 | CDH 4 | webhdfs, secure | hdfs, insecure | Destination | | ok |
| CDH 5 | CDH 4 | hdfs, secure | webhdfs, secure | Source | | ok |
| CDH 5 | CDH 5 | hdfs or webhdfs, insecure | hdfs or webhdfs, insecure | Source or destination | | ok |
| CDH 5 | CDH 5 | hftp, insecure | hdfs or webhdfs, insecure | Destination | | ok |
| CDH 5 | CDH 5 | hdfs or webhdfs, secure | hdfs or webhdfs, secure | Source or destination | | ok |
| CDH 5 | CDH 5 | hftp, secure | hdfs or webhdfs, secure | Destination | | ok |
| CDH 5 | CDH 5 (5.1.3 and newer) | hdfs or webhdfs, secure | hdfs or webhdfs, insecure | Source | Yes | ok |
To enable the fallback configuration for copying between a secure cluster and an insecure one, add the following property to the HDFS core-default.xml. If you use Cloudera Manager, add it through an advanced configuration snippet; otherwise, edit the file directly.
<property>
  <name>ipc.client.fallback-to-simple-auth-allowed</name>
  <value>true</value>
</property>
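If editing the cluster-wide configuration is impractical, the same property can typically be supplied for a single copy job with the -D generic option. A dry run (via echo) with placeholder cluster names:

```shell
# Dry run (echo): enable the simple-auth fallback for one distcp job only.
# insecure-namenode and secure-nameservice are placeholders.
echo hadoop distcp \
  -D ipc.client.fallback-to-simple-auth-allowed=true \
  webhdfs://insecure-namenode:50070/ hdfs://secure-nameservice/
```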