Copying Data Between Two Clusters Using distcp

You can use the distcp tool on the destination cluster to initiate the copy job. Between two clusters running different versions of CDH, run the distcp tool with hftp:// as the source file system and hdfs:// as the destination file system. This uses the HFTP protocol for the source and the HDFS protocol for the destination. The default port for HFTP is 50070, and the default port for HDFS is 8020. The Amazon S3 block (s3://) and native (s3n://) filesystems are also supported, as is the newer s3a:// filesystem.

Example of a source URI: hftp://namenode-location:50070/basePath

where namenode-location refers to the CDH 4 NameNode hostname as defined by its configured fs.default.name, and 50070 is the NameNode's HTTP server port, as defined by the configured dfs.http.address.

Example of a destination URI: hdfs://nameservice-id/basePath or hdfs://namenode-location

This refers to the CDH 5 NameNode as defined by its configured fs.defaultFS.
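
If you are unsure of the configured value, you can confirm it on the destination cluster with the hdfs getconf utility:

$ hdfs getconf -confKey fs.defaultFS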

The basePath in both of the above URIs refers to the directory you want to copy, if you need to copy only a specific directory rather than the entire filesystem.

Example of an Amazon S3 Block Filesystem URI: s3://accessKeyid:secretkey@bucket/file

Example of an Amazon S3 Native Filesystem URI: s3n://accessKeyid:secretkey@bucket/file
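
Note that embedding the access key ID and secret key in the URI exposes them in logs and process listings. As a minimal alternative sketch, assuming the s3a:// connector and the same placeholder bucket and paths as above, you can instead pass the standard fs.s3a.access.key and fs.s3a.secret.key properties on the command line:

$ hadoop distcp -Dfs.s3a.access.key=ACCESS_KEY_ID \
    -Dfs.s3a.secret.key=SECRET_KEY \
    s3a://bucket/file hdfs://nameservice-id/basePath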

The distcp Command

You can use distcp to copy files between compatible clusters in either direction, from the source cluster to the destination cluster or the reverse.

For example, when upgrading from a CDH 5.7 cluster to a CDH 5.9 cluster, run distcp from the destination (CDH 5.9) cluster in this manner:

$ hadoop distcp hftp://cdh57-namenode:50070/ hdfs://CDH59-nameservice/
$ hadoop distcp s3a://bucket/ hdfs://CDH59-nameservice/

You can also use a specific path, such as /hbase to move HBase data. For example:

$ hadoop distcp hftp://cdh57-namenode:50070/hbase hdfs://CDH59-nameservice/hbase
$ hadoop distcp s3a://bucket/file hdfs://CDH59-nameservice/bucket/file
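
distcp also accepts options that control how the copy runs. As a sketch reusing the host names from the examples above, the -update flag copies only files that are missing or changed on the destination, and -m caps the number of map tasks:

$ hadoop distcp -update -m 20 hftp://cdh57-namenode:50070/hbase hdfs://CDH59-nameservice/hbase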

Protocol Support for distcp

The following table lists support for using different protocols with the distcp command on different versions of CDH. In the table, secure means that the cluster is configured to use Kerberos. Copying between a secure cluster and an insecure cluster is only supported from CDH 5.1.3 onward, due to the inclusion of HDFS-6776.

| Source | Destination | Source protocol and configuration | Destination protocol and configuration | Where to issue distcp command | Fallback Configuration Required | Status |
|---|---|---|---|---|---|---|
| CDH 4 | CDH 4 | hdfs or webhdfs, insecure | hdfs or webhdfs, insecure | Source or destination | | ok |
| CDH 4 | CDH 4 | hftp, insecure | hdfs or webhdfs, insecure | Destination | | ok |
| CDH 4 | CDH 4 | hdfs or webhdfs, secure | hdfs or webhdfs, secure | Source or destination | | ok |
| CDH 4 | CDH 4 | hftp, secure | hdfs or webhdfs, secure | Destination | | ok |
| CDH 4 | CDH 5 | webhdfs, insecure | webhdfs or hdfs, insecure | Destination | | ok |
| CDH 4 | CDH 5 (5.1.3 and newer) | webhdfs, insecure | webhdfs, secure | Destination | yes | ok |
| CDH 4 | CDH 5 | webhdfs or hftp, insecure | webhdfs or hdfs, insecure | Destination | | ok |
| CDH 4 | CDH 5 | webhdfs or hftp, secure | webhdfs or hdfs, secure | Destination | | ok |
| CDH 4 | CDH 5 | hdfs or webhdfs, insecure | webhdfs, insecure | Source | | ok |
| CDH 5 | CDH 4 | webhdfs, insecure | webhdfs, insecure | Source or destination | | ok |
| CDH 5 | CDH 4 | webhdfs, insecure | hdfs, insecure | Destination | | ok |
| CDH 5 | CDH 4 | hdfs, insecure | webhdfs, insecure | Source | | ok |
| CDH 5 | CDH 4 | hftp, insecure | hdfs or webhdfs, insecure | Destination | | ok |
| CDH 5 | CDH 4 | webhdfs, secure | webhdfs, secure | Source or destination | | ok |
| CDH 5 | CDH 4 | webhdfs, secure | hdfs, insecure | Destination | | ok |
| CDH 5 | CDH 4 | hdfs, secure | webhdfs, secure | Source | | ok |
| CDH 5 | CDH 5 | hdfs or webhdfs, insecure | hdfs or webhdfs, insecure | Source or destination | | ok |
| CDH 5 | CDH 5 | hftp, insecure | hdfs or webhdfs, insecure | Destination | | ok |
| CDH 5 | CDH 5 | hdfs or webhdfs, secure | hdfs or webhdfs, secure | Source or destination | | ok |
| CDH 5 | CDH 5 | hftp, secure | hdfs or webhdfs, secure | Destination | | ok |
| CDH 5 | CDH 5 | hdfs or webhdfs, secure | webhdfs, insecure | Source | yes | ok |
| CDH 5 | CDH 5 | hdfs or webhdfs, insecure | hdfs or webhdfs, secure | Destination | yes | ok |
To enable the fallback configuration, for copying between a secure cluster and an insecure one, add the following property to core-site.xml, by using an advanced configuration snippet if you use Cloudera Manager, or by editing the file directly otherwise.
<property>
  <name>ipc.client.fallback-to-simple-auth-allowed</name>
  <value>true</value>
</property>
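
For a one-off copy, the same property can instead be passed on the distcp command line rather than persisted in the configuration; a minimal sketch, with hypothetical host names:

$ hadoop distcp -D ipc.client.fallback-to-simple-auth-allowed=true \
    webhdfs://insecure-namenode:50070/basePath hdfs://secure-nameservice/basePath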

distcp between Secure Clusters in Distinct Kerberos Realms

Specify the Destination Parameters in krb5.conf

Edit the krb5.conf file on the client (where the distcp job will be submitted) to include the destination hostname and realm.
[realms]
HADOOP.QA.domain.COM = {
  kdc = kdc.domain.com:88
  admin_server = admin.test.com:749
  default_domain = domain.com
  supported_enctypes = arcfour-hmac:normal des-cbc-crc:normal des-cbc-md5:normal des:normal des:v4 des:norealm des:onlyrealm des:afs3
}

[domain_realm]
.domain.com = HADOOP.test.domain.COM
domain.com = HADOOP.test.domain.COM
test03.domain.com = HADOOP.QA.domain.COM
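
To confirm that the client can authenticate against the destination realm before launching the job, you can obtain and inspect a ticket; the principal shown here is hypothetical:

$ kinit alice@HADOOP.QA.domain.COM
$ klist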

(If SSL is enabled) Specify Truststore Properties

The following properties must be configured in the ssl-client.xml file on the client submitting the distcp job to establish trust between the source and destination clusters.
<property>
  <name>ssl.client.truststore.location</name>
  <value>path_to_truststore</value>
</property>

<property>
  <name>ssl.client.truststore.password</name>
  <value>XXXXXX</value>
</property>

<property>
  <name>ssl.client.truststore.type</name>
  <value>jks</value>
</property>
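
To verify that the truststore at path_to_truststore contains the expected certificates, you can list its contents with the JDK keytool utility (it prompts for the truststore password):

$ keytool -list -keystore path_to_truststore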

Set HADOOP_CONF to the Destination Cluster

Set the HADOOP_CONF path to be the destination environment. If you are not using HFTP, set the HADOOP_CONF path to the source environment instead.
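
With a standard Hadoop client, this is typically done by pointing the HADOOP_CONF_DIR environment variable at the appropriate configuration directory; the path below is hypothetical:

$ export HADOOP_CONF_DIR=/etc/hadoop/conf.destination-cluster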

Launch distcp

Run kinit on the client to obtain Kerberos credentials, then launch the distcp job:
$ hadoop distcp hdfs://test01.domain.com:8020/user/alice hdfs://test02.domain.com:8020/user/alice
If launching distcp fails, force Kerberos to use TCP instead of UDP by adding the following parameter to the krb5.conf file on the client.
[libdefaults]
udp_preference_limit = 1