Replication of Encrypted Data

Beginning with CDH 5.3, HDFS supports encryption of data at rest (including data accessed through Hive). This section describes the behavior with respect to encryption during replication, depending on whether or not the source and target are in encryption zones, and the procedure for encrypting data in transit between the source and target clusters.

Encrypting Data in Transit Between Clusters

A source directory and a target directory may or may not be in an encryption zone. For more information about HDFS encryption zones, see HDFS Data At Rest Encryption. There are four possible scenarios that determine whether replicated data is encrypted at rest on the target:
  • Source and target directories are both in an encryption zone - In this case, the data on the target directory is encrypted.
  • Source directory is not in an encryption zone, possibly because the source cluster uses a version of CDH earlier than 5.3 (the first version to support encryption zones), but the target directory is in an encryption zone - In this case, the data on the target directory is encrypted.
  • Source directory is in an encryption zone and target directory is not - In this case, the data on the target directory is not encrypted.
  • Neither the source nor the target directory is in an encryption zone - In this case, the data on the target directory is not encrypted.
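
The four scenarios reduce to one rule: replicated data is encrypted at rest on the target only when the target directory is in an encryption zone. The following is a minimal illustrative sketch of that rule, not part of any CDH or Cloudera Manager API; the function names target_data_encrypted and scenario_table are hypothetical.

```python
from itertools import product


def target_data_encrypted(source_in_zone: bool, target_in_zone: bool) -> bool:
    """Return True if replicated data ends up encrypted at rest on the target.

    Hypothetical helper: source_in_zone is accepted only to mirror the four
    scenarios; per those scenarios it does not change the outcome.
    """
    return target_in_zone


def scenario_table():
    """Enumerate all four source/target combinations with their outcome."""
    return [
        (source, target, target_data_encrypted(source, target))
        for source, target in product([True, False], repeat=2)
    ]


if __name__ == "__main__":
    for source, target, encrypted in scenario_table():
        print(
            f"source in zone={source!s:5}  "
            f"target in zone={target!s:5}  "
            f"target data encrypted={encrypted}"
        )
```

This models only the at-rest state on the target; it says nothing about whether the data is encrypted while in transit.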

Even when the source and target directories are both in encryption zones, the data is decrypted as it is read from the source cluster (using the key for the source encryption zone) and encrypted again when it is written to the target cluster (using the key for the target encryption zone). By default, the data is passed over the wire between the two clusters as plain text.

During replication, data travels from the source cluster to the target cluster using distcp, in plain text by default. To encrypt data on the wire between the source and target using SSL/TLS: