HDFS Data At Rest Encryption

HDFS Encryption implements transparent, end-to-end encryption of data read from and written to HDFS, without requiring changes to application code. Because the encryption is end-to-end, data can be encrypted and decrypted only by the client. HDFS never stores or has access to unencrypted data or unencrypted encryption keys. This supports both at-rest encryption (data on persistent media, such as a disk) and in-transit encryption (data traveling over a network).


Use Cases

Data encryption is required by a number of different government, financial, and regulatory entities. For example, the healthcare industry has HIPAA regulations, the card payment industry has PCI DSS regulations, and the United States government has FISMA regulations. Transparent encryption in HDFS makes it easier for organizations to comply with these regulations. Encryption can also be performed at the application level, but by integrating it into HDFS, existing applications can operate on encrypted data without changes. This integrated architecture also implements stronger encrypted file semantics and better coordination with other HDFS functions.

Architecture

Encryption Zones

An encryption zone is a directory in HDFS whose entire contents, that is, every file and subdirectory in it, are encrypted. Files in this directory are transparently encrypted on write and transparently decrypted on read. Each encryption zone is associated with a key that is specified when the zone is created. Each file within an encryption zone also has its own encryption/decryption key, called the Data Encryption Key (DEK). DEKs are never stored persistently unless they are encrypted with the encryption zone's key; the encrypted form of a DEK is known as the EDEK. The EDEK is stored persistently as part of the file's metadata on the NameNode.

A key can have multiple key versions, where each key version has its own distinct key material (that is, the portion of the key used during encryption and decryption). Key rotation is achieved by modifying the encryption zone's key, that is, bumping up its version. Per-file key rotation is then achieved by re-encrypting the file's DEK with the new encryption zone key to create new EDEKs. An encryption key can be fetched either by its key name, returning the latest version of the key, or by a specific key version.
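The zone-key/DEK/EDEK relationship can be sketched outside HDFS with a toy openssl walkthrough. This is a conceptual sketch only: the key names are illustrative and this is not how HDFS derives keys internally. It requires OpenSSL 1.1.1 or later for the -pbkdf2 option.

```shell
# Toy walkthrough of the DEK/EDEK relationship (illustrative only)
zone_key=$(openssl rand -hex 32)   # encryption zone key, held only by the KMS
dek=$(openssl rand -hex 32)        # per-file data encryption key

# EDEK: the DEK encrypted under the zone key; only this form is persisted
edek=$(printf '%s' "$dek" | openssl enc -aes-256-ctr -pass pass:"$zone_key" -pbkdf2 -base64 -A)

# Decrypting the EDEK with the zone key recovers the DEK
recovered=$(printf '%s' "$edek" | openssl enc -d -aes-256-ctr -pass pass:"$zone_key" -pbkdf2 -base64 -A)
[ "$recovered" = "$dek" ] && echo "DEK recovered"
```

Rolling the zone key (for example, with hadoop key roll <key_name>) creates a new key version used for new EDEKs without rewriting existing file data.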

Key Management Server

To store, manage, and access encryption keys, a new service must be added to your cluster: the Hadoop Key Management Server (KMS). The KMS is a proxy that interfaces with a backing key store on behalf of HDFS daemons and clients. Both the backing key store and the KMS implement the Hadoop KeyProvider client API.

Encryption and decryption of EDEKs happens entirely on the KMS. More importantly, the client requesting creation or decryption of an EDEK never handles the EDEK's encryption key (that is, the encryption zone key). When a new file is created in an encryption zone, the NameNode asks the KMS to generate a new EDEK encrypted with the encryption zone's key. When reading a file from an encryption zone, the NameNode provides the client with the file's EDEK and the encryption zone key version that was used to encrypt the EDEK. The client then asks the KMS to decrypt the EDEK, which involves checking that the client has permission to access the encryption zone key version. Assuming that is successful, the client uses the DEK to decrypt the file's contents. All the steps for read and write take place automatically through interactions between the DFSClient, the NameNode, and the KMS.

Access to encrypted file data and metadata is controlled by normal HDFS filesystem permissions. Typically, the backing key store is configured to only allow end-user access to the encryption zone keys used to encrypt DEKs. This means that EDEKs can be safely stored and handled by HDFS, since the hdfs user will not have access to EDEK encryption keys. As a result, if HDFS is compromised (for example, by gaining unauthorized access to a superuser account), a malicious user only gains access to the ciphertext and EDEKs. This does not pose a security threat, since access to encryption zone keys is controlled by a separate set of permissions on the KMS and key store.

For more details on configuring the KMS, see Configuring the Key Management Server (KMS).

Navigator Key Trustee

HDFS encryption can use a local Java KeyStore for key management. This is not sufficient for production environments where a more robust and secure key management solution is required. Cloudera Navigator Key Trustee Server is a key store for managing encryption keys and other secure deposits.

In order to leverage the manageable, highly-available key management capabilities of the Navigator Key Trustee Server, Cloudera provides a custom KMS service, the Key Trustee KMS.

For more information on integrating Navigator Key Trustee Server with HDFS encryption, see Integrating HDFS Encryption with Navigator Key Trustee Server.

crypto Command Line Interface

createZone

Use this command to create a new encryption zone.
-createZone -keyName <keyName> -path <path>
Where:
  • path: The path of the encryption zone to be created. It must be an empty directory.
  • keyName: Name of the key to use for the encryption zone.
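For example, assuming a key named mykey already exists in the KMS (the key name and path below are placeholders, and the target directory must already exist and be empty):

```shell
hdfs crypto -createZone -keyName mykey -path /secure/zone
```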

listZones

List all encryption zones. This command requires superuser permissions.
-listZones

Enabling HDFS Encryption on a Cluster

Minimum Required Role: Full Administrator

Optimizing for HDFS Data at Rest Encryption

CDH implements the Advanced Encryption Standard New Instructions (AES-NI), which provide substantial performance improvements. To get these improvements, you need a recent version of libcrypto.so on HDFS and MapReduce client hosts, that is, on any host from which you originate HDFS or MapReduce requests. Many OS versions ship an older version of the library that does not support AES-NI. The instructions that follow tell you what to do for each OS version that CDH supports.

(See Supported Operating Systems for the full list of supported operating systems.)

RHEL/CentOS 6.5 or later

The installed version of libcrypto.so supports AES-NI, but you need to install the openssl-devel package on all clients:
$ sudo yum install openssl-devel

RHEL/CentOS 6.4 or earlier 6.x versions, or SLES 11

Download and extract a newer version of libcrypto.so from a CentOS 6.5 repository and install it on all clients in /var/lib/hadoop/extra/native/:
  1. Download the latest version of the openssl package. For example:
    $ wget http://mirror.centos.org/centos/6/os/x86_64/Packages/openssl-1.0.1e-30.el6.x86_64.rpm
    The libcrypto.so file in this package can be used on SLES 11 as well as RHEL/CentOS.
  2. Decompress the files in the package, but do not install it:
    $ rpm2cpio openssl-1.0.1e-30.el6.x86_64.rpm | cpio -idmv
  3. If you are using parcels, create the /var/lib/hadoop/extra/native/ directory:
    $ sudo mkdir -p /var/lib/hadoop/extra/native
  4. Copy the shared library into /var/lib/hadoop/extra/native/. Name the target file libcrypto.so, with no suffix at the end, exactly as in the command that follows.
    $ sudo cp ./usr/lib64/libcrypto.so.1.0.1e /var/lib/hadoop/extra/native/libcrypto.so

RHEL/CentOS 5

In this case, you need to build libcrypto.so and copy it to all clients:
  1. On one client, compile and install openssl from source:
    $ wget http://www.openssl.org/source/openssl-1.0.1j.tar.gz
    $ tar xzf openssl-1.0.1j.tar.gz
    $ cd openssl-1.0.1j
    $ ./config --shared --prefix=/opt/openssl-1.0.1j
    $ make
    $ sudo make install
  2. If you are using parcels, create the /var/lib/hadoop/extra/native/ directory:
    $ sudo mkdir -p /var/lib/hadoop/extra/native
  3. Copy the files into /var/lib/hadoop/extra/native/:
    $ sudo cp /opt/openssl-1.0.1j/lib/libcrypto.so /var/lib/hadoop/extra/native
  4. Copy the files to the remaining clients using a utility such as rsync.

Debian Wheezy

The installed version of libcrypto.so supports AES-NI, but you need to install the libssl-dev package on all clients:
$ sudo apt-get install libssl-dev

Ubuntu Precise and Ubuntu Trusty

Install the libssl-dev package on all clients:
$ sudo apt-get install libssl-dev

Testing if Encryption Optimization Works

To verify that a client host is ready to make use of the AES-NI instruction set optimization for HDFS encryption at rest, use the following command:
hadoop checknative
You should see a response such as the following:
14/12/12 13:48:39 INFO bzip2.Bzip2Factory: Successfully loaded & initialized native-bzip2 library system-native
14/12/12 13:48:39 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
Native library checking:
hadoop:  true /usr/lib/hadoop/lib/native/libhadoop.so.1.0.0
zlib:    true /lib64/libz.so.1
snappy:  true /usr/lib64/libsnappy.so.1
lz4:     true revision:99
bzip2:   true /lib64/libbz2.so.1
openssl: true /usr/lib64/libcrypto.so
If you see true in the openssl row, Hadoop has detected the right version of libcrypto.so and optimization will work. If you see false in this row, you do not have the right version.

Adding the KMS Service

  1. Make sure you have performed the steps described in Optimizing for HDFS Data at Rest Encryption, depending on the operating system you are using.
  2. On the Cloudera Manager Home page, click to the right of the cluster name and select Add a Service. A list of service types displays. You can add one type of service at a time.
  3. Select the Java KeyStore KMS service and click Continue.
  4. Customize the assignment of role instances to hosts. You can click the View By Host button for an overview of the role assignment by hostname ranges.

    Click the field below the Key Management Server (KMS) role to display a dialog containing a list of hosts. Select the host for the new KMS role and click OK.

  5. Review and modify the JavaKeyStoreProvider Directory configuration setting if required and click Continue. The Java KeyStore KMS service is started.
  6. Click Continue, then click Finish. You are returned to the Home page.
  7. Verify the new Java KeyStore KMS service has started properly by checking its health status. If the Health Status is Good, then the service started properly.
  8. Follow the steps in Enabling Java KeyStore KMS for the HDFS Service.

Enabling Java KeyStore KMS for the HDFS Service

  1. Go to the HDFS service.
  2. Click the Configuration tab.
  3. Select Scope > HDFS (Service-Wide).
  4. Select Category > All.
  5. Locate the KMS Service property or search for it by typing its name in the Search box.
  6. Select the Java KeyStore KMS radio button for the KMS Service property.
  7. Click Save Changes.
  8. Restart your cluster.
    1. On the Home page, click to the right of the cluster name and select Restart.
    2. Click Restart in the screen that appears to confirm. The Command Details window shows the progress of stopping services.

      When All services successfully started appears, the task is complete and you can close the Command Details window.

  9. Deploy client configuration.
    1. On the Home page, click to the right of the cluster name and select Deploy Client Configuration.
    2. Click Deploy Client Configuration.

Configuring Encryption Properties for HDFS and the NameNode

Configure the following properties to select the encryption algorithm and KeyProvider that will be used during encryption. If you do not modify these properties, the default values will use AES-CTR to encrypt your data.

Selecting an Encryption Algorithm: Set the following properties in the core-site.xml safety valve and redeploy client configuration.

hadoop.security.crypto.codec.classes.EXAMPLECIPHERSUITE

The prefix for a given crypto codec; contains a comma-separated list of implementation classes for that codec (for example, EXAMPLECIPHERSUITE). The first implementation will be used if available; the others are fallbacks.

By default, the cipher suite used is AES/CTR/NoPadding and its default classes are org.apache.hadoop.crypto.OpensslAesCtrCryptoCodec and org.apache.hadoop.crypto.JceAesCtrCryptoCodec as described in the following properties.

hadoop.security.crypto.cipher.suite

Cipher suite for crypto codec.

Default: AES/CTR/NoPadding
hadoop.security.crypto.codec.classes.aes.ctr.nopadding

Comma-separated list of crypto codec implementations for the default cipher suite: AES/CTR/NoPadding. The first implementation will be used if available, others are fallbacks.

Default: org.apache.hadoop.crypto.OpensslAesCtrCryptoCodec, org.apache.hadoop.crypto.JceAesCtrCryptoCodec
hadoop.security.crypto.jce.provider

The JCE provider name used in CryptoCodec.

Default: None
hadoop.security.crypto.buffer.size

The buffer size used by CryptoInputStream and CryptoOutputStream.

Default: 8192
KeyProvider Configuration: Set this property in the hdfs-site.xml safety valve and restart the NameNode.
dfs.encryption.key.provider.uri

The KeyProvider to be used when interacting with encryption keys that are used to read and write to an encryption zone.

If you have a managed cluster, Cloudera Manager will point to the KMS server you have enabled above.

NameNode Configuration: Set this property in the hdfs-site.xml safety valve and restart the NameNode.
dfs.namenode.list.encryption.zones.num.responses

When listing encryption zones, the maximum number of zones that will be returned in a batch. Fetching the list incrementally in batches improves NameNode performance.

Default: 100
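As a sketch, the safety-valve entries are ordinary Hadoop XML property blocks. The KMS hostname below is a placeholder; on a Cloudera Manager-managed cluster, dfs.encryption.key.provider.uri is set for you when you enable the KMS as described above.

```xml
<!-- hdfs-site.xml safety valve; kms-host.example.com is a placeholder -->
<property>
  <name>dfs.encryption.key.provider.uri</name>
  <value>kms://http@kms-host.example.com:16000/kms</value>
</property>
<property>
  <!-- optional: batch size when listing encryption zones (default 100) -->
  <name>dfs.namenode.list.encryption.zones.num.responses</name>
  <value>100</value>
</property>
```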

Creating Encryption Zones

Once a KMS has been set up and the NameNode and HDFS clients have been correctly configured, use the hadoop key and hdfs crypto command-line tools to create encryption keys and set up new encryption zones.

  • Create an encryption key for your zone as the application user that will be using the key. For example, if you are creating an encryption zone for HBase, create the key as the hbase user as follows:
    $ sudo -u hbase hadoop key create <key_name>
  • Create a new empty directory and make it an encryption zone using the key created above.
    $ hadoop fs -mkdir /zone
    $ hdfs crypto -createZone -keyName <key_name> -path /zone
    You can verify creation of the new encryption zone by running the -listZones command. You should see the encryption zone along with its key listed as follows:
    $ sudo -u hdfs hdfs crypto -listZones 
    /zone    <key_name>

For more information and recommendations on creating encryption zones for each CDH component, see Configuring CDH Services for HDFS Encryption.

Adding Files to an Encryption Zone

Existing data can be encrypted by copying it into the new encryption zones using tools like DistCp. See the DistCp Considerations section below for information on using DistCp with encrypted data files.

You can add files to an encryption zone by copying them over to the encryption zone. For example:
sudo -u hdfs hadoop distcp /user/dir /user/enczone

Backing Up Encryption Keys

It is critical that you regularly back up your encryption keys. Failure to do so can result in irretrievable loss of encrypted data.

If you are using the Java KeyStore KMS, make sure you regularly back up the Java KeyStore that stores the encryption keys. If you are using the Key Trustee KMS and Key Trustee Server, see Backing Up and Restoring Key Trustee Server for instructions on backing up Key Trustee Server and Key Trustee KMS.
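For the Java KeyStore KMS, a minimal backup can be a copy of the keystore file itself. The paths below are hypothetical placeholders; check your JavaKeyStoreProvider Directory setting for the real location.

```shell
# Paths are placeholders; adjust to your JavaKeyStoreProvider directory
sudo cp /var/lib/kms/kms.keystore /backup/kms.keystore.bak
# Optionally verify the copy is a readable keystore (prompts for its password)
keytool -list -storetype jceks -keystore /backup/kms.keystore.bak
```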

DistCp Considerations

A common use case for DistCp is to replicate data between clusters for backup and disaster recovery purposes. This is typically performed by the cluster administrator, who is an HDFS superuser. To retain this workflow when using HDFS encryption, a new virtual path prefix has been introduced, /.reserved/raw/, that gives superusers direct access to the underlying block data in the filesystem. This allows superusers to distcp data without requiring access to encryption keys, and avoids the overhead of decrypting and re-encrypting data. It also means the source and destination data will be byte-for-byte identical, which would not be true if the data were re-encrypted with a new EDEK.
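A superuser raw copy between clusters might look like the following; the NameNode hostnames and paths are placeholders. The -px flag preserves extended attributes, which carry the EDEKs needed to decrypt the files at the destination.

```shell
# Copy raw (still-encrypted) bytes between clusters as the superuser.
# nn1/nn2 and the paths are placeholders for your environment.
sudo -u hdfs hadoop distcp -px \
  hdfs://nn1.example.com:8020/.reserved/raw/data/enczone \
  hdfs://nn2.example.com:8020/.reserved/raw/data/enczone
```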

Copying between encrypted and unencrypted locations

By default, distcp compares checksums provided by the filesystem to verify that data was successfully copied to the destination. When copying between an unencrypted and encrypted location, the filesystem checksums will not match since the underlying block data is different.

In this case, you can specify the -skipcrccheck and -update flags to avoid verifying checksums.
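For example (paths are placeholders):

```shell
# Copy from an unencrypted directory into an encryption zone; filesystem
# checksums cannot match across the boundary, so skip CRC verification
sudo -u hdfs hadoop distcp -update -skipcrccheck /unencrypted/dir /zone/dir
```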

Attack Vectors

Each category below describes the type of exploit, the issue it raises, and its mitigation.
Hardware Access Exploit

These exploits assume the attacker has gained physical access to hard drives from cluster machines, that is, DataNodes and NameNodes.

Access to swap files of processes containing DEKs. This exploit does not expose cleartext, as it also requires access to encrypted block files. It can be mitigated by disabling swap, using encrypted swap, or using mlock to prevent keys from being swapped out.
Access to encrypted block files. This exploit does not expose cleartext, as it also requires access to the DEKs. It can only be mitigated by restricting physical access to the cluster machines.
Root Access Exploits

These exploits assume the attacker has gained root shell access to cluster machines running DataNodes and NameNodes. Many of these exploits cannot be addressed in HDFS, since a malicious root user has access to the in-memory state of processes holding encryption keys and cleartext. For these exploits, the only mitigation technique is carefully restricting and monitoring root shell access.

Access to encrypted block files.

By itself, this does not expose cleartext, as it also requires access to encryption keys.

No mitigation required.
Dump memory of client processes to obtain DEKs, delegation tokens, cleartext.

No mitigation.

Recording network traffic to sniff encryption keys and encrypted data in transit.

By itself, insufficient to read cleartext without the EDEK encryption key.

No mitigation required.
Dump memory of DataNode process to obtain encrypted block data.

By itself, insufficient to read cleartext without the DEK.

No mitigation required.
Dump memory of NameNode process to obtain encrypted data encryption keys.

By itself, insufficient to read cleartext without the EDEK's encryption key and encrypted block files.

No mitigation required.
HDFS Admin Exploits

These exploits assume that the attacker has compromised HDFS, but does not have root or hdfs user shell access.

Access to encrypted block files.

By itself, insufficient to read cleartext without the EDEK and EDEK encryption key.

No mitigation required.
Access to encryption zone and encrypted file metadata (including encrypted data encryption keys), using -fetchImage.

By itself, insufficient to read cleartext without EDEK encryption keys.

No mitigation required.
Rogue User Exploits
  A rogue user can collect keys to which they have access, and use them later to decrypt encrypted data. This can be mitigated through periodic key rolling policies.