Configuring CDH Services for HDFS Encryption

The following topics contain recommendations for setting up HDFS Transparent Encryption with various CDH services.

Hive

HDFS encryption has been designed so that files cannot be moved from one encryption zone to another or from encryption zones to unencrypted directories. Therefore, the landing zone for data when using the LOAD DATA INPATH command must always be inside the destination encryption zone.

To use HDFS encryption with Hive, ensure you are using one of the following configurations:

Single Encryption Zone

With this configuration, you can use HDFS encryption by having all Hive data inside the same encryption zone. In Cloudera Manager, configure the Hive Scratch Directory (hive.exec.scratchdir) to be inside the encryption zone.

Recommended HDFS Path: /user/hive

For example, to configure a single encryption zone for the entire Hive warehouse, you can rename /user/hive to /user/hive-old, create an encryption zone at /user/hive, and then use distcp to copy all the data from /user/hive-old to /user/hive.

In Cloudera Manager, configure the Hive Scratch Directory (hive.exec.scratchdir) to be inside the encryption zone by setting it to /user/hive/tmp, ensuring that permissions are 1777 on /user/hive/tmp.
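
The single-zone setup described above can be sketched as a short script. This is a minimal sketch, not a definitive procedure: the key name passed to the function is hypothetical and is assumed to already exist in the KMS, the commands must run as an HDFS superuser, and the run helper only prints each command unless DRY_RUN=0 is set.

```shell
# Sketch: place the whole Hive warehouse in one encryption zone.
# run prints the command by default; set DRY_RUN=0 to execute it.
# (Dry-run output does not reproduce shell quoting.)
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# $1 = name of an existing encryption key (hypothetical, for example hive-key)
encrypt_hive_warehouse() {
  key="$1"
  run hdfs dfs -mv /user/hive /user/hive-old
  run hdfs dfs -mkdir /user/hive
  # An encryption zone can only be created on an empty directory.
  run hdfs crypto -createZone -keyName "$key" -path /user/hive
  # Checksums differ between unencrypted and encrypted copies,
  # so the CRC check is skipped.
  run hadoop distcp -update -skipcrccheck /user/hive-old /user/hive
  # Scratch directory inside the zone, world-writable with the sticky bit.
  run hdfs dfs -mkdir -p /user/hive/tmp
  run hdfs dfs -chmod 1777 /user/hive/tmp
}
```

Calling encrypt_hive_warehouse hive-key prints the six commands in order, so they can be reviewed before running them with DRY_RUN=0.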

Multiple Encryption Zones

With this configuration, you can use encrypted databases or tables with different encryption keys. To read data from read-only encrypted tables, users must have access to a temporary directory that is encrypted at least as strongly as the table.

For example:

  1. Configure two encrypted tables, ezTbl1 and ezTbl2.
  2. Create two new encryption zones, /data/ezTbl1 and /data/ezTbl2.
  3. Load data to the tables in Hive using LOAD statements.

For more information, see Changed Behavior after HDFS Encryption is Enabled.
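
As an illustration of step 2, the zones for ezTbl1 and ezTbl2 could be created with one key per table. This is a sketch under assumptions: the key names are hypothetical (lowercase, because some key providers reject uppercase key names), and the run helper only prints commands unless DRY_RUN=0 is set.

```shell
# Sketch: one encryption key and one encryption zone per table.
# run prints the command by default; set DRY_RUN=0 to execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# $1 = table directory name under /data, $2 = key name (hypothetical)
create_table_zone() {
  table="$1"
  key="$2"
  run hadoop key create "$key"
  run hdfs dfs -mkdir -p "/data/$table"
  run hdfs crypto -createZone -keyName "$key" -path "/data/$table"
}
```

For the example above, the function would be called twice: create_table_zone ezTbl1 key-eztbl1 and create_table_zone ezTbl2 key-eztbl2.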

Other Encrypted Directories

  • LOCALSCRATCHDIR: The MapJoin optimization in Hive writes HDFS tables to a local directory and then uploads them to the distributed cache. To ensure these files are encrypted, either disable MapJoin by setting hive.auto.convert.join to false, or encrypt the local Hive Scratch directory (hive.exec.local.scratchdir) using Cloudera Navigator Encrypt.
  • DOWNLOADED_RESOURCES_DIR: JARs that are added to a user session and stored in HDFS are downloaded to hive.downloaded.resources.dir on the HiveServer2 local filesystem. To encrypt these JAR files, configure Cloudera Navigator Encrypt to encrypt the directory specified by hive.downloaded.resources.dir.
  • NodeManager Local Directory List: Hive stores JARs and MapJoin files in the distributed cache. To use MapJoin or encrypt JARs and other resource files, the yarn.nodemanager.local-dirs YARN configuration property must be configured to a set of encrypted local directories on all nodes.

Changed Behavior after HDFS Encryption is Enabled

  • Loading data from one encryption zone to another results in a copy of the data. DistCp is used to speed up the process if the files being copied are larger than the value specified by HIVE_EXEC_COPYFILE_MAXSIZE. The default threshold is 32 MB, which you can change by setting the hive.exec.copyfile.maxsize configuration property.
  • When loading data to encrypted tables, Cloudera strongly recommends using a landing zone inside the same encryption zone as the table.
    • Example 1: Loading unencrypted data to an encrypted table - Use one of the following methods:
      • If you are loading new unencrypted data to an encrypted table, use the LOAD DATA ... statement. Because the source data is not inside the encryption zone, the LOAD statement results in a copy. For this reason, Cloudera recommends landing data that you need to encrypt inside the destination encryption zone. You can use distcp to speed up the copying process if your data is inside HDFS.
      • If the data to be loaded is already inside a Hive table, you can create a new table with a LOCATION inside an encryption zone as follows:
        CREATE TABLE encrypted_table [STORED AS] LOCATION ... AS SELECT * FROM <unencrypted_table>
        The location specified in the CREATE TABLE statement must be inside an encryption zone. Creating a table pointing LOCATION to an unencrypted directory does not encrypt your source data. You must copy your data to an encryption zone, and then point LOCATION to that zone.
    • Example 2: Loading encrypted data to an encrypted table - If the data is already encrypted, use the CREATE TABLE statement pointing LOCATION to the encrypted source directory containing the data. This is the fastest way to create encrypted tables.
      CREATE TABLE encrypted_table [STORED AS] LOCATION <encrypted_source_directory>
  • Users reading data from read-only encrypted tables must have access to a temporary directory that is encrypted at least as strongly as the table.
  • Temporary data is now written to a directory named .hive-staging in each table or partition.
  • Previously, an INSERT OVERWRITE on a partitioned table inherited permissions for new data from the existing partition directory. With encryption enabled, permissions are inherited from the table.
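
Whether a load is a rename or a copy depends on whether the landing directory and the table directory share an encryption zone. The helper below is a hypothetical sketch for checking this: it assumes a zone listing on standard input in which each zone line starts with the zone path (as printed by hdfs crypto -listZones, which requires superuser privileges), and it ignores lines that do not start with a slash.

```shell
# Sketch: print the encryption zone that contains path $1,
# or print nothing if the path is not inside any listed zone.
# Zone lines are read from stdin; the first field is the zone path.
zone_of() {
  awk -v p="$1" '
    $1 ~ /^\// {
      z = $1
      # Match the zone root itself or anything below it.
      if (p == z || z == "/" || index(p, z "/") == 1) {
        if (length(z) > length(best)) best = z
      }
    }
    END { if (best != "") print best }'
}
```

For example, hdfs crypto -listZones | zone_of /data/ezTbl1/landing would print /data/ezTbl1 if that zone exists. If the source and destination of a LOAD resolve to different zones, or one of them resolves to no zone, expect a copy rather than a move.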

Impala

Recommendations

  • If HDFS encryption is enabled, configure Impala to encrypt data spilled to local disk.

  • In releases earlier than Impala 2.2.0 / CDH 5.4.0, Impala does not support the LOAD DATA statement when the source and destination are in different encryption zones. If you are running an affected release and need to use LOAD DATA with HDFS encryption enabled, copy the data to the table's encryption zone before running the statement.

  • Use Cloudera Navigator Encrypt to lock down the local directory where Impala UDFs are copied during execution. By default, Impala copies UDFs into /tmp, and you can configure this location through the --local_library_dir startup flag for the impalad daemon.

  • Limit rename operations on internal tables once encryption zones are set up. Impala cannot perform an ALTER TABLE RENAME operation to move an internal table from one database to another if the root directories for those databases are in different encryption zones. If the encryption zone covers a table directory but not the parent directory associated with the database, Impala cannot perform an ALTER TABLE RENAME operation to rename an internal table, even within the same database.

  • Avoid structuring partitioned tables where different partitions reside in different encryption zones, or where any partitions reside in an encryption zone that is different from the root directory for the table. Impala cannot do an INSERT operation into any partition that is not in the same encryption zone as the root directory of the overall table.

  • If the data files for a table or partition are in a different encryption zone than the HDFS trashcan, use the PURGE keyword at the end of the DROP TABLE or ALTER TABLE DROP PARTITION statement to delete the HDFS data files immediately. Otherwise, the data files are left behind if they cannot be moved to the trashcan because of differing encryption zones. This syntax is available in Impala 2.3 / CDH 5.5 and higher.
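
To illustrate the PURGE recommendation, the two statements can be generated as follows. This is a small sketch: the table name and partition specification are placeholders supplied by the caller, and the functions only build the statement text.

```shell
# Sketch: build DROP statements that bypass the HDFS trashcan.
# $1 = table name (optionally qualified with a database)
drop_table_purge_sql() {
  printf 'DROP TABLE %s PURGE;\n' "$1"
}

# $1 = table name, $2 = partition specification such as year=2015
drop_partition_purge_sql() {
  printf 'ALTER TABLE %s DROP PARTITION (%s) PURGE;\n' "$1" "$2"
}
```

A statement could then be passed to the shell, for example impala-shell -q "$(drop_table_purge_sql sales.orders)", where sales.orders is a hypothetical table.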

Steps

Start every impalad process with the --disk_spill_encryption=true flag set. This encrypts all spilled data using AES-256-CFB. Set this flag using the Impala service configuration property Impala Daemon Command Line Argument Advanced Configuration Snippet (Safety Valve), under Impala Daemon Default Group > Advanced.

HBase

Recommendations

Make /hbase an encryption zone. Do not create encryption zones as subdirectories under /hbase, because HBase may need to rename files across those subdirectories.

Steps

On a cluster without HBase currently installed, create the /hbase directory and make that an encryption zone.

On a cluster with HBase already installed, perform the following steps:
  1. Stop the HBase service.
  2. Move data from the /hbase directory to /hbase-tmp.
  3. Create an empty /hbase directory and make it an encryption zone.
  4. DistCp all data from /hbase-tmp to /hbase, preserving user-group permissions and extended attributes.
  5. Start the HBase service and verify that it is working as expected.
  6. Remove the /hbase-tmp directory.
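
Steps 2 through 4 and step 6 might look like the following on the command line. This is a sketch under assumptions: the key name is hypothetical and must already exist, the -pugpx option is one reading of "preserving user-group permissions and extended attributes" (user, group, permission, and extended attributes), and stopping and starting HBase is left to Cloudera Manager. Commands are only printed unless DRY_RUN=0 is set.

```shell
# Sketch: move existing HBase data into an encryption zone at /hbase.
# run prints the command by default; set DRY_RUN=0 to execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# $1 = name of an existing encryption key (hypothetical)
migrate_hbase() {
  key="$1"
  run hdfs dfs -mv /hbase /hbase-tmp
  run hdfs dfs -mkdir /hbase
  run hdfs crypto -createZone -keyName "$key" -path /hbase
  # -p with u, g, p, x preserves user, group, permission, and xattrs.
  run hadoop distcp -update -skipcrccheck -pugpx /hbase-tmp /hbase
}

# Run only after HBase has been restarted and verified (step 6).
cleanup_hbase_tmp() {
  run hdfs dfs -rm -r /hbase-tmp
}
```

Before restarting HBase, it is worth checking that the ownership and permissions of the new /hbase directory itself match those of the old one.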

Search

Recommendations

Make /solr an encryption zone.

Steps

On a cluster without Solr currently installed, create the /solr directory and make that an encryption zone.

On a cluster with Solr already installed:

  1. Create an empty /solr-tmp directory.
  2. Make /solr-tmp an encryption zone.
  3. DistCp all data from /solr into /solr-tmp.
  4. Remove /solr, and rename /solr-tmp to /solr.
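
The four steps above follow a create, copy, and swap pattern that can be sketched as one function. The function name and key name are hypothetical, the key is assumed to exist already, and the service that owns the directory is assumed to be stopped while the swap runs. Commands are only printed unless DRY_RUN=0 is set.

```shell
# Sketch: replace a directory with an encrypted copy of itself.
# run prints the command by default; set DRY_RUN=0 to execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# $1 = directory to encrypt (for example /solr), $2 = existing key name
swap_into_zone() {
  dir="$1"
  key="$2"
  tmp="${dir}-tmp"
  run hdfs dfs -mkdir "$tmp"
  run hdfs crypto -createZone -keyName "$key" -path "$tmp"
  run hadoop distcp -update -skipcrccheck "$dir" "$tmp"
  run hdfs dfs -rm -r "$dir"
  # Renaming the zone root keeps the zone and its key intact.
  run hdfs dfs -mv "$tmp" "$dir"
}
```

For Solr this would be called as swap_into_zone /solr solr-key. Because the directory is a parameter, the same sketch applies to any other directory that is migrated with this pattern.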

Sqoop

Recommendations

  • For Hive support: Ensure that you are using Sqoop with the --target-dir parameter set to a directory that is inside the Hive encryption zone. For more details, see Hive.
  • For append/incremental support: Make sure that the sqoop.test.import.rootDir property points to the same encryption zone as the --target-dir argument.
  • For HCatalog support: No special configuration is required.
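
For the Hive case, an import that lands data inside the Hive encryption zone might be invoked as sketched below. The JDBC URL, table name, and landing directory are hypothetical values supplied by the caller, and the command is only printed unless DRY_RUN=0 is set.

```shell
# Sketch: Sqoop import whose target directory is inside the Hive zone.
# run prints the command by default; set DRY_RUN=0 to execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# $1 = JDBC URL, $2 = source table, $3 = directory inside the Hive zone
sqoop_import_to_zone() {
  run sqoop import --connect "$1" --table "$2" --target-dir "$3" --hive-import
}
```

With the single-zone layout described under Hive, the third argument would be a path below /user/hive, for example /user/hive/landing/orders.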

Hue

Recommendations

Make /user/hue an encryption zone because Oozie workflows and other Hue-specific data are stored there by default.

Steps

On a cluster without Hue currently installed, create the /user/hue directory and make it an encryption zone.

On a cluster with Hue already installed:

  1. Create an empty /user/hue-tmp directory.
  2. Make /user/hue-tmp an encryption zone.
  3. DistCp all data from /user/hue into /user/hue-tmp.
  4. Remove /user/hue and rename /user/hue-tmp to /user/hue.

Spark

Recommendations

  • By default, application event logs are stored at /user/spark/applicationHistory, which can be made into an encryption zone.
  • Spark also optionally caches its JAR file at /user/spark/share/lib (by default), but encrypting this directory is not required.
  • Spark does not encrypt shuffle data. To protect shuffle data, configure the Spark local directory, spark.local.dir (in Standalone mode), to reside on an encrypted disk. For YARN mode, make the corresponding YARN configuration changes.
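
In Standalone mode, the local directory can be set per application, as in this sketch. The directory is a hypothetical mount point on an encrypted disk, and the command is only printed unless DRY_RUN=0 is set. On YARN this setting has no effect, because the NodeManager local directories are used instead.

```shell
# Sketch: submit a Spark application with spark.local.dir on an encrypted disk.
# run prints the command by default; set DRY_RUN=0 to execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# $1 = directory on an encrypted disk; remaining arguments go to spark-submit
spark_submit_encrypted_local() {
  local_dir="$1"
  shift
  run spark-submit --conf "spark.local.dir=$local_dir" "$@"
}
```

For example, spark_submit_encrypted_local /encrypted/spark-local app.py prints the spark-submit command with the --conf option placed before the application.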

MapReduce and YARN

MapReduce v1

Recommendations

MRv1 stores both history and logs on local disks by default. Even if you do configure history to be stored on HDFS, the files are not renamed. Hence, no special configuration is required.

MapReduce v2 (YARN)

Recommendations

Make /user/history a single encryption zone, because history files are moved between the intermediate and done directories, and HDFS encryption does not allow moving encrypted files across encryption zones.

Steps

On a cluster with MRv2 (YARN) installed, create the /user/history directory and make that an encryption zone.

If /user/history already exists and is not empty:

  1. Create an empty /user/history-tmp directory.
  2. Make /user/history-tmp an encryption zone.
  3. DistCp all data from /user/history into /user/history-tmp.
  4. Remove /user/history and rename /user/history-tmp to /user/history.