Configuring CDH Services for HDFS Encryption

The following topics contain recommendations for setting up HDFS encryption with various CDH services.

Hive

HDFS encryption is designed so that files cannot be moved from one encryption zone to another, or from an encryption zone to an unencrypted directory. Consequently, the landing zone for data loaded with the LOAD DATA INPATH command should always be inside the destination encryption zone.
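
For example, a minimal sketch of staging a file inside the zone before loading it; the staging path, table name, and HiveServer2 host are illustrative assumptions:

    # Stage the file inside the encryption zone that holds the table.
    # LOAD DATA INPATH performs a move, which cannot cross encryption
    # zone boundaries, so the staging directory must be in the same
    # zone as the table. (Paths, host, and table name are illustrative.)
    hdfs dfs -put sales.csv /user/hive/staging/
    beeline -u "jdbc:hive2://hs2-host.example.com:10000" \
      -e "LOAD DATA INPATH '/user/hive/staging/sales.csv' INTO TABLE sales;"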

If you want to use HDFS encryption with Hive in CDH 5.2, ensure you are using one of the following configurations:

Single Encryption Zone

With this configuration, you use HDFS encryption by keeping all Hive data inside a single encryption zone.

Recommended HDFS Path: /user/hive

For example, to configure a single encryption zone for the entire Hive warehouse, you can rename /user/hive to /user/hive-old, create an encryption zone at /user/hive, and then distcp all the data from /user/hive-old to /user/hive.

Additionally, in Cloudera Manager, configure the Hive Scratch Directory (hive.exec.scratchdir) to be inside the encryption zone by setting it to /user/hive/tmp, and ensure the permissions on /user/hive/tmp are 1777.
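
A minimal command sketch of these steps; the KMS key name is illustrative, and distcp is run with -skipcrccheck -update because checksums differ between encrypted and unencrypted copies of the same data:

    # Create an encryption key (name is illustrative).
    hadoop key create hive-key

    # Move the existing warehouse aside and create the zone in its place.
    hdfs dfs -mv /user/hive /user/hive-old
    hdfs dfs -mkdir /user/hive
    hdfs crypto -createZone -keyName hive-key -path /user/hive

    # Copy the data back in; -skipcrccheck -update because checksums
    # differ across the encryption boundary.
    hadoop distcp -skipcrccheck -update /user/hive-old /user/hive

    # Scratch directory inside the zone, world-writable with sticky bit.
    hdfs dfs -mkdir /user/hive/tmp
    hdfs dfs -chmod 1777 /user/hive/tmp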

Separate HiveServer2 For Each Encryption Zone

With this configuration, you can use multiple encryption zones and configure a separate HiveServer2 to serve data for each encryption zone. The limitation with this configuration is that data cannot be used across encryption zones. Ensure each HiveServer2's Scratch Directory (hive.exec.scratchdir) is inside its assigned encryption zone.

Recommended HDFS Paths: /data/ezA, /data/ezB

For example, you can configure two HiveServer2 instances, HS2A and HS2B, to point at encryption zones /data/ezA and /data/ezB as follows. Create two new encryption zones, /data/ezA and /data/ezB. Then create two separate HiveServer2 instances, HS2A and HS2B, configuring HS2A's Scratch Directory to point at /data/ezA/tmp and HS2B's Scratch Directory to point at /data/ezB/tmp, and ensure the permissions on both tmp directories are 1777.
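
A minimal sketch of the HDFS side of this setup; key names are illustrative, and each HiveServer2's Scratch Directory is then set in Cloudera Manager as described above:

    # One key per zone (names are illustrative).
    hadoop key create ezA-key
    hadoop key create ezB-key

    # Create the two encryption zones.
    hdfs dfs -mkdir -p /data/ezA /data/ezB
    hdfs crypto -createZone -keyName ezA-key -path /data/ezA
    hdfs crypto -createZone -keyName ezB-key -path /data/ezB

    # Per-instance scratch directories inside their zones.
    hdfs dfs -mkdir /data/ezA/tmp /data/ezB/tmp
    hdfs dfs -chmod 1777 /data/ezA/tmp /data/ezB/tmp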

Other Encrypted Directories

  • LOCALSCRATCHDIR: The MapJoin optimization in Hive writes HDFS tables to a local directory and then uploads them to the distributed cache. To keep this data encrypted, either disable MapJoin or encrypt the local Hive scratch directory (hive.exec.local.scratchdir).
  • DOWNLOADED_RESOURCES_DIR: JARs that are added to a user session and stored in HDFS are downloaded to hive.downloaded.resources.dir on the HiveServer2 local filesystem. To encrypt these JAR files, configure hive.downloaded.resources.dir to point to an encrypted local directory.
  • NodeManager Local Directory List: Hive stores JARs and MapJoin files in the distributed cache. To use MapJoin or to encrypt JARs and other resource files, configure the YARN property NodeManager Local Directory List (yarn.nodemanager.local-dirs) to point to a set of encrypted local directories on all nodes.

    Alternatively, you can disable MapJoin by setting hive.auto.convert.join to false, as in the sketch below.
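
A minimal sketch of disabling MapJoin for a single Beeline session; the HiveServer2 host and the table names are illustrative assumptions. To disable MapJoin globally instead, set hive.auto.convert.join to false in the Hive service configuration.

    # Disable MapJoin for this session only, so small tables are not
    # staged on unencrypted local disk (host and tables illustrative).
    beeline -u "jdbc:hive2://hs2-host.example.com:10000" \
      -e "SET hive.auto.convert.join=false;" \
      -e "SELECT t1.id FROM t1 JOIN t2 ON t1.id = t2.id;"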

Impala

Recommendations

  • If HDFS encryption is enabled, configure Impala to encrypt data spilled to local disk.

  • Impala does not support the LOAD DATA statement when the source and destination are in different encryption zones. If you need to use LOAD DATA, copy the data to the table's encryption zone prior to running the statement.

  • Use Cloudera Navigator Encrypt to lock down the local directory where Impala UDFs are copied during execution. By default, Impala copies UDFs into /tmp; you can configure this location through the --local_library_dir startup flag for the impalad daemon (see the snippet under Steps below).

  • Limit rename operations for internal tables once encryption zones are set up. Impala cannot do an ALTER TABLE RENAME operation to move an internal table from one database to another if the root directories for those databases are in different encryption zones. Similarly, if the encryption zone covers a table directory but not the parent directory associated with the database, Impala cannot rename an internal table, even within the same database.

  • Avoid structuring partitioned tables where different partitions reside in different encryption zones, or where any partitions reside in an encryption zone that is different from the root directory for the table. Impala cannot do an INSERT operation into any partition that is not in the same encryption zone as the root directory of the overall table.

Steps

Start every impalad process with the --disk_spill_encryption=true flag set. This encrypts all spilled data using AES-256-CFB. Set this flag using the Impala service configuration property, Impala Daemon Command Line Argument Advanced Configuration Snippet (Safety Valve), found under Impala Daemon Default Group > Advanced.
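
For example, the flag as it would appear in that safety valve field, together with the optional UDF relocation flag from the recommendations above (the /encrypted/impala/udfs path is an illustrative assumption):

    --disk_spill_encryption=true
    --local_library_dir=/encrypted/impala/udfs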

HBase

Recommendations

Make /hbase an encryption zone. Do not create encryption zones as subdirectories under /hbase, as HBase may need to rename files across those subdirectories.

Steps

On a cluster without HBase currently installed, create the /hbase directory and make that an encryption zone. On a cluster with HBase already installed, create an empty /hbase-tmp directory, make /hbase-tmp an encryption zone, distcp all data from /hbase into /hbase-tmp, and then remove /hbase and rename /hbase-tmp to /hbase.
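
A minimal sketch of the migration case; the key name is illustrative, and the HBase service should be stopped before the copy so files do not change mid-flight:

    # Create an encryption key (name is illustrative).
    hadoop key create hbase-key

    # Stop the HBase service, then set up the temporary zone.
    hdfs dfs -mkdir /hbase-tmp
    hdfs crypto -createZone -keyName hbase-key -path /hbase-tmp

    # -skipcrccheck -update: checksums differ between encrypted and
    # unencrypted copies of the same data.
    hadoop distcp -skipcrccheck -update /hbase /hbase-tmp
    hdfs dfs -rm -r -skipTrash /hbase
    hdfs dfs -mv /hbase-tmp /hbase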

Search

Recommendations

Make /solr an encryption zone.

Steps

On a cluster without Solr currently installed, create the /solr directory and make that an encryption zone. On a cluster with Solr already installed, create an empty /solr-tmp directory, make /solr-tmp an encryption zone, distcp all data from /solr into /solr-tmp, and then remove /solr and rename /solr-tmp to /solr.
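
For the fresh-install case, a minimal sketch (key name illustrative); the already-installed case mirrors the HBase migration above, with /solr and /solr-tmp in place of /hbase and /hbase-tmp:

    # Key name is illustrative.
    hadoop key create solr-key
    hdfs dfs -mkdir /solr
    hdfs crypto -createZone -keyName solr-key -path /solr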

Sqoop

Recommendations

  • For Hive support: Ensure that you are using Sqoop with the --target-dir parameter set to a directory that is inside the Hive encryption zone. For more details, see the Hive section above and the example below.
  • For append/incremental support: Make sure that the sqoop.test.import.rootDir property points to the same encryption zone as the above --target-dir argument.
  • For HCatalog support: No special configuration should be required.
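
A minimal sketch of a Sqoop import that lands inside the Hive encryption zone; the JDBC URL, credentials, and table name are illustrative assumptions:

    # All names here are illustrative; the key point is that --target-dir
    # is inside the Hive encryption zone (/user/hive).
    sqoop import \
      --connect jdbc:mysql://db-host.example.com/sales \
      --username sqoop_user -P \
      --table orders \
      --target-dir /user/hive/warehouse/orders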

Hue

Recommendations

Make /user/hue an encryption zone, since that is where Oozie workflows and other Hue-specific data are stored by default.

Steps

On a cluster without Hue currently installed, create the /user/hue directory and make that an encryption zone. On a cluster with Hue already installed, create an empty /user/hue-tmp directory, make /user/hue-tmp an encryption zone, distcp all data from /user/hue into /user/hue-tmp, and then remove /user/hue and rename /user/hue-tmp to /user/hue.
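
For the fresh-install case, a minimal sketch (key name illustrative); the already-installed case mirrors the HBase migration above, with /user/hue and /user/hue-tmp in place of /hbase and /hbase-tmp:

    # Key name is illustrative.
    hadoop key create hue-key
    hdfs dfs -mkdir /user/hue
    hdfs crypto -createZone -keyName hue-key -path /user/hue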

Spark

Recommendations

  • By default, application event logs are stored at /user/spark/applicationHistory, which can be made into an encryption zone (see the sketch after this list).
  • Spark also optionally caches its JAR file at /user/spark/share/lib (by default), but encrypting this directory is not necessary.
  • Spark does not encrypt shuffle data. To encrypt it, configure Spark's local directory, spark.local.dir (in Standalone mode), to reside on an encrypted disk. For YARN mode, make the corresponding change to the NodeManager local directories (yarn.nodemanager.local-dirs).
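
A minimal sketch of putting the event log directory in an encryption zone, assuming the directory is empty or newly created (key name illustrative); if it already holds logs, migrate through a temporary zone as in the HBase steps above:

    # Key name is illustrative; -createZone requires an empty directory.
    hadoop key create spark-key
    hdfs dfs -mkdir -p /user/spark/applicationHistory
    hdfs crypto -createZone -keyName spark-key -path /user/spark/applicationHistory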

MapReduce and YARN

MapReduce v1

Recommendations

MRv1 stores both history and logs on local disks by default. Even if you do configure history to be stored on HDFS, the files are not renamed. Hence, no special configuration is required.

MapReduce v2 (YARN)

Recommendations

Make /user/history a single encryption zone, since history files are moved between the intermediate and done directories, and HDFS encryption does not allow moving encrypted files across encryption zones.

Steps

On a cluster with MRv2 (YARN) installed, create the /user/history directory and make that an encryption zone. If /user/history already exists and is not empty, create an empty /user/history-tmp directory, make /user/history-tmp an encryption zone, distcp all data from /user/history into /user/history-tmp, and then remove /user/history and rename /user/history-tmp to /user/history.
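
For the case where /user/history does not yet exist, a minimal sketch (key name illustrative); the non-empty case mirrors the HBase migration above, with /user/history and /user/history-tmp in place of /hbase and /hbase-tmp:

    # Key name is illustrative.
    hadoop key create history-key
    hdfs dfs -mkdir /user/history
    hdfs crypto -createZone -keyName history-key -path /user/history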