Configuring Encryption for Data Spills

Some CDH services can encrypt data that lives temporarily on the local filesystem outside HDFS. This is typically data that spills to disk when a memory-intensive operation causes the service to exceed its allotted memory limit on a host. You can enable on-disk spill encryption for the following services.

MapReduce v2 (YARN)

MapReduce v2 allows you to encrypt intermediate files generated during encrypted shuffle, as well as data spilled to disk during the map and reduce stages. To enable this, set the following properties in mapred-site.xml:

mapreduce.job.encrypted-intermediate-data: Enables or disables encryption for intermediate MapReduce spills.

Default: false

mapreduce.job.encrypted-intermediate-data-key-size-bits: The length, in bits, of the key used to encrypt data spilled to disk.

Default: 128

mapreduce.job.encrypted-intermediate-data.buffer.kb: The buffer size, in KB, for the stream written to disk after encryption.

Default: 128
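
As an illustration, the three properties above could be set in mapred-site.xml as in the following minimal sketch. Only the first property is changed from its default; the other two are shown at their default values of 128 for completeness. On a cluster managed by Cloudera Manager, these would normally be applied through the service configuration rather than by editing the file directly, which is an assumption about your deployment.

```xml
<configuration>
  <!-- Turn on encryption of intermediate MapReduce data (default: false) -->
  <property>
    <name>mapreduce.job.encrypted-intermediate-data</name>
    <value>true</value>
  </property>
  <!-- Key length in bits for spill encryption (shown at its default) -->
  <property>
    <name>mapreduce.job.encrypted-intermediate-data-key-size-bits</name>
    <value>128</value>
  </property>
  <!-- Buffer size in KB for the encrypted output stream (shown at its default) -->
  <property>
    <name>mapreduce.job.encrypted-intermediate-data.buffer.kb</name>
    <value>128</value>
  </property>
</configuration>
```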

HBase

HBase does not write data outside HDFS, so it does not require spill encryption.

Impala

Impala lets certain memory-intensive operations write temporary data to disk when they approach their memory limit on a host. For details, read SQL Operations that Spill to Disk. To enable disk spill encryption in Impala:

  1. Go to the Cloudera Manager Admin Console.
  2. Click the Configuration tab.
  3. Select Scope > Impala Daemon.
  4. Select Category > Security.
  5. Select the checkbox for the Disk Spill Encryption property.
  6. Click Save Changes to commit the changes.

Hive

Hive jobs occasionally write data temporarily to local directories. If you enable HDFS encryption, you must ensure that the following intermediate local directories are also protected:

  • LOCALSCRATCHDIR: The MapJoin optimization in Hive writes HDFS tables to a local directory and then uploads them to the distributed cache. To ensure these files are encrypted, either disable MapJoin by setting hive.auto.convert.join to false, or encrypt the local Hive Scratch directory (hive.exec.local.scratchdir) using Cloudera Navigator Encrypt.
  • DOWNLOADED_RESOURCES_DIR: JARs that are added to a user session and stored in HDFS are downloaded to hive.downloaded.resources.dir on the HiveServer2 local filesystem. To encrypt these JAR files, configure Cloudera Navigator Encrypt to encrypt the directory specified by hive.downloaded.resources.dir.
  • NodeManager Local Directory List: Hive stores JARs and MapJoin files in the distributed cache. To use MapJoin or encrypt JARs and other resource files, the yarn.nodemanager.local-dirs YARN configuration property must point to a set of encrypted local directories on all nodes.
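
The Hive-side options in the list above can be sketched as a hive-site.xml fragment. This is an illustration only: the two directory paths are hypothetical placeholders for locations that you have already protected with Cloudera Navigator Encrypt. Setting a path here does not encrypt anything by itself. The first property is an alternative to encrypting the local scratch directory, so you would typically use either it or the scratch directory setting, not necessarily both.

```xml
<configuration>
  <!-- Alternative 1: disable MapJoin so tables are not written to the local scratch directory -->
  <property>
    <name>hive.auto.convert.join</name>
    <value>false</value>
  </property>
  <!-- Alternative 2: keep MapJoin and place the local scratch directory on an encrypted location.
       The path below is a hypothetical example. -->
  <property>
    <name>hive.exec.local.scratchdir</name>
    <value>/encrypted/hive/scratch</value>
  </property>
  <!-- Directory on the HiveServer2 host where session JARs are downloaded.
       The path below is a hypothetical example. -->
  <property>
    <name>hive.downloaded.resources.dir</name>
    <value>/encrypted/hive/resources</value>
  </property>
</configuration>
```

The yarn.nodemanager.local-dirs property belongs to the YARN configuration rather than hive-site.xml, so it is not shown in this fragment.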

For more information on Hive behavior with HDFS encryption enabled, see Using HDFS Encryption with Hive.

Flume

Flume supports on-disk encryption for log files written by the Flume file channels. See Configuring Encrypted On-disk File Channels for Flume.