Writing Encrypted Data to Secure S3 Buckets From Altus Jobs

You can write data to secure Amazon S3 buckets from Altus jobs. Altus supports writing data to S3 buckets secured with server-side encryption (SSE), which uses symmetric AES256 encryption. You can encrypt your data with an Amazon S3-managed encryption key (SSE-S3) or with a customer master key generated by the AWS Key Management Service (SSE-KMS). If you use SSE-KMS encryption, you must specify the customer master key to use to encrypt the data.

To encrypt data that you write to the S3 bucket, you must set the following HDFS configuration properties:
Property Description
fs.s3a.server-side-encryption-algorithm The encryption mechanism to use for data written to S3. Set this property to one of the following values:
  • AES256. To encrypt data with SSE-S3 encryption keys, set the encryption algorithm to AES256. SSE-S3 supports only the AES256 encryption algorithm.
  • SSE-KMS. To encrypt data with a customer master key, set the encryption algorithm to SSE-KMS. You must specify the customer master key in the fs.s3a.server-side-encryption-key property.
fs.s3a.server-side-encryption-key The customer master key (CMK) to use to encrypt data written to S3. Specify the Amazon Resource Name (ARN) of the customer master key. Required if the encryption algorithm is set to SSE-KMS.

For more information about using SSE-KMS, see How Amazon Simple Storage Service (Amazon S3) Uses AWS KMS.

For more information about encrypting data for S3, see Working with Encrypted S3 Data.
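The rule that ties the two properties together can be sketched as a small pre-submission check. The following Python snippet is illustrative only: the property names and values are the real HDFS keys described above, but the validation helper itself is a hypothetical convenience, not part of Altus or HDFS.

```python
# Illustrative check of the valid server-side encryption property combinations.
# The property names and values come from the HDFS S3A configuration; the
# validate_sse helper is a hypothetical example, not an Altus or HDFS API.
ALGO = "fs.s3a.server-side-encryption-algorithm"
KEY = "fs.s3a.server-side-encryption-key"

def validate_sse(props):
    algo = props.get(ALGO)
    if algo == "AES256":
        # SSE-S3: Amazon S3 manages the key, so no key property is needed.
        return True
    if algo == "SSE-KMS":
        # SSE-KMS: the customer master key ARN is required.
        return KEY in props and bool(props[KEY])
    return False

print(validate_sse({ALGO: "AES256"}))   # SSE-S3 needs no key property
print(validate_sse({ALGO: "SSE-KMS"}))  # invalid: the CMK ARN is missing
```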

The procedure to set the encryption configuration properties varies based on the job type.

To configure Altus jobs to write encrypted data to a secure S3 bucket, complete the steps for the type of job you want to run:
Spark or PySpark job
Include the encryption algorithm as Spark arguments.
MapReduce2 job
Create a job XML configuration file to specify the encryption algorithm. Then pass the configuration file to the MapReduce2 job through the main class of the application.

Configuring a Spark or PySpark Job to Write to a Secure S3 Bucket

When you run a Spark or PySpark job in Altus, you can write data to an Amazon S3 bucket that you have configured with server-side encryption.

When you submit a Spark or PySpark job in Altus, include the encryption properties as Spark arguments.

The following example shows how to set the encryption properties in the Spark arguments parameter of a PySpark job that writes encrypted data to a secure S3 bucket:

altus dataeng submit-jobs \
    --cluster-name=ClusterName \
    --jobs '{
        "name": "Encrypted Word Count",
        "pySparkJob": {
             "mainPy": "s3a://cloudera-altus-data-engineering-samples/pyspark/wordcount/wordcount2.py",
             "sparkArguments" : "--executor-memory 1G --num-executors 2
                                 --conf spark.hadoop.fs.s3a.server-side-encryption-algorithm=SSE-KMS 
                                 --conf spark.hadoop.fs.s3a.server-side-encryption-key=CustomerMasterKeyARN",
             "applicationArguments": [
                   "s3a://cloudera-altus-data-engineering-samples/pyspark/wordcount/input/HadoopPoem0.txt",
                   "s3a://NameOfSecureS3Bucket/PathTo/OutputFile"
             ]
        }}'
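Because the encryption settings are ordinary Spark --conf flags, the sparkArguments string can also be assembled programmatically before calling the CLI. A minimal Python sketch, in which the helper name and the example key ARN are hypothetical while the conf keys match the Spark arguments shown above:

```python
import json

def spark_sse_kms_args(cmk_arn, base="--executor-memory 1G --num-executors 2"):
    """Append the SSE-KMS encryption conf flags to a Spark arguments string.

    cmk_arn is the ARN of the customer master key (a placeholder here)."""
    return (
        base
        + " --conf spark.hadoop.fs.s3a.server-side-encryption-algorithm=SSE-KMS"
        + " --conf spark.hadoop.fs.s3a.server-side-encryption-key=" + cmk_arn
    )

job = {
    "name": "Encrypted Word Count",
    "pySparkJob": {
        "mainPy": "s3a://cloudera-altus-data-engineering-samples/pyspark/wordcount/wordcount2.py",
        # Hypothetical example ARN; substitute your own CMK.
        "sparkArguments": spark_sse_kms_args("arn:aws:kms:us-west-2:111122223333:key/example"),
    },
}
# json.dumps produces a single-line JSON value suitable for the --jobs parameter.
print(json.dumps(job))
```

Serializing the job definition with json.dumps avoids quoting mistakes when the JSON is passed on the command line.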

Configuring a MapReduce2 Job to Write to a Secure S3 Bucket

When you run a MapReduce2 job in Altus, you can write data to an Amazon S3 bucket that you have configured with server-side encryption.

To configure the MapReduce2 job to write encrypted data to a secure S3 bucket, include the data encryption properties in a job configuration file. Then load the configuration file in the Java main class of the MapReduce2 application.

For more information about running a MapReduce2 job in Altus and using the job configuration file, see Submitting a MapReduce Job.

The following example configuration file shows how to set the encryption mechanism and the KMS key for a MapReduce2 job:
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>fs.s3a.server-side-encryption-algorithm</name>
    <value>SSE-KMS</value>
  </property>
  <property>
    <name>fs.s3a.server-side-encryption-key</name>
    <value>CustomerMasterKeyARN</value>
  </property>
</configuration>
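If you generate the job configuration file rather than writing it by hand, the structure above maps directly onto a property dictionary. A sketch using only Python's standard library; the build_job_xml function name is illustrative:

```python
import xml.etree.ElementTree as ET

def build_job_xml(props):
    """Render a Hadoop-style configuration XML document from a property dict."""
    conf = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(conf, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return '<?xml version="1.0" encoding="UTF-8"?>' + ET.tostring(conf, encoding="unicode")

xml_doc = build_job_xml({
    "fs.s3a.server-side-encryption-algorithm": "SSE-KMS",
    "fs.s3a.server-side-encryption-key": "CustomerMasterKeyARN",
})
print(xml_doc)
```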

Example: MapReduce2 Job Writing to a Secure S3 Bucket

The following example of a MapReduce2 job is available in the Cloudera Altus S3 bucket of job examples. The job reads a file from an S3 bucket, counts the major words in the text, and writes the result to an encrypted S3 bucket.

To use this example, you must set up an S3 bucket with server-side SSE-S3 encryption to which the MapReduce2 job can write encrypted data. The submit-jobs command includes the jobXml parameter that passes the encryption properties to the MapReduce2 job.

Use the following command to submit the example job in Altus:

altus dataeng submit-jobs \
--cluster-name ClusterName \
--jobs '{ "mr2Job": {
              "mainClass": "com.cloudera.altus.sample.mr2.wordcount.WordCount",
              "jars": ["s3a://cloudera-altus-data-engineering-samples/mr2/wordcount/program/altus-sample-mr2.jar"],
              "arguments": [
                  "s3a://cloudera-altus-data-engineering-samples/mr2/wordcount/input/poetry/",
                  "s3a://NameOfSecureS3Bucket/OutputPath/"],
              "jobXml": 
               "<?xml version=\"1.0\" encoding=\"UTF-8\"?><configuration><property><name>fs.s3a.server-side-encryption-algorithm</name><value>AES256</value></property></configuration>"
         }}'
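Note that the jobXml value is embedded inside the --jobs JSON string, so its double quotes must be escaped, as in the command above. One way to produce a correctly escaped value is to let a JSON serializer do the escaping; a stdlib-only Python sketch:

```python
import json

# The same single-line configuration document used in the jobXml parameter above.
job_xml = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<configuration><property>"
    "<name>fs.s3a.server-side-encryption-algorithm</name>"
    "<value>AES256</value>"
    "</property></configuration>"
)

# json.dumps escapes the embedded double quotes so the XML can be pasted
# into the --jobs JSON without further hand editing.
print(json.dumps({"jobXml": job_xml}))
```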