Securing Connections to Amazon S3

Hadoop's hadoop-aws connector allows certain CDH components, such as Impala and Spark, to integrate with Amazon Web Services (AWS). CDH supports only the third-generation S3A filesystem scheme (s3a://), which improves on the older native S3 filesystem with support for larger files and better performance.
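
For reference, objects stored in S3 are addressed through this connector with URIs of the s3a:// form; the bucket name and path below are placeholders:

s3a://your-bucket/path/to/object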

You can configure your AWS credentials in core-site.xml using the following properties to give CDH services access to S3:

<property>
  <name>fs.s3a.access.key</name>
  <value>your_access_key</value>
</property>

<property>
  <name>fs.s3a.secret.key</name>
  <value>your_secret_key</value>
</property>
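
As a quick sanity check, you can list a bucket with the hadoop fs command once the properties above are in place. The bucket name below is a placeholder for a bucket that your credentials can access:

$ hadoop fs -ls s3a://your-bucket/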

The AWS credentials listed above determine who has read/write access to your AWS data, so do not share them with other cluster users or services, and make sure they do not appear in cleartext in any configuration files, log files, or UIs. One way to protect your AWS credentials is to run Hadoop jobs with a separate set of temporary credentials that expire after a configurable period of time. Even if the access key and secret key persist in Hadoop configuration or log files long after you have accessed S3, the expired credentials cannot be misused.

Using Temporary Credentials to Connect to Amazon S3

AWS provides a Security Token Service (STS) that issues temporary credentials to access AWS services such as S3. The temporary credentials consist of an access key and a secret key (like regular AWS credentials), plus a session token.
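
For example, if you have the AWS CLI installed (not required by CDH; shown here only as one way to call STS), you can request a session token as follows. The returned access key, secret key, and session token are then passed to the job as shown below:

$ aws sts get-session-token --duration-seconds 3600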

To connect to S3 using temporary credentials, provide the credential provider class and session token along with the temporary AWS credentials as command-line arguments when you submit a job:
-Dfs.s3a.access.key=your_temp_access_key
-Dfs.s3a.secret.key=your_temp_secret_key
-Dfs.s3a.session.token=your_session_token_from_AmazonSTS
-Dfs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider
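
For example, a DistCp job that copies data from HDFS to S3 using temporary credentials might look like the following; the source path and bucket name are placeholders:

$ hadoop distcp \
    -Dfs.s3a.access.key=your_temp_access_key \
    -Dfs.s3a.secret.key=your_temp_secret_key \
    -Dfs.s3a.session.token=your_session_token_from_AmazonSTS \
    -Dfs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider \
    hdfs:///user/example/data s3a://your-bucket/data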

Connecting to Amazon S3 Using TLS

The boolean parameter fs.s3a.connection.ssl.enabled in core-site.xml controls whether the hadoop-aws connector uses TLS when communicating with Amazon S3. Because this parameter is set to true by default, you do not need to configure anything to enable TLS. If you set it to false, the connector communicates with Amazon S3 over plaintext HTTP.
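
If you prefer to make the setting explicit in your configuration, you can add a property of the following form to core-site.xml:

<property>
  <name>fs.s3a.connection.ssl.enabled</name>
  <value>true</value>
</property>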

The root Certificate Authority (CA) certificate that signed the Amazon S3 certificate is trusted by default. If you are using custom truststores, make sure that the configured truststore for each service trusts the root CA certificate.

To import the root CA certificate into your custom truststore, run the following command:

$ $JAVA_HOME/bin/keytool -importkeystore -srckeystore $JAVA_HOME/jre/lib/security/cacerts -destkeystore /path/to/custom/truststore -srcalias baltimorecybertrustca

If you do not have the $JAVA_HOME variable set, replace it with the path to the Oracle JDK (for example, /usr/java/jdk1.7.0_67-cloudera/). When prompted, enter the password for the destination and source truststores. The default password for the Oracle JDK cacerts truststore is changeit.
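
To confirm that the import succeeded, you can list the certificate entry in the destination truststore; the truststore path below is a placeholder:

$ $JAVA_HOME/bin/keytool -list -keystore /path/to/custom/truststore -alias baltimorecybertrustca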

The truststore configurations for each service that accesses S3 are as follows:

hadoop-aws Connector

All components that can use Amazon S3 storage rely on the hadoop-aws connector, which uses the built-in Java truststore ($JAVA_HOME/jre/lib/security/cacerts). To override this truststore, create a truststore named jssecacerts in the same directory ($JAVA_HOME/jre/lib/security/jssecacerts) on all cluster nodes. If you are using the jssecacerts truststore, make sure that it includes the root CA certificate that signed the Amazon S3 certificate.
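
One way to create jssecacerts, assuming you want to start from a copy of the default Java truststore, is to copy cacerts into place and confirm that the root CA entry is present (repeat on each cluster node):

$ cp $JAVA_HOME/jre/lib/security/cacerts $JAVA_HOME/jre/lib/security/jssecacerts
$ $JAVA_HOME/bin/keytool -list -keystore $JAVA_HOME/jre/lib/security/jssecacerts -alias baltimorecybertrustca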

Hive/Beeline CLI

The Hive and Beeline command line interfaces (CLI) rely on the HiveServer2 truststore. To view or modify the truststore configuration:

  1. Go to the Hive service in the Cloudera Manager Admin Console.
  2. Select the Configuration tab.
  3. Select Scope > HIVE-1 (Service-Wide).
  4. Select Category > Security.
  5. Locate the HiveServer2 TLS/SSL Certificate Trust Store File and HiveServer2 TLS/SSL Certificate Trust Store Password properties or search for them by typing Trust in the Search box.

Impala Shell

The Impala shell uses the hadoop-aws connector truststore. To override it, create the $JAVA_HOME/jre/lib/security/jssecacerts file, as described in hadoop-aws Connector.

Hue S3 File Browser

For instructions on enabling the S3 File Browser in Hue, see How to Enable S3 Cloud Storage. The S3 File Browser uses TLS if it is enabled and trusts the S3 certificate by default; no additional configuration is necessary.

Impala Query Editor (Hue)

The Impala query editor in Hue uses the hadoop-aws connector truststore. To override it, create the $JAVA_HOME/jre/lib/security/jssecacerts file, as described in hadoop-aws Connector.

Hive Query Editor (Hue)

The Hive query editor in Hue uses the HiveServer2 truststore. For instructions on viewing and modifying the HiveServer2 truststore, see Hive/Beeline CLI.

Enabling Server-Side Encryption for Data At-Rest on Amazon S3

Server-side encryption for Amazon S3 (s3a filesystem) protects data at rest. To enable server-side encryption (a verification example follows these steps):
  1. Go to the Cloudera Manager Admin Console and navigate to the HDFS service.
  2. Click the Configuration tab.
  3. Select Scope > HDFS (Service-Wide).
  4. Select Category > Advanced.
  5. Locate the Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml property and add the following property:
    <property>
      <name>fs.s3a.server-side-encryption-algorithm</name>
      <value>AES256</value>
      <description>Specify a server-side encryption algorithm for S3A.
      The default is NULL, and the only other currently allowable value is AES256.
      </description>
    </property>
  6. Click Save Changes to commit the changes.
  7. Restart the HDFS service.
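
After the HDFS service restarts, you can optionally verify that newly written objects are encrypted. The following sketch assumes the AWS CLI is installed and uses placeholder file, bucket, and key names; the head-object output should report ServerSideEncryption as AES256:

$ hadoop fs -put localfile.txt s3a://your-bucket/encrypted/localfile.txt
$ aws s3api head-object --bucket your-bucket --key encrypted/localfile.txt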