Using S3 Credentials with YARN, MapReduce, or Spark

This topic describes how applications that use YARN, MapReduce, or Spark can access data stored in Amazon S3.

You can also copy data using the Hadoop distcp command. See Using DistCp with Amazon S3.

Continue reading:

Referencing Credentials for Clients Using the Amazon S3 Service
Referencing Amazon S3 in URIs

Referencing Credentials for Clients Using the Amazon S3 Service

If you have selected IAM authentication, no additional steps are needed. Otherwise, use one of the following three options to provide Amazon S3 credentials to clients:

Programmatic

Specify the credentials in the configuration for the job. This option is most useful for Spark jobs.
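
As a rough illustration of the programmatic option, the sketch below builds spark-submit arguments that carry the two S3A properties used elsewhere in this topic. It relies on Spark's convention of copying any property prefixed with spark.hadoop. into the job's Hadoop configuration. The function name s3a_submit_args, the environment variable names, and the application file my_job.py are hypothetical choices made for this example.

```python
import os


def s3a_submit_args(access_key, secret_key):
    """Return spark-submit --conf arguments carrying S3A credentials.

    Assumption: Spark copies properties prefixed with "spark.hadoop."
    into the Hadoop configuration of the job.
    """
    props = {
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
    }
    args = []
    for name, value in props.items():
        args += ["--conf", f"{name}={value}"]
    return args


if __name__ == "__main__":
    # Hypothetical environment variable names; read the keys from wherever
    # your environment stores them rather than hard-coding them.
    access = os.environ.get("AWS_ACCESS_KEY_ID", "")
    secret = os.environ.get("AWS_SECRET_ACCESS_KEY", "")
    command = ["spark-submit", *s3a_submit_args(access, secret), "my_job.py"]
    # Print only the property names so the secret value is not echoed.
    print([arg.split("=")[0] for arg in command])
```

Values passed on a command line can be visible to other users in process listings, so this approach may not suit shared hosts.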

Make a modified copy of the configuration files

Make a copy of the configuration files and add the S3 credentials:

For YARN and MapReduce jobs, copy the contents of the /etc/hadoop/conf directory to a local directory under your home directory on the host from which you will submit the job. For Spark jobs, copy the contents of /etc/spark/conf instead.
Set permissions on the copied files that are appropriate for your environment, so that unauthorized users cannot read the credentials and other sensitive settings they contain.

Add the following to the core-site.xml file within the <configuration> element:

<property>
    <name>fs.s3a.access.key</name>
    <value>Amazon S3 Access Key</value>
</property>

<property>
    <name>fs.s3a.secret.key</name>
    <value>Amazon S3 Secret Key</value>
</property>

Reference these versions of the configuration files when submitting jobs by setting the applicable environment variable:
- YARN or MapReduce:
```
export HADOOP_CONF_DIR=path_to_local_configuration_directory
```
- Spark:
```
export SPARK_CONF_DIR=path_to_local_configuration_directory
```
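
The steps of this option can be sketched in Python as follows. This is a minimal illustration, not a supported tool: the function name copy_conf_with_credentials is hypothetical, owner-only permissions are just one reading of "appropriate for your environment", and the destination directory is assumed not to exist yet. Comments and processing instructions in the original core-site.xml are not preserved by this approach.

```python
import os
import shutil
import stat
import xml.etree.ElementTree as ET


def copy_conf_with_credentials(src_dir, dest_dir, access_key, secret_key):
    """Copy src_dir to dest_dir and add S3A credentials to core-site.xml."""
    # copytree raises an error if dest_dir already exists.
    shutil.copytree(src_dir, dest_dir)

    # Restrict the copy to its owner (an assumption about suitable permissions).
    os.chmod(dest_dir, stat.S_IRWXU)
    for name in os.listdir(dest_dir):
        path = os.path.join(dest_dir, name)
        if os.path.isfile(path):
            os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)

    core_site = os.path.join(dest_dir, "core-site.xml")
    tree = ET.parse(core_site)
    root = tree.getroot()  # the <configuration> element
    for prop_name, prop_value in (
        ("fs.s3a.access.key", access_key),
        ("fs.s3a.secret.key", secret_key),
    ):
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = prop_name
        ET.SubElement(prop, "value").text = prop_value
    tree.write(core_site, encoding="utf-8", xml_declaration=True)
    return core_site
```

After running it, point HADOOP_CONF_DIR or SPARK_CONF_DIR at the destination directory.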

Reference the managed configuration files and add AWS credentials

With this option, you continue to use the configuration files managed by Cloudera Manager. Your local copies include the managed files by reference, so values from newly deployed configuration files are picked up automatically, while your local core-site.xml adds the Amazon S3 credentials:

Create a local directory under your home directory.
Copy the configuration files from /etc/hadoop/conf to the new directory.
Set permissions on the configuration files that are appropriate for your environment, so that unauthorized users cannot read the credentials you add.
Edit each configuration file:
1. Remove all elements within the <configuration> element.
2. Add an XML <include> element within the <configuration> element that references the corresponding configuration file managed by Cloudera Manager. For example, in hdfs-site.xml:
```
<include xmlns="http://www.w3.org/2001/XInclude"
         href="/etc/hadoop/conf/hdfs-site.xml">
   <fallback />
</include>
```

Add the following to the core-site.xml file within the <configuration> element:

<property>
    <name>fs.s3a.access.key</name>
    <value>Amazon S3 Access Key</value>
</property>

<property>
    <name>fs.s3a.secret.key</name>
    <value>Amazon S3 Secret Key</value>
</property>

Reference these versions of the configuration files when submitting jobs by setting the applicable environment variable:
- YARN or MapReduce:
```
export HADOOP_CONF_DIR=path_to_local_configuration_directory
```
- Spark:
```
export SPARK_CONF_DIR=path_to_local_configuration_directory
```

Example core-site.xml file:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <include xmlns="http://www.w3.org/2001/XInclude"
    href="/etc/hadoop/conf/core-site.xml">
    <fallback />
  </include>

  <property>
    <name>fs.s3a.access.key</name>
    <value>Amazon S3 Access Key</value>
  </property>

  <property>
    <name>fs.s3a.secret.key</name>
    <value>Amazon S3 Secret Key</value>
  </property>
</configuration>
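
Wrapper files like the one above could also be generated rather than edited by hand. The sketch below is one possible way to do that, under stated assumptions: the function names wrapper_xml and write_wrappers are hypothetical, every .xml file in the managed directory is assumed to be a Hadoop configuration file, and other files (such as environment scripts) are skipped and would need to be copied separately.

```python
import os
from xml.sax.saxutils import escape, quoteattr

XINCLUDE_NS = "http://www.w3.org/2001/XInclude"


def wrapper_xml(managed_path, properties=None):
    """Return a configuration file that includes managed_path by reference
    and then appends any extra properties."""
    lines = [
        '<?xml version="1.0"?>',
        "<configuration>",
        f'  <include xmlns="{XINCLUDE_NS}"',
        f"    href={quoteattr(managed_path)}>",
        "    <fallback />",
        "  </include>",
    ]
    for name, value in (properties or {}).items():
        lines += [
            "  <property>",
            f"    <name>{escape(name)}</name>",
            f"    <value>{escape(value)}</value>",
            "  </property>",
        ]
    lines.append("</configuration>")
    return "\n".join(lines) + "\n"


def write_wrappers(managed_dir, local_dir, access_key, secret_key):
    """Write one wrapper per .xml file; only core-site.xml gets credentials."""
    managed_dir = os.path.abspath(managed_dir)
    os.makedirs(local_dir, mode=0o700, exist_ok=True)
    for name in sorted(os.listdir(managed_dir)):
        if not name.endswith(".xml"):
            continue  # assumption: non-XML files are handled separately
        extra = None
        if name == "core-site.xml":
            extra = {
                "fs.s3a.access.key": access_key,
                "fs.s3a.secret.key": secret_key,
            }
        target = os.path.join(local_dir, name)
        # Create each file readable and writable by its owner only.
        fd = os.open(target, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(wrapper_xml(os.path.join(managed_dir, name), extra))
```

A call such as write_wrappers("/etc/hadoop/conf", local_directory, access_key, secret_key) would produce the local directory to reference in HADOOP_CONF_DIR.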

Referencing Amazon S3 in URIs

Even after you add the Amazon S3 service, a URI that does not specify a protocol still refers to the local HDFS, not to Amazon S3. Use one of the following forms to construct the URIs that you reference when submitting jobs:

Amazon S3:
```
s3a://bucket_name/path
```
HDFS:
```
hdfs:///path
```
or
```
/path
```
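
To make the rule concrete, the following sketch classifies a job path by its URI scheme and builds an S3A URI from a bucket name and key. It is only an illustration of the URI forms listed above; the function names target_filesystem and s3a_uri are hypothetical, and any scheme other than s3a or hdfs is simply reported as "other".

```python
from urllib.parse import urlparse


def target_filesystem(uri):
    """Report which file system a job path refers to, based on its scheme."""
    scheme = urlparse(uri).scheme
    if scheme == "s3a":
        return "s3"
    if scheme in ("", "hdfs"):
        # No scheme means the default file system, which is HDFS here.
        return "hdfs"
    return "other"


def s3a_uri(bucket_name, path):
    """Build an s3a:// URI from a bucket name and an object path."""
    return f"s3a://{bucket_name}/{path.lstrip('/')}"


if __name__ == "__main__":
    for candidate in (s3a_uri("bucket_name", "/path"), "/path"):
        print(candidate, target_filesystem(candidate))
```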

For more information about using Impala, Hive, and Spark on S3, see:

Configuring the Amazon S3 Connector

Using Fast Upload with Amazon S3