How to Enable S3 Cloud Storage

In CDH 5.9.0 (Hue 3.11), Hue adds support for Amazon S3 in its file browser, metastore, and editor interfaces. This page explains how to configure Hue with S3 and use it across the product.

Connect Hue to S3 Account

This section assumes that you have an Amazon S3 account. The following steps connect Hue to that account.

  1. If your S3 buckets use TLS and you are using custom truststores, see Connecting to Amazon S3 Using TLS for information about configuring Hue, Hive, and Impala to access S3 over TLS.
  2. Log on to Cloudera Manager and select Clusters > <your cluster name>.
  3. Select Configuration > Advanced Configuration Snippets.
  4. Filter by Scope > Hue.
  5. Set your S3 credentials in Hue Service Advanced Configuration Snippet (Safety Valve) for hue_safety_valve.ini (a minimal example credential script is sketched after these steps):
    [aws]
    [[aws_accounts]]
    [[[default]]]
    access_key_id_script=</path/to/access_key_script>
    secret_access_key_script=</path/to/secret_key_script>
    #security_token=<your AWS security token>
    allow_environment_credentials=false
    region=<your region, such as us-east-1>
    For a proof-of-concept installation, you can omit the script properties and set the ID and key directly:
    access_key_id=<your_access_key_id>
    secret_access_key=<your_secret_access_key>
  6. Clear the scope filters and enter core-site.xml in the search box.
  7. To enable the S3 Browser, set your S3 credentials in Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml:
    <property>
      <name>fs.s3a.awsAccessKeyId</name>
      <value>AWS access key ID</value>
    </property>

    <property>
      <name>fs.s3a.awsSecretAccessKey</name>
      <value>AWS secret key</value>
    </property>
  8. To enable Hive with S3, set the same S3 credentials in Hive Service Advanced Configuration Snippet (Safety Valve) for core-site.xml.
  9. Click Save Changes.
  10. Restart Hue: select Clusters > Hue and Actions > Restart.
  11. Restart Hive: select Clusters > Hive and Actions > Restart.
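
The access_key_id_script and secret_access_key_script properties reference executables that print the corresponding credential to standard output, which keeps keys out of the Hue configuration itself. A minimal sketch of an access key script (the placeholder value is illustrative; substitute a lookup against your own secret store):

    #!/bin/bash
    # Hue runs this executable and reads the credential from stdout.
    # Replace the placeholder with a lookup against your secret store.
    echo "<your_access_key_id>"

Make the script executable and readable only by the user that runs Hue.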

Populate S3 Bucket

In this section, we populate an S3 bucket with nested keys (bucket > directory > file) and add a CSV file of earthquake data from the USGS.

  1. Download 30 days of earthquake data (all_month.csv) from the USGS (~2 MB).
  2. In Cloudera Manager, click Hue > Web UI and log on to Hue.
  3. Select File Browser > S3 Browser.
  4. Click New > Bucket, name it "quakes_<any unique id>" (bucket names must be globally unique), and click Create.
  5. Navigate into the bucket by clicking the bucket name.
  6. Click New > Directory, name it "input" and click Create.
  7. Navigate into the directory by clicking the directory name.
  8. Click Upload and select, or drag, all_month.csv. The path is s3a://quakes/input/all_month.csv (this page uses the bucket name "quakes" in paths; substitute your own). A command-line alternative is sketched after these steps.
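
To stage the data from a terminal instead, the same layout can be created with the AWS CLI, assuming it is installed and configured with the same credentials:

    # Create the bucket, then upload the CSV under the input/ prefix.
    aws s3 mb s3://quakes
    aws s3 cp all_month.csv s3://quakes/input/all_month.csv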

Create Table with S3 File

  1. Go to the Metastore Manager by clicking Data Browsers > Metastore Tables.
  2. Click Create a new table from a file.
  3. Enter a Table Name such as "earthquakes".
  4. Browse for the Input Directory, s3a://quakes/input/, and click Select this folder.
  5. Select Create External Table from the Load Data menu and click Next.
  6. Delimit by Comma (,) and click Next.
  7. Click Create Table.
  8. Click Browse Data to automatically generate a SELECT query in the Hive editor:
    SELECT * FROM `default`.`earthquakes` LIMIT 10000;
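
For reference, the wizard defines a Hive external table over the S3 directory, roughly equivalent to the HiveQL sketched below. The column list is abbreviated to the first few fields of the USGS CSV, and the skip.header.line.count property is an assumption about how the header row is handled:

    CREATE EXTERNAL TABLE `default`.`earthquakes` (
      `time` STRING,
      latitude DOUBLE,
      longitude DOUBLE,
      depth DOUBLE,
      mag DOUBLE,
      magtype STRING
      -- ...remaining columns from the CSV header...
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3a://quakes/input/'
    TBLPROPERTIES ('skip.header.line.count'='1');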

Export Query Results to S3

  1. Run and Export Results in Hive
    1. Run the query by clicking the Execute button.
    2. Click the Get Results button.
    3. Select Export to open the Save query result dialog.

  2. Save Results as Custom File
    1. Select In store (max 10000000 cells) and open the Path to CSV file dialog.
    2. Select S3A and navigate into the bucket, s3a://quakes.
    3. Create a folder named "output": click Create folder, enter the name, and click Create folder again.
    4. Navigate into the output directory and click Select this folder.
    5. Append a file name to the path, such as quakes.csv.
    6. Click Save. The results are saved as s3a://quakes/output/quakes.csv.

  3. Save Results as MapReduce files
    1. Select In store (large result) and open the Path to CSV file dialog.
    2. Select S3A and navigate into the bucket, s3a://quakes.
    3. If you have not already done so, create a folder named "output".
    4. Navigate into the output directory and click Select this folder.
    5. Click Save. A MapReduce job is run and results are stored in s3a://quakes/output/.

  4. Save Results as Table
    1. Run a query for "moment" earthquakes:
      SELECT time,
             latitude,
             longitude,
             mag
      FROM `default`.`earthquakes`
      WHERE magtype IN ('mw','mwb','mwc','mwr','mww');
    2. Select A new table and enter <database>.<new table name>.
    3. Click Save.
    4. Click Browse Data to view the new table.
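
Saving results as a table is roughly equivalent to a CREATE TABLE AS SELECT (CTAS) statement; the sketch below uses a hypothetical target name, moment_quakes:

      -- Illustrative CTAS; the target table name is an assumption.
      CREATE TABLE `default`.`moment_quakes` AS
      SELECT `time`, latitude, longitude, mag
      FROM `default`.`earthquakes`
      WHERE magtype IN ('mw','mwb','mwc','mwr','mww');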

Troubleshoot Errors

This section addresses some error messages you may encounter when attempting to use Hue with S3.
  • Failed to access path
    Failed to access path: "s3a://quakes". Check that you have access to read this bucket and that the region is correct.
    Possible solution: Check your bucket region:
    1. Log on to your AWS account and navigate to the S3 service.
    2. Select your bucket, for example "quakes", and click Properties.
    3. Find your region. If it says US Standard, then region=us-east-1.
    4. Update your configuration in Hue Service Advanced Configuration Snippet (Safety Valve) for hue_safety_valve.ini.
    5. Save your changes and restart Hue.
  • The table could not be created
    The table could not be created. Error while compiling statement: FAILED: SemanticException com.cloudera.com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain.
    Possible solution: Set your S3 credentials in Hive core-site.xml:
    1. In Cloudera Manager, go to Hive > Configuration.
    2. Filter by Category > Advanced.
    3. Set your credentials in Hive Service Advanced Configuration Snippet (Safety Valve) for core-site.xml.
      1. Click the + icon and enter the Name and Value for fs.s3a.awsAccessKeyId.
      2. Click the + icon and enter the Name and Value for fs.s3a.awsSecretAccessKey.
    4. Save your changes and restart Hive.
  • The target path is a directory
    Possible solution: Remove any directories or files that may have been added to s3a://quakes/input/ (so that all_month.csv is alone).
  • Bad status for request TFetchResultsReq … Not a file
    Bad status for request TFetchResultsReq(...): TFetchResultsResp(status=TStatus(errorCode=0, errorMessage='java.io.IOException: java.io.IOException: Not a file: s3a://quakes/input/output' ...
    Possible solution: Remove any directories or files that may have been added to s3a://quakes/input/ (so that all_month.csv is alone). Here, Hive cannot query the earthquakes table (based on all_month.csv) because of the extra directory, s3a://quakes/input/output.
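
For the path-related errors above, it can also help to list the input directory from a shell and confirm that only all_month.csv is present. A quick check with the Hadoop client, assuming it is configured with the same S3A credentials:

    # List the input prefix; only all_month.csv should appear.
    hadoop fs -ls s3a://quakes/input/
    # If a stray item such as output/ shows up, remove it:
    hadoop fs -rm -r s3a://quakes/input/output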