Cloudera Navigator and S3

As cloud storage becomes increasingly prevalent, many clusters use Amazon Simple Storage Service (S3) for long-running, persistent storage.

Starting with Cloudera Navigator release 2.9.0, you can use Navigator to view technical metadata, assign business metadata, and view lineage for S3 objects on your cluster. This topic provides an overview of S3 metadata entities in Navigator, describes how to configure Navigator to use S3 data, and describes unique aspects and limitations of working with S3 data in Navigator.

Continue reading:

S3 Metadata Entities in Navigator
Enabling Cloudera Navigator Access to Amazon S3
Eventual Consistency
S3 Event Notification
Setting API Limits
Limitations of Navigator for S3

S3 Metadata Entities in Navigator

Amazon S3 has a flat structure, without the hierarchy found in typical filesystems. S3 entities include buckets and objects. The bucket is the container for the object.

In Cloudera Navigator, S3 entities include the following:

S3 Bucket
Directory - Although S3 entities are limited to buckets and objects in those buckets, S3 supports the concept of a folder that can be used to organize objects. Folders in S3 are extracted as directories in Navigator.
File

Implicit Folders

Navigator creates implicit S3 folders to mimic the behavior of a file system. For example, for an object with key real_estate/sales/pending, Navigator creates a file with the path real_estate/sales/pending, and also creates two directories: real_estate and real_estate/sales.

As a second example, in the S3 bucket implicit-folder-test, if you create the folder structure /implicit/implicit2 and add the file explicit to the implicit2 folder, Cloudera Navigator shows the extracted object with its directory/file combination labeled as a Path.
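The implicit-folder rule can be sketched in a few lines of Python. This is only an illustration of the behavior described above, not Navigator's actual extraction code; the function name implicit_directories is hypothetical, and skipping empty key segments is an assumption.

```python
def implicit_directories(key: str) -> list:
    """Return the directory paths implied by an S3 object key."""
    # Assumption: empty segments (for example, from a leading "/") are skipped.
    parts = [p for p in key.split("/") if p]
    # Every proper prefix of the key's segments is an implied directory;
    # the full key itself is the file.
    return ["/".join(parts[:i]) for i in range(1, len(parts))]


print(implicit_directories("real_estate/sales/pending"))
# prints ['real_estate', 'real_estate/sales']
```

A key with no slash, such as explicit, yields an empty list because it implies no directories.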

You work with S3 entities in Navigator much as you would with entities for HDFS. For information on S3 entity properties, see S3 Properties.

For more information about Amazon S3, see the Amazon S3 documentation.

Enabling Cloudera Navigator Access to Amazon S3

To configure Navigator for S3, you must configure AWS credentials for Cloudera Manager and enable Cloudera Navigator to access data written to S3 buckets.

You configure AWS Credentials to specify the Access Key Authentication type.

This type of authentication requires an AWS Access Key and an AWS Secret Key that you obtain from Amazon. For more information about setting up keys in AWS, see Creating an IAM User in Your AWS Account in the AWS Identity and Access Management documentation. Cloudera Manager stores these values securely and does not store them in world-readable locations. The credentials are masked in the Cloudera Manager Admin Console, encrypted in the configurations passed to processes managed by Cloudera Manager, and redacted from the logs.

Minimum Required Role: User Administrator (also provided by Full Administrator)

To enable Cloudera Navigator access to Amazon S3, you must add AWS Credentials for Amazon S3 and then enable Navigator access. If you have already added AWS credentials, skip to step 5, "Click the Enable for Cloudera Navigator link":

1. Open the Cloudera Manager Admin Console.
2. Click Administration > AWS Credentials.
3. Click Add and select Access Key Authentication. This authentication mechanism requires you to obtain AWS credentials from Amazon.
   a. Enter a Name for this account. The name can contain alphanumeric characters, hyphens, underscores, and spaces.
   b. Enter the AWS Access Key ID.
   c. Enter the AWS Secret Key.
   Important: Although AWS offers two types of authentication (IAM Role-based Authentication and Access Key Authentication), you must specify Access Key Authentication for Cloudera Navigator. IAM Role-based Authentication is not supported.
4. Click Add. The Connect to Amazon Web Services screen displays.
5. Click the Enable for Cloudera Navigator link.
6. Restart the Cloudera Navigator Metadata server to enable access.

Extracted S3 information should now be available in Cloudera Navigator.

Eventual Consistency

Amazon S3 uses an eventual consistency model. To provide high availability, consistency is only informally guaranteed: assuming no new updates are made for a period of time, all accesses of an item eventually return its last updated value.

Because S3 uses eventual consistency, it might take some time for an S3 object to appear in Navigator, and you might notice discrepancies between the object in Navigator and the object in S3. If you do not immediately see an S3 object that you created, or do not see modifications that you made, that does not mean the object does not exist or was not successfully edited. In most cases, the lag associated with eventual consistency is the reason the object is missing from Navigator or does not match the most recent version in S3.
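If a script depends on a newly written object being visible in Navigator, one way to tolerate this lag is to poll with increasing delays instead of treating the first miss as a failure. The following is a minimal sketch of that pattern. The function wait_until_visible and its lookup callback are hypothetical; lookup stands in for whatever query you use to search for the entity and is assumed to return None until the entity appears.

```python
import time
from typing import Callable, Optional


def wait_until_visible(
    lookup: Callable[[], Optional[dict]],
    attempts: int = 5,
    initial_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Call lookup() until it returns a result or the attempts run out."""
    delay = initial_delay
    for attempt in range(attempts):
        entity = lookup()
        if entity is not None:
            return entity
        if attempt < attempts - 1:
            sleep(delay)
            delay *= 2  # back off: 1s, 2s, 4s, ...
    return None


# Demonstration with a fake lookup that succeeds on the third call.
responses = iter([None, None, {"name": "pending"}])
found = wait_until_visible(lambda: next(responses), sleep=lambda s: None)
```

The sleep parameter is injectable only so that the demonstration runs without real delays.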

S3 Event Notification

In Amazon S3, you can configure a bucket to send notification messages whenever certain events occur. Cloudera Navigator uses Amazon Simple Queue Service (Amazon SQS) to extract S3 information.

Amazon SQS is a distributed, highly scalable hosted queue for storing messages. Navigator pulls data from the SQS queue. For more information about Amazon SQS, see Getting Started with Amazon SQS.

By default, Navigator sets up these queues and configures S3 event notification for each bucket for you. Navigator does not overwrite existing S3 event notifications, so if any buckets already have S3 event notifications configured, you must use "bring your own queue" together with an Amazon SNS "fanout". In a "fanout" scenario, an Amazon SNS message is sent to a topic and then replicated and pushed to multiple Amazon SQS queues, HTTP endpoints, or email addresses.
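The fanout pattern can be illustrated with a small in-memory model. This sketch does not call AWS; the Topic class and the two queue names are hypothetical stand-ins for an SNS topic, the queue of a pre-existing consumer, and the queue that Navigator reads.

```python
from collections import deque


class Topic:
    """In-memory stand-in for an SNS topic (illustration only)."""

    def __init__(self) -> None:
        self.subscribers = []

    def subscribe(self, queue: deque) -> None:
        self.subscribers.append(queue)

    def publish(self, message: dict) -> None:
        # Fanout: every subscribed queue receives its own copy.
        for queue in self.subscribers:
            queue.append(dict(message))


existing_consumer_queue = deque()  # hypothetical: the pre-existing consumer
navigator_queue = deque()          # hypothetical: the queue Navigator polls

topic = Topic()
topic.subscribe(existing_consumer_queue)
topic.subscribe(navigator_queue)
topic.publish({"event": "ObjectCreated", "key": "real_estate/sales/pending"})
```

Because each subscriber gets its own copy, the existing consumer and Navigator can each read and delete messages without affecting the other.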

For more information about configuring Navigator data extraction, including "bring your own queue", see S3 Data Extraction for Navigator.

Setting API Limits

You can set an API limit for the Amazon S3 API and the Amazon SQS API. Amazon bills you monthly based on your usage, and the billing cycle resets each month. By setting these API limits, you can manage the monthly cost of using the APIs.

To set API limits, add the following in the Navigator Metadata Server Advanced Configuration Snippet (Safety Valve) for cloudera-navigator.properties in Cloudera Manager:

nav.aws.api.limit=any_int

Once your API limit is reached, Navigator suspends extraction until the next 30-day interval begins. At that point, Navigator extracts any data that was not extracted while activity was suspended.

Cloudera Navigator does not indicate if your use of the API exceeds the monthly limit; monitor your monthly use of the APIs to manage your costs.
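The suspend-and-resume behavior can be modeled as a call budget that resets every 30 days. The sketch below is not Navigator's implementation; ApiBudget and try_call are hypothetical names, and how Navigator anchors the start of each interval is not stated here, so the start date is an assumption.

```python
from datetime import datetime, timedelta


class ApiBudget:
    """Counts API calls against a limit that resets every 30 days."""

    def __init__(self, limit: int, start: datetime) -> None:
        self.limit = limit
        self.window_start = start
        self.used = 0

    def try_call(self, now: datetime) -> bool:
        """Return True if a call is allowed at `now`, False if suspended."""
        # Roll the window forward once 30 days have elapsed.
        while now - self.window_start >= timedelta(days=30):
            self.window_start += timedelta(days=30)
            self.used = 0
        if self.used >= self.limit:
            return False  # limit reached: suspend until the next interval
        self.used += 1
        return True


start = datetime(2024, 1, 1)
budget = ApiBudget(limit=2, start=start)
results = [budget.try_call(start + timedelta(days=d)) for d in (0, 1, 2, 30)]
print(results)  # prints [True, True, False, True]
```

In the demonstration, the third call is refused because the limit of 2 is exhausted, and the call on day 30 succeeds because a new interval has begun.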

Limitations of Navigator for S3

S3 entities in Cloudera Navigator work in much the same way that HDFS entities do. However, the current release of Navigator has some limitations related to S3:

Only one instance of Navigator can be configured per S3 account, and Navigator can use only one AWS credential.
IAM role-based authentication is not supported.
Extraction limitations:
- Navigator extracts only user-defined metadata in S3. System-defined metadata types are not extracted.
- Navigator does not extract tags for S3 buckets and objects.
- Navigator extracts only the latest versions in S3; it does not extract historical versions.
- AWS supports unnamed directories, but Navigator does not extract them.
Auditing is not available for S3.
Lineage is supported for Hive, Impala, and MapReduce on S3. Other types are not supported.
MapReduce glob paths are not supported.
S3 object removal with Object Lifecycle Management is not supported.
For example, if you set a lifecycle rule to automatically remove any objects older than 10 days, that delete event is not tracked by SQS and therefore is not tracked by Navigator.

To use lifecycle rules with Navigator extraction, you must use only bulk extraction and not incremental extraction.
