Configuring Extraction for Amazon S3

Depending on the specifics of the Amazon S3 bucket targeted for extraction by Cloudera Navigator, the configuration process follows one of two alternative paths:

Default configuration—The default configuration is available for Amazon S3 buckets that have no existing Amazon SQS or Amazon SNS services configured. During configuration, Cloudera Navigator accesses the configured AWS account, performs an initial bulk extract from the Amazon S3 bucket, sets up Amazon SQS queues in each region with buckets, and sets up event notifications for each bucket for subsequent incremental extracts—all transparently to the Cloudera Manager administrator handling the configuration process.
Custom configuration—Custom configuration is required for any Amazon S3 bucket that is currently using Amazon SQS (has queues set up for other applications, for example) or is setup for notifications using Amazons SNS. In these cases you must manually configure a new queue—bring your own queue—and in some cases, additionally configure Amazon SNS for fanout.

Continue reading:

AWS Credentials Requirements

Cloudera Manager can have multiple AWS credentials configured for various purposes at any given time. These are listed by name on the AWS Credentials page which is accessible from the Cloudera Manager Admin Console, under the Administration menu. However, there are specific constraints on AWS credentials for Cloudera Navigator as follows:

Navigator supports a single key for authentication; only one AWS credential can be used for a given Navigator instance. Navigator can extract metadata from any number of S3 buckets, assuming the buckets can be accessed with the configured credential.
An AWS credential configured for connectivity from one Cloudera Navigator instance cannot be used by another Cloudera Navigator instance. Configuring the same AWS credentials for use with different Cloudera Navigator instances can result in unpredictable behavior.
Cloudera Navigator requires an AWS credential associated with an IAM user identity rather than an IAM role.
Any changes to the AWS credentials (for example, if you rotate credentials on a regular basis) must be for the same AWS account (IAM user). Changing the AWS credentials to those of a different IAM user results in errors from the Amazon Simple Queue Service (used transparently by Cloudera Navigator). If a new key is provided to Cloudera Navigator, the key must belong to the same AWS account as the prior key.
For the default configuration, the account for this AWS credential must have administrator privileges for:
- Amazon S3
- Amazon Simple Queue Service (SQS)
For the custom configurations, the account needs privileges for Amazon S3, Amazon SQS, and for Amazon Simple Notification Service (SNS).

Default Configuration

The steps below assume that you have the required AWS credentials for the IAM user with the Amazon S3 bucket. Amazon Web Services (AWS) account (an IAM user account) and that you can use the AWS Management Console . The AWS credentials for the IAM user are configured for Cloudera Navigator using the Cloudera Manager Admin Console during the configuration process below.

Configuring Cloudera Navigator for Amazon S3

At the end of the configuration process detailed below, Cloudera Navigator authenticates to AWS using the credentials and performs an initial bulk extract of entities from the Amazon S3 bucket. It also sets up the necessary Amazon SQS queue (or queues, one for each region) to use for subsequent incremental extracts.

The steps below assume you have the required AWS credentials available.

Log in to the Cloudera Manager Admin Console.
Click Administration > AWS Credentials. The AWS Credentials page displays, listing any existing credentials that have been setup for the Cloudera Manager cluster.
Click the Add Access Key Credentials button. In the Add Access Key Credentials page:
1. Enter a Name for the credentials. The name can contain alphanumeric characters and can include hyphens, underscores, and spaces but should be meaningful in the context of your production environment. Use the name of the Amazon S3 bucket or its functionality, for example, cust-data-raw or post-proc-results.
2. Enter the AWS Access Key ID.
3. Enter the AWS Secret Key.
Click Add. The Edit S3Guard:aws-cred-name page displays, giving you the option to enable S3Guard for the S3 bucket.
- Click the Enable S3Guard box only if the AWS credential has privileges on Amazon DynamoDB and if the preliminary S3Guard configuration is complete. See Cloudera Administration for details about Configuring and Managing S3Guard.
Click Save.
Note: Cloudera Manager stores the AWS credential securely in a non-world readable location. The access key ID and secret values are masked in the Cloudera Manager Admin Console, encrypted before being passed to other processes, and redacted in the logfiles.
The Connect to Amazon Web Services page displays the credential name and services available for its use:
Click the Enable S3 Metadata Extraction link in the Cloudera Navigator section of the page. A Confirm prompt displays, notifying you that Cloudera Navigator must be manually restarted after this change.
Click OK to enable the connection.
At the top of the Cloudera Manager Admin Console, click the Stale Configuration restart button when you are ready to restart Cloudera Navigator.

Metadata and lineage for Amazon S3 buckets will be available in the Cloudera Navigator console along with other sources, such as HDFS, Hive, and so on. It may take several minutes to complete the initial extraction depending on the number of objects stored on the Amazon S3 bucket.

Custom Configurations

Follow these steps for Amazon S3 buckets that are already configured with queues or event notifications. Custom configurations include configuring your own queue (BYOQ) and BYOQ with Fan-out, as detailed below.

Configuring Your Own Queues

Sometimes referred to as Bring Your Own Queue (BYOQ), configuring your own queue is required if the Amazon S3 bucket being targeted for extraction by Cloudera Navigator already has existing queues or is configured for notifications. The process involves stopping Cloudera Navigator and then using the AWS Management Console for the following tasks:

Creating and configuring an Amazon Simple Queue Service (SQS) queue for Cloudera Navigator for each region in which the AWS (IAM user) account has Amazon S3 buckets.
Configuring Amazon Simple Notification Service (SNS) on each bucket to send Create, Rename, Update, Delete (CRUD) events to the Cloudera Navigator queue.
Configuring the bucket for Notification Fan-Out if needed to support existing notifications configured for other applications.
Adding a Policy for the appropriate extraction process (Bulk + Incremental, Bulk Only) to the IAM user account.
Adding the Policy for event notifications to the IAM user account.

Configure the Queue for Cloudera Navigator

This manual configuration process requires stopping Cloudera Navigator. You must create a queue for each region that has S3 buckets.

Log in to Cloudera Manager Admin Console and stop Cloudera Navigator:
- Select Clusters > Cloudera Management Service
- Click the Instances tab.
- Click the checkbox next to Navigator Audit Server and Navigator Metadata Server in the Role Type list to select these roles.
- From the Actions for Selected (2) menu button, select Stop
Log in to the AWS Management Console with AWS account (IAM user) and open the Simple Queue Service setup page (select Services > Messaging > Simple Queue Service. Click Create New Queue or Get Started Now if region has no configured queues.)
For each region that has Amazon S3 buckets, create a queue as follows:
1. Click the Create New Queue button. Enter a Queue Name, click the Standard Queue (not FIFO), and then click Configure Queue. Configure the queue using the following settings:
  
  Default Visibility Timeout 10 minutes
  
  Message Retention Period 14 days
  
  Delivery Delay 0 seconds
  
  Receive Message Wait Time 0 seconds
2. Select the queue you created, click the Permissions tab, click Add a Permission, and configure the following in the Add a Permision to... dialog box:
  
  Effect Allow
  
  Principal Everybody
  
  Actions SendMessage
3. Click the Add Conditions (optional) link open the condition fields and enter the following values:
  
  Qualifier None
  
  Condition ArnLike
  
  Key aws:SourceArn
  
  Value arn:aws:s3::*:*
4. Click Add Condition to save the settings.
5. Click Add Permission to save all settings for the queue.

Repeat this process for each region that has Amazon S3 buckets.

Configure Event Notification for the Queues

After creating queues for all regions with Amazon S3 buckets, you must configure event notification for each Amazon S3 bucket. Assuming you are still logged into the AWS Management Console:

Navigate to the Amazon S3 bucket for the region (Services > Storage > S3).
Select the bucket.
Click the Properties tab.
Click the Events settings box.
Click Add notification.
Configure event notification for the bucket as follows:
Name nav-send-metadata-on-change
Events
ObjectCreated(All)

ObjectRemoved(All)
Send to SQS Queue

SQS queue Enter the name of your queue

Defining and Attaching Policies

Event Notification Policy for Custom Queues

To enable event notification for an custom queue, create the following policy by copying the policy text and pasting it in the policy editor, and then attaching it to the Cloudera Navigator user in AWS.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Stmt1481678612000",
            "Effect": "Allow",
            "Action": [
                "sqs:DeleteMessage",
                "sqs:DeleteMessageBatch",
                "sqs:GetQueueAttributes",
                "sqs:ReceiveMessage"
            ],
            "Resource": "*"
        },
        {
            "Sid": "Stmt1481678744000",
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketLocation",
                "s3:ListAllMyBuckets",
                "s3:ListBucket",
                "s3:GetObject",
                "s3:GetObjectAcl",
                "s3:GetBucketNotification",
                "s3:PutBucketNotification"
            ],
            "Resource": [
                "arn:aws:s3:::*"
            ]
        }
    ]
}

Extraction Policies for Custom Queues

Custom configurations require a valid extraction policy be defined and attached to the AWS user account associated with the Amazon S3 bucket. The policy is a JSON document that specifies the type of extraction. As mentioned in Overview of Amazon S3 Extraction Processes, the two types of extraction are as follows:

Bulk + Incremental—This is the recommended approach for both cost and performance reasons and is used by the Default Configuration process automatically.
Bulk Only—This approach is recommended for proof-of-concept deployments. It is required for the BYOQ with Fan-out configuration. In addition to applying the policy as detailed below, this approach also requires setting the nav.s3.extractor.incremental.enable property to false. See Setting Properties with Advanced Configuration Snippets and the Cloudera Navigator Properties for Amazon S3 for details.

To configure the policy:

Log in to the AWS Management Console using the IAM user account associated with the target Amazon S3 bucket.
Copy the appropriate JSON text from the table below Extraction Policies JSON Referenceand paste into the AWS Management Console policy editor for the Navigator user account on (the IAM user) through the AWS Management Console.

Extraction Policies JSON Reference

Bulk + Incremental (Recommended)	Bulk Only
Initial bulk process extracts all metadata. Subsequent incremental process extracts changes only (CRUD). Cannot be used with Amazon S3 buckets that use event notification.	Must be used for Amazon S3 buckets configured for event notifications.
{ "Version": "2012-10-17", "Statement": [ { "Sid": "Stmt1481678612000", "Effect": "Allow", "Action": [ "sqs:CreateQueue", "sqs:DeleteMessage", "sqs:DeleteMessageBatch", "sqs:GetQueueAttributes", "sqs:GetQueueUrl", "sqs:ReceiveMessage", "sqs:SetQueueAttributes" ], "Resource": "" }, { "Sid": "Stmt1481678744000", "Effect": "Allow", "Action": [ "s3:GetBucketLocation", "s3:ListAllMyBuckets", "s3:ListBucket", "s3:GetObject", "s3:GetObjectAcl", "s3:GetBucketNotification", "s3:PutBucketNotification" ], "Resource": [ "arn:aws:s3:::" ] } ] }	{ "Version": "2012-10-17", "Statement": [ { "Sid": "Stmt1481676614000", "Effect": "Allow", "Action": [ "s3:GetBucketLocation", "s3:ListAllMyBuckets", "s3:ListBucket", "s3:GetObject", "s3:GetObjectAcl" ], "Resource": [ "arn:aws:s3:::*" ] } ] }

Bulk + Incremental (Recommended)

Bulk Only

Initial bulk process extracts all metadata.
Subsequent incremental process extracts changes only (CRUD).
Cannot be used with Amazon S3 buckets that use event notification.

Must be used for Amazon S3 buckets configured for event notifications.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Stmt1481678612000",
            "Effect": "Allow",
            "Action": [
                "sqs:CreateQueue",
                "sqs:DeleteMessage",
                "sqs:DeleteMessageBatch",
                "sqs:GetQueueAttributes",
                "sqs:GetQueueUrl",
                "sqs:ReceiveMessage",
                "sqs:SetQueueAttributes"
            ],
            "Resource": "*"
        },
        {
            "Sid": "Stmt1481678744000",
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketLocation",
                "s3:ListAllMyBuckets",
                "s3:ListBucket",
                "s3:GetObject",
                "s3:GetObjectAcl",
                "s3:GetBucketNotification",
                "s3:PutBucketNotification"
            ],
            "Resource": [
                "arn:aws:s3:::*"
            ]
        }
    ]
}

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Stmt1481676614000",
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketLocation",
                "s3:ListAllMyBuckets",
                "s3:ListBucket",
                "s3:GetObject",
                "s3:GetObjectAcl"
            ],
            "Resource": [
                "arn:aws:s3:::*"
            ]
        }
    ]
}

Setting Properties with Advanced Configuration Snippets

Certain features require additional settings or changes to the Cloudera Navigator configuration. For example, configuring BYOQ queues to use bulk-only extraction requires not only creating and attaching the extraction policy but also adding the following snippet to the Navigator Metadata Server Advanced Configuration Snippet (Safety Valve) for cloudera-navigator.properties setting:

nav.s3.extractor.incremental.enable=false

To change property values by adding an advanced configuration snippet:

Log in to the Cloudera Manager Admin Console.
Select Clusters > Cloudera Management Service.
Click Configuration.
Click Navigator Metadata Server under the Scope filter, and click Advanced under the Category filter.
Enter Navigator Metadata Server Advanced Configuration Snippet (Safety Valve) for cloudera-navigator.properties in the Search field to find the property.
Enter the property and its setting as a key-value pair, for example:
```
property=your_setting
```
in:
Click Save Changes.
Restart the Navigator Metadata Server instance.

Cloudera Navigator Extraction Behavior for Amazon S3

Objects from S3 buckets appear as you would expect directories and files to appear in Navigator. However, there are some behaviors that are specific to S3 source types:

Unnamed directories
It is possible to place files in unnamed directory in an S3 bucket, such as s3://mybucket//myfile. Navigator does not extract files inside unnamed directories.
Deleted implicit directories
When adding files to an S3 bucket, S3 may create implicit directories for the file if the directories specified in the file path do not already exist. When a file is deleted and its implicit directories are also removed on S3, Navigator will show the file as deleted but will not delete the implicit directories. You can filter these directories from the Navigator search results by setting implicit:false in the search query.
Inconsistency delays
Inconsistencies that occur in AWS can delay Navigator extraction of metadata and lineage from Amazon S3. When Cloudera Navigator detects an inconsistency, extraction may stop until the inconsistency is resolved in AWS. Cloudera Navigator will retry at the next scheduled extraction.

Cloudera Navigator Properties for Amazon S3

The table below lists the Navigator Metadata Server properties (cloudera-navigator.properties) that control extraction and other features related to Amazon S3. These properties can be set using the Cloudera Manager Admin Console to set properties in the advanced configuration settings.

Changing any of the values in the table requires a restart of Cloudera Navigator.

Option	Description
`nav.aws.api.limit`	Default is `5,000,000`. Maximum number of Amazon Web Services (AWS) API calls that Cloudera Navigator can make per month.
`nav.sqs.max_receive_count`	Default is `10`. Number of retries for inconsistent SQS messages (inconsistent due to eventual consistency).
`nav.s3.extractor.enable`	Default is `true` when an AWS credential has been configured to extract metadata from Amazon S3.
`nav.s3.extractor.incremental.auto_setup.enable`	Default is `true`. Enables Cloudera Navigator to set up Amazon SQS queues to receive notifications from Amazon S3 events. Set to `false` to disable the automatic setup and custom configure your own queue (BYOQ with Fan-out).
`nav.s3.extractor.incremental.batch_size`	Default is `1000`. Number of messages held in memory during the extraction process.
`nav.s3.extractor.incremental.enable`	Default is `true`. Enables incremental extraction. Setting to `false` disables incremental extraction and effectively enables bulk-only extraction.
`nav.s3.extractor.incremental.event.overwrite`	Default is `false`. Prevents any existing event notifications from being overwritten by Cloudera Navigator auto-generated queues created during default configuration process. Important: Do not set to `true` unless you fully understand the impact of overwriting event notifications. Setting to `true` may overwrite critical existing business logic.
`nav.s3.extractor.incremental.queues`	No default queue. Used by the custom configuration only. Specify a list of queues for the custom configuration using the JSON template. The list of queues should include the existing queues already in use and the newly configured queue that Cloudera Navigator will use for incremental extracts.
`nav.s3.extractor.max_threads`	Default is `3`. The number of extractors (worker processes) to run in parallel.
`nav.s3.home_region`	Default is `us-west-1`. AWS region closest to the cluster and the Cloudera Navigator instance. Select the same AWS region (or the nearest one geographically) to minimize latency for API requests.
`nav.s3.implicit.batch_size`	Default is `1000`. Number of Solr documents held in memory when updating the state of implicit directories.

Using Cloudera Navigator with Amazon S3

Cloudera Navigator APIs