Security on AWS: Best Practices

As discussed in Cloudera Enterprise in the Cloud, Cloudera clusters can be deployed to the cloud using any of the three leading cloud providers, including Amazon Web Services (AWS). Unlike on-premises clusters running securely under the complete control of your IT department, cloud-based clusters require you to manage greater risk.

For example, organizations must connect to the cloud provider and set up the instances to support the cluster, and then install and launch a Cloudera cluster. Whether the cluster runs for 5 minutes or 5 months, it must be configured so that it is accessible only to the appropriate people and processes and is always available when they need it.

Furthermore, although cloud providers can handle many aspects of security for you, Cloudera adds enterprise security capabilities for clusters beyond what cloud providers can offer. Cloudera customers can be in complete control of their security—managing encryption keys outside the control of the cloud provider or enabling users to authenticate to the cloud through the organization’s Active Directory (or other LDAP) server, for example. How can you get the best of both worlds—on premises and cloud—while ensuring security for your system and its data? This guide aims to help you do just that.

This guide focuses on security best practices for Cloudera clusters deployed to the Amazon Web Services (AWS) cloud.

Continue reading:

Preliminary Planning
Network Security
Transient Single-User Clusters Using Amazon S3
Persistent Multi-Tenant Clusters Using Amazon S3
Persistent Clusters Using HDFS
On-going Security Best Practices

Preliminary Planning

Organizations deploying clusters to any of the public clouds can achieve cost-savings, productivity, high availability, and many other benefits by carefully considering the best practices for different deployment patterns in the context of a specific use case.

In addition to meeting cost-savings and other goals, organizations must also ensure that cloud deployments meet all relevant privacy, integrity, and confidentiality requirements. For example, organizations in highly regulated industries may need to keep extensive audit trails and be able to track data lineage over time.

Identifying the security requirements and how to meet them in your cloud deployment starts by analyzing data inputs and outputs, the workload type, and the user profile:

Identify the people and processes that need to use the cluster: Are they members of the same division in your organization?
Identify your users and the specific levels of access to cluster resources and data needed so you can effectively shape your identity, authentication, and authorization requirements.
Do you need to comply with industry or government regulations for privacy, confidentiality, or other security requirements?
Do you need to be able to identify distinct users or processes that acted on any data as part of a complete audit trail, for example?
For a given cluster or for a specific dataset on Amazon S3, should all users who have access be allowed to see all the data? If not, you must set up a multi-tenant cluster.
Does your organization use an LDAP-based directory (for example, Microsoft Active Directory) for identity management, and if so, do you want to leverage that service when you deploy to the cloud? Or would your organization rather manage an additional set of credentials for all users, just for use in the cloud?
Identify the locations, formats, and structure of data sources.
Identify encryption mechanisms, keys, and other specific details about how you encrypt data at rest now or plan to in the future.
Test different sample workloads from your production system to determine the optimal deployment architecture.

These are just some of the questions to consider before deploying any cluster, with security in mind.

This guide highlights best practices for various architectural patterns identified by Cloudera that broadly distinguish between:

Lifetime of the cluster (transient or persistent)
Tenancy or usage profile (single-user or multi-tenant)

Other distinguishing characteristics of the architectural patterns are shown in the table below.

Lifetime	Tenancy	Key Components	Data Source/Target
Transient	Single-user	Apache Hive, Apache Spark, Hive on Spark, HDFS	Amazon S3
Persistent	Multi-tenant	Apache Impala, Apache Spark, Hive on Spark, HDFS	Amazon S3
Persistent	Multi-tenant	Apache HBase, Apache Spark	HDFS, Amazon S3

Regardless of type and architectural pattern, all cloud deployments to AWS must first consider network security.

Network Security

Deploying a cluster to the Amazon public cloud starts by configuring and securing the necessary network infrastructure hosted by the cloud provider. For clusters deployed to the Amazon Web Services cloud, this requires an Amazon Virtual Private Cloud (VPC).

Amazon automatically provisions a default VPC for each customer AWS account. The default VPC includes several related networking infrastructure entities, including a default subnet, default security group, default routing table, and so on. The defaults are fine for proof-of-concept deployments, but follow the best practices below for production systems.

Setting up secure networking from your premises to AWS is critical to the security of both your corporate network or data center and the cluster you deploy to the Amazon cloud. Cloudera recommends the following:

Create and Configure a VPC

Use Amazon Identity and Access Management (IAM) to create separate user accounts for the various divisions in your organization that will deploy clusters to the cloud. Do not create all your cloud instances under your root Amazon account but instead create an IAM admin user and group.

Create a VPC. The VPC will support the instances you want to deploy to the cloud, including an instance needed for Altus Director (if you plan to use that deployment tool), and for the specific EC2 instances that you will create to support the cluster or clusters, for specific workloads.
Create public and private subnets to isolate traffic within the VPC. Plan out the IP addresses you will need for the security groups needed to secure the cluster.
Add and configure a VPC Endpoint so the private subnet can connect to your Amazon S3 storage.
Add a VPN (virtual private network) to the VPC to securely connect your on-premises data center to the Amazon cloud. The VPN lets your on-premises network communicate securely with the VPC. Amazon offers four types of VPN Connections, so pick the one that’s best for your use case:
- AWS hardware VPN: Supports IPsec VPN connections
- AWS Direct Connect: Use this for a dedicated secure connection between your corporate network and Amazon AWS. This choice requires coordination between your organization’s network infrastructure team and Amazon AWS, as well as hardware setup and configuration. See AWS Direct Connect for details.
- AWS VPN CloudHub: Supports multiple remote networks. Use this if you have several remote branch offices, for example, that you want to connect to the VPC.
- Software VPN: Runs on an EC2 instance using third-party software.
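
The subnet planning step above can be sketched with Python's standard `ipaddress` module. This is a minimal illustration, not a prescribed layout: the `10.0.0.0/16` VPC range, the `/24` subnet size, and the names `public_subnet`, `private_subnet`, and `usable_hosts` are all assumptions chosen for the example. The only AWS-specific fact it relies on is that AWS reserves five addresses in every subnet.

```python
import ipaddress

# Hypothetical VPC address range; substitute the CIDR block you actually assign.
vpc = ipaddress.ip_network("10.0.0.0/16")

# Split the VPC range into /24 blocks and take the first two.
blocks = vpc.subnets(new_prefix=24)
public_subnet = next(blocks)   # edge (gateway) nodes
private_subnet = next(blocks)  # master, worker, and management nodes


def usable_hosts(subnet):
    """Return the addresses left after the 5 that AWS reserves per subnet."""
    return subnet.num_addresses - 5


for label, subnet in (("public", public_subnet), ("private", private_subnet)):
    print(label, subnet, usable_hosts(subnet))
```

Running a check like this before creating the subnets helps confirm that the private subnet has enough addresses for the largest cluster you expect to launch.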

Create Security Groups

A security group is the Amazon VPC mechanism that acts as a whitelist for the VPC. The security group contains rules that you define for the port numbers and the protocols allowed for inbound and outbound network traffic. For example, the default VPC has a default security group that allows all outbound traffic but allows inbound traffic only from instances assigned to the same security group. Cloudera recommends creating two different security groups in the VPC, as follows:

Create one security group for the cluster’s edge node (or nodes, for high availability configurations). Also known as the “gateway node,” an edge node runs instances of specific gateway roles that let end-users and applications use cluster services.
Give the edge node security group unlimited outbound access to the public internet.
Limit inbound access to specific IP addresses from your corporate network or other approved IP addresses.
Use IP addresses from the public subnet to make specific gateway roles accessible to users.
Create a second security group for the other nodes of the cluster—the master nodes, worker nodes, and management nodes. The EC2 instances comprising the cluster must be able to communicate with each other through various ports, so these can be configured in the private subnet.
Give this security group outbound access to the internet, for use with other AWS services and Amazon S3 storage and to access external repositories for software updates. For example, the EC2 instance that’s used for Altus Director must be able to access the software repository to download and install the Altus Director software.
Use private IP addresses for the nodes to communicate internally.
Do not use a public IP address for Cloudera Manager.
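
The inbound rule for the edge node security group can be reviewed offline before it is applied. The sketch below models rules as plain dictionaries and is not an AWS API: the rule format, the `203.0.113.0/24` range (a documentation-only block standing in for a corporate network), and the helper names `open_to_world` and `is_allowed` are hypothetical.

```python
import ipaddress

# Hypothetical inbound rules for the edge-node security group.
# 203.0.113.0/24 is a documentation range standing in for a corporate network.
EDGE_INBOUND_RULES = [
    {"protocol": "tcp", "port": 22, "cidr": "203.0.113.0/24"},
    {"protocol": "tcp", "port": 443, "cidr": "203.0.113.0/24"},
]


def open_to_world(rules):
    """Return the rules whose source range covers every address (prefix length 0)."""
    return [r for r in rules if ipaddress.ip_network(r["cidr"]).prefixlen == 0]


def is_allowed(rules, source_ip, port, protocol="tcp"):
    """Return True if any rule admits this source address, port, and protocol."""
    addr = ipaddress.ip_address(source_ip)
    return any(
        r["protocol"] == protocol
        and r["port"] == port
        and addr in ipaddress.ip_network(r["cidr"])
        for r in rules
    )


print(is_allowed(EDGE_INBOUND_RULES, "203.0.113.10", 22))
print(open_to_world(EDGE_INBOUND_RULES))
```

A non-empty result from `open_to_world` would indicate a rule that contradicts the recommendation to limit inbound access to approved IP addresses.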

For more information, see:

Cloudera Enterprise Reference Architecture for AWS Deployments, specifically, the “Networking, Connectivity, and Security” section.
Getting Started on Amazon Web Services (AWS) in the Altus Director documentation, specifically, Setting up the AWS Environment.
Amazon Virtual Private Cloud (VPC) and Amazon EC2 (Network and Security) documentation.

Transient Single-User Clusters Using Amazon S3

Architectural Pattern	Transient single-user clusters backed by Amazon S3
Cluster Services	Apache Hive, Apache Spark, Hive on Spark, and HDFS
Dependencies	Altus Director running on a persistent Amazon EC2 instance
Use Case	Data engineering, ETL for data warehousing, data pipelines

Transient single-user clusters are ideal for extract, transform, and load (ETL) data pipelines and other workloads that have relatively short durations. Cloud resources are shut down when the workload completes. The processes may be initiated and managed by a single user or a small group, or launched by means of a cron job.

In general, transient single-user clusters make sense for any workload that has a limited lifespan and small number of users with equivalent access privileges. These are also comparatively easy to secure—only coarse-grained authorization privileges are needed because the assumption is that anyone given access to the cluster is entitled to access all of the data associated with that cluster.

Continue reading:

Example Use Case
Identity, Authentication, and Authorization
Encryption (Data in Transit)
Encryption (Data at Rest)
Auditing

Example Use Case

Assume that data from several different applications and database systems, including extracts from an Oracle database, customer reports from Salesforce, and .csv files from legacy systems, is uploaded to Amazon S3 (Object Storage), to a raw_input bucket.

Twice a week, a member of the etl_team uses Altus Director to launch EC2 instances, spins up the cluster, runs the workload, and then shuts down the cluster when the workload completes. As the job runs on the cluster, results are written back to Amazon S3 to the etl-results bucket.

Given the small number of possible users—all of whom have the same permissions to the source data and to the resources needed to run the job—this workload can use Amazon Identity and Access Management for identity, authentication, and authorization within the cluster, and can use Altus Director to launch and manage the cluster when needed.

Identity, Authentication, and Authorization

For transient single-user clusters, use Amazon Identity and Access Management (IAM) to create an IAM role that can be used to launch the EC2 instances and run all aspects of the workload.

The Amazon IAM role takes care of both authentication and authorization for the cluster in the Amazon cloud, and access to the Amazon S3 storage bucket. When you set up an IAM role and use the profile to launch the cluster as described below, any user logging into the cluster can access all data in the Amazon S3 bucket specified in the policy, without the need to provide any other credentials.

The setup process is generally as follows:

Use your AWS Management Console to create an IAM role for EC2. When you do this, the console creates the role and an instance profile that will be available to use when you launch your EC2 instances.
Add a policy to the Amazon S3 bucket (Bucket Policy) that specifies what authenticated users (the IAM role is authenticated by AWS) can do.
Important: Any user logging in to the cluster can access all data in the Amazon S3 bucket specified in the policy, without the need to provide any other credentials. Be aware of this exposure and make sure this approach is appropriate for your use case.
For example, here’s a policy that lets the IAM role list, read, write, and delete objects in the Amazon S3 etl-results bucket. Both bucket-name and bucket-name/* are required in the Resource list, as shown in this example (etl-results/*, etl-results):
```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:DeleteObject",
                "s3:GetObject",
                "s3:GetObjectAcl",
                "s3:PutObject",
                "s3:PutObjectAcl",
                "s3:GetBucketAcl",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::etl-results/*",
                "arn:aws:s3:::etl-results"
            ]
        }
    ]
}
```
Launch the EC2 instances using the instance profile created for the IAM role. For ease of deployment, use Altus Director to launch transient single-user clusters. Altus Director provides two different tools for deploying clusters: Altus Director UI or the Altus Director command-line:
- Altus Director UI is a web-server-hosted console that can be accessed at https://your.instance.hosting.director.ui:7187. Enter the profile name in the Advanced Options section of an instance template.
- Altus Director command-line lets you submit the details in a cluster configuration file to the Altus Director server. For example, here is the iamProfileName setting from the sample template for the Cloudera Enterprise Reference Architecture for AWS Deployments (obtain scripts from GitHub's Altus Director scripts section):
```
...
# Name of the IAM Role to use for this instance type
# iamProfileName: iam-profile-REPLACE-ME
iamProfileName: etl_team
...
```

Whether deployed using the Altus Director UI console or the command-line, the EC2 instance is launched using the instance profile (‘iamProfileName’) containing the IAM role, and the result is that your EC2 instances can all use the Amazon S3 bucket associated with the profile.
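
Because the policy needs both the bucket ARN and the bucket-name/* ARN in its Resource list, it can help to generate the document rather than edit it by hand. The following is a minimal sketch using only the Python standard library. The function names `build_bucket_access_policy` and `missing_resources` are hypothetical, and the action list simply mirrors the example policy in the setup steps.

```python
import json

# Actions copied from the example policy in the setup steps.
ACTIONS = [
    "s3:DeleteObject",
    "s3:GetObject",
    "s3:GetObjectAcl",
    "s3:PutObject",
    "s3:PutObjectAcl",
    "s3:GetBucketAcl",
    "s3:ListBucket",
]


def build_bucket_access_policy(bucket_name):
    """Build a policy document granting ACTIONS on the bucket and its objects."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(ACTIONS),
                # Object-level actions need the /* ARN; bucket-level ones need the bare ARN.
                "Resource": [f"{bucket_arn}/*", bucket_arn],
            }
        ],
    }


def missing_resources(policy, bucket_name):
    """Return the required ARNs that do not appear in any statement of the policy."""
    required = {f"arn:aws:s3:::{bucket_name}", f"arn:aws:s3:::{bucket_name}/*"}
    present = set()
    for statement in policy["Statement"]:
        resources = statement["Resource"]
        if isinstance(resources, str):
            resources = [resources]
        present.update(resources)
    return sorted(required - present)


policy = build_bucket_access_policy("etl-results")
print(json.dumps(policy, indent=4))
```

Calling `missing_resources` on a hand-edited policy is a quick way to catch a dropped or mistyped ARN before the policy is attached.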

Encryption (Data in Transit)

Cloudera generally recommends configuring TLS/SSL to encrypt network communications. However, for transient clusters that may have no active management involved and for which the Cloudera Manager instance has been deployed merely to facilitate cluster installation and deployment, TLS/SSL may not be a strict requirement. For short-lived clusters dedicated to limited sets of processing tasks running under the control of a single user account or an IAM role, the setup required for TLS/SSL may be more costly in terms of time than the use case demands.

If you do have a strict requirement to encrypt communications within the cluster, however, you can script this yourself and use Altus Director to automate the process. See How-to: Deploy a Secure Enterprise Data Hub on AWS for more information.

Encryption (Data at Rest)

Encrypt data-at-rest for any production cluster. For clusters backed by Amazon S3, Cloudera supports:

Server-Side Encryption with Amazon S3-Managed Keys (SSE-S3)—Keys are generated and managed by Amazon.
Server-Side Encryption with AWS Key Management Service (SSE-KMS)—Keys are created and managed using AWS Key Management Service (AWS KMS). Supply the key name in the IAM profile (the IAM user launching the cluster must have permission to use the key).

Cloudera clusters fully support both these options, so use the mechanism that makes the most sense for your specific use case.
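
As a rough illustration of how the two options differ in cluster configuration, the sketch below renders Hadoop-style property entries for each mode. The property names `fs.s3a.server-side-encryption-algorithm` and `fs.s3a.server-side-encryption.key` come from the Apache Hadoop S3A connector and are an assumption here, not something this guide specifies. Newer Hadoop releases use different names, so verify them against the documentation for your release. The helper names and the key alias are hypothetical.

```python
import xml.etree.ElementTree as ET


def s3a_encryption_properties(mode, kms_key=None):
    """Map an encryption mode to S3A properties (names assumed from Apache Hadoop)."""
    if mode == "SSE-S3":
        return {"fs.s3a.server-side-encryption-algorithm": "AES256"}
    if mode == "SSE-KMS":
        props = {"fs.s3a.server-side-encryption-algorithm": "SSE-KMS"}
        if kms_key:
            props["fs.s3a.server-side-encryption.key"] = kms_key
        return props
    raise ValueError(f"unsupported mode: {mode}")


def to_hadoop_xml(props):
    """Render properties in the <configuration>/<property> layout Hadoop uses."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# "alias/hypothetical-etl-key" is a placeholder, not a real key.
print(to_hadoop_xml(s3a_encryption_properties("SSE-KMS", "alias/hypothetical-etl-key")))
```

With SSE-S3 only the algorithm is set, because Amazon generates and manages the keys. With SSE-KMS the key reference is supplied as well.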

For more information:

How to Configure Encryption for Amazon S3 in Cloudera Security
Writing Encrypted Data to Secure S3 Buckets From Altus Jobs in the Cloudera Altus documentation
Protecting Data Using Server-Side Encryption in the Amazon AWS documentation

Auditing

Transient clusters that process data pipelines and ETL jobs typically do not require extensive audit trails. You can enable Amazon S3 logging (through the AWS Management Console) to track events that occur on the Amazon S3 storage bucket. However, AWS has limited auditing capabilities, so be aware of the limitations in the context of your workload. For example, according to Amazon (Server Access Logging documentation), although Amazon S3 server logs can provide “an idea of the nature of traffic against your bucket,” they are not meant as “a complete accounting of all requests.”

Persistent Multi-Tenant Clusters Using Amazon S3

Architectural Pattern	Persistent multi-tenant clusters backed by Amazon S3
Cluster Services	Apache Impala, Apache Hive, Hive on Spark, and HDFS
Dependencies	Kerberos, Sentry, TLS/SSL
Use Case	Data Warehouse

From a security perspective, persistent clusters make sense for use cases in which:

part (but not all) of a given dataset must be shared but at a very granular level—not only files, but specific columns and rows within tables;
the number of clusters needed to support each user (or each group of users with the same permissions) would be too costly and unmanageable;
casual users do not want to learn about cloud infrastructure, or spin-up their own clusters, or wait for a cluster to start up; or
users want to get access to the cluster quickly, based on their identity in the organization’s enterprise directory.

The best practices included in this section are the same as Cloudera recommendations for conventional on-premises clusters with some exceptions, where support for the Amazon S3 storage is not yet fully implemented in a Hadoop ecosystem component or security mechanism.

Continue reading:

Example Use Case
Identity
Authentication
Authorization
Encryption (Data in Transit)
Encryption (Data at Rest)
Auditing