Self-Managed Cloudera SDX: Reference Architecture

This document describes the self-managed configuration of Cloudera Shared Data Experience (Cloudera SDX) on AWS in CDH 5.13. Self-managed Cloudera SDX can be used for CDH clusters launched with Cloudera Manager or Cloudera Director where you want to configure shared data for supported CDH services.

Steps Required to Use Self-Managed Cloudera SDX

  1. Create an Amazon Relational Database Service (RDS) instance, using MySQL as the database type, and set up your Hive metastore (HMS) and Sentry databases there. Make note of the host, port, user, password, and database name for each. All of the cluster creation activities described below should occur in the same Amazon Virtual Private Cloud (VPC) as the RDS. At RDS creation time, be sure to scale up your RDS configuration parameters, such as the maximum number of connections, according to the number of clusters that will share the RDS.
  2. When using Cloudera Manager to create a cluster, enable Kerberos. Once the cluster is created, configure Hive with the Hive Metastore Database Password, Hive Metastore Database Name, Hive Metastore Database User, Hive Metastore Database Port, and Hive Metastore Database Host. Then restart the Hive service. Do the same for Sentry. (Using SDX Sentry requires using SDX HMS.) Multiple clusters may be managed by a single Cloudera Manager instance; this is a supported SDX configuration.
  3. When using Cloudera Director to create a cluster, follow the SDX Director template. This template has a number of REPLACE-ME values. You will need to provide values for those according to your particular setup.
  4. Enable access to S3 for the hosts in your cluster in one of the following two ways:
    • For best security, configure all hosts with IAM role-based authentication that allows access to S3 so you do not use S3Guard and do not need to add the S3 connector service.
    • For clusters that do not use IAM role-based authentication, configure the Amazon S3 connector in your cluster. For your Credentials Protection Policy, choose Less Secure.

CDH Services with Cloudera SDX Support

This section describes support and limitations for using SDX with the Hive Metastore, Sentry, and Cloudera Navigator.

Hive Metastore

All Hive metastore (HMS) features are supported in SDX, including High Availability (HA) mode. There are some usage patterns to follow when using SDX HMS:
  • All tables should be stored in locations that are readable by all clusters. For instance, all tables could be stored in S3. Tables stored in HDFS will not work with HMS in SDX, since the distinct clusters will understand the URI of the table location to refer to their local HDFS. The data and metadata will end up incoherent in this case.
  • Concurrent overlapping access to the same piece of metadata should only be done when all accesses are reads; none can be writes. Two accesses overlap when they access some of the same metadata at the same time. Overlapping writes can create inconsistent metadata. There is no built-in mechanism to enforce this, so the users of different clusters must ensure that any overlapping accesses are exclusively reads.
SDX has been available for HMS since CDH 5.10. For more information, see the blog post, How To Set Up a Shared Amazon RDS as Your Hive Metastore. This blog post has links to some AWS setup materials and explains overlapping in more detail.

Sentry

Sentry is supported in SDX. Just as in previous versions, SDX Sentry supports setting permissions via Hue, Beeline, or impala-shell. SDX Sentry supports HMS HA and Sentry HA, and it can work between two clusters managed by distinct Cloudera Manager instances. Usage notes:
  • For SDX Sentry, HMS must also be configured to use SDX.
  • SDX Sentry permissions are not granted per-cluster: all clusters see the same permissions data. Every Hive service must share the same "Server Name for Sentry Authorization".
  • As in HMS, concurrent overlapping access is not supported unless all accesses are reads. This means that multiple clients should not concurrently set permissions on the same metadata. This is not enforced by the SDX system itself, so the users of different clusters must ensure that any overlapping accesses are exclusively reads. (Note that this is already true even of single-cluster systems, even without SDX.)

Cloudera Navigator

Cloudera SDX with Cloudera Navigator has a number of known limitations:
  • Cloudera Navigator with Cloudera SDX requires Cloudera Manager 5.13.1 and higher.
  • Cloudera SDX with Cloudera Navigator will only work between two or more clusters managed by a single Cloudera Manager. Cloudera Navigator must be co-located with that Cloudera Manager.
  • Cloudera Navigator 5.13.1 de-duplicates HMS entities based on the RDS URL of the HMS database. Cloudera Navigator will see the entities in two different HMSs that point to the same RDS but treat different RDS URLs as distinct. Support for this will be added in a future Cloudera Navigator release.
  • Cloudera Navigator does not support de-duplicating HMS entities in SDX mode in clusters upgraded from CDH5.11 or before. In this case, a fresh install is required for de-duplication.
  • Cloudera Navigator does not show cross-cluster lineage for Spark or Impala operations, only Hive.
Auditing, metadata, and lineage view will be available in Cloudera Manager 5.13.1.

Unsupported Setups

Some potential SDX setups are unsupported. These include:
  • Azure: SDX is only supported on AWS-based clusters in 5.13. The expectation is that your shared databases as well as your instances are running in AWS.
  • RDS peering: SDX is not currently supported with more exotic RDS access controls than "same AWS account; username/password authentication".
  • RDS DBMSs: Only MySQL is supported in RDS for SDX in 5.13.
  • CDH version: SDX currently only supports setups in which all clusters in the SDX are using the same CDH version. SDX is only available in CDH 5.13 or later.
  • Scaling: We have not tested the scalability limits of SDX.
  • File systems: Only S3 is supported in SDX on CDH 5.13, not any other distributed file systems.