Guidelines for Using SDX Namespaces
- When you create databases and tables, use the LOCATION attribute to write the data to cloud storage.
When you create a database or table that you want to make accessible to other clusters that share the SDX namespace, you must create the database or table in cloud storage. Use the LOCATION attribute to indicate the location in cloud storage where you want to create the database or table.
If you do not provide the location, the database or table is created in a default location in HDFS in the cluster. When the cluster is terminated, the HDFS databases and tables are lost.
The following examples show the CREATE DATABASE statement with the LOCATION attribute pointing to S3 and ADLS locations:
CREATE DATABASE databasename LOCATION 's3a://path-to-aws-s3/dir'
CREATE DATABASE databasename LOCATION 'adl://path-to-azure-adlgen1/dir'
CREATE DATABASE databasename LOCATION 'abfs(s)://path-to-azure-adlgen2/dir'
For more information, see CREATE DATABASE Statement.
The following examples show the CREATE TABLE statement with the LOCATION attribute pointing to S3 and ADLS locations:
CREATE EXTERNAL TABLE tablename LOCATION 's3a://path-to-aws-s3/dir/table_data'
CREATE EXTERNAL TABLE tablename LOCATION 'adl://path-to-azure-adlgen1/dir/table_data'
CREATE EXTERNAL TABLE tablename LOCATION 'abfs(s)://path-to-azure-adlgen2/dir/table_data'
For more information, see CREATE TABLE Statement.
To view the location attribute of a database, use the DESCRIBE DATABASE statement:
DESCRIBE DATABASE databasename
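As a fuller sketch of this guideline, the following statements create a database and an external table in S3 and then check where each one is stored. The names sales_db and orders, the column list, the Parquet file format, and the example-bucket path are hypothetical placeholders chosen for illustration; they are not values defined by Altus.

```sql
-- Hypothetical database, table, and bucket names; replace them with your own.
CREATE DATABASE IF NOT EXISTS sales_db
  LOCATION 's3a://example-bucket/warehouse/sales_db';

-- EXTERNAL keeps the data files in cloud storage under the given path.
CREATE EXTERNAL TABLE IF NOT EXISTS sales_db.orders (
  order_id    BIGINT,
  customer_id BIGINT,
  order_total DECIMAL(10,2)
)
STORED AS PARQUET
LOCATION 's3a://example-bucket/warehouse/sales_db/orders';

-- Confirm that both locations point to cloud storage and not to HDFS.
DESCRIBE DATABASE sales_db;
DESCRIBE FORMATTED sales_db.orders;
```

In the DESCRIBE FORMATTED output, the Location field should show the s3a:// path. A path that starts with hdfs:// indicates that the table data would be lost when the cluster is terminated.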
- When you create interim tables in HDFS, use unique names.
Problems can arise when different clusters using the same SDX namespace create interim Hive tables with the same name in HDFS. For example, a cluster creates a table named Temp to store temporary data in HDFS. When a job in another cluster also creates a table named Temp to store temporary data in HDFS, the job can fail or it can overwrite data in the table.
To avoid naming conflicts, use names unique to the cluster. An easy way to make a table name unique is to include the cluster ID in the table name.
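The naming approach can be sketched as follows. The table name, the columns, and the suffix c1a2b3 are hypothetical; the suffix stands in for a cluster ID, and the actual ID format may differ. If the cluster ID contains characters that are not valid in a table name, such as hyphens, one option is to replace them with underscores.

```sql
-- "c1a2b3" is a placeholder for the ID of the cluster that runs the job.
CREATE TABLE temp_orders_c1a2b3 (
  order_id    BIGINT,
  order_total DECIMAL(10,2)
);

-- Drop the interim table when the job no longer needs it.
DROP TABLE IF EXISTS temp_orders_c1a2b3;
```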
- Avoid concurrent updates by multiple clusters to the same schema, table, or partitions in a table.
The Altus SDX service does not manage the metadata updates made by different clusters. It does not have a mechanism to lock the metadata to prevent simultaneous updates by multiple clusters. Data conflicts and errors can arise if multiple clusters sharing an SDX namespace access a dataset at the same time and perform conflicting updates.
For example, problems can arise if multiple clusters concurrently update the same table or the same partitions within a table, or if they add or change the same schema or database.
Run your workloads in a way that ensures that multiple clusters do not make overlapping data or metadata changes.
- Refresh Altus Data Warehouse clusters after other clusters make changes to the metadata.
If you configure an Altus Data Warehouse cluster to use an SDX namespace that is also used by other clusters, keep in mind that the Impala service in the Altus Data Warehouse cluster keeps a local HDFS metadata cache.
If another cluster modifies the dataset for the Altus Data Warehouse cluster, Altus SDX updates the SDX namespace with the change in the metadata. However, because the change is made in another cluster, the metadata cache in the Altus Data Warehouse cluster is not updated. You must run the Impala REFRESH or INVALIDATE METADATA statement to incorporate the latest updates.
For example, you create an Altus Data Warehouse cluster and configure it to use an SDX namespace. Then you start another cluster configured to use the same SDX namespace and add a column to a table in the dataset. Altus SDX updates the metadata in the SDX namespace with the new column. However, because the change to the table is done outside the Altus Data Warehouse cluster, the metadata cache in the Altus Data Warehouse cluster is not updated with the new column.
To avoid errors with obsolete metadata, refresh or invalidate the metadata in the Altus Data Warehouse cluster to get the latest metadata from the SDX namespace.
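A minimal sketch of the refresh step, run from Impala in the Altus Data Warehouse cluster. The names sales_db, orders, and returns are hypothetical. Which statement is appropriate depends on the kind of change made by the other cluster; the comments describe typical usage and are not an exhaustive rule.

```sql
-- Typically sufficient when another cluster added data files or altered
-- an existing table, for example by adding a column.
REFRESH sales_db.orders;

-- Typically needed when another cluster created a table that Impala in this
-- cluster has not seen yet. It discards the cached metadata for the table
-- so that it is reloaded the next time the table is used.
INVALIDATE METADATA sales_db.returns;
```

Naming the table limits the reload to that table. Running INVALIDATE METADATA without a table name reloads the metadata for all tables, which can be expensive when the namespace contains many tables.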
- Ensure that interim local tables in HDFS are deleted before you terminate a cluster.
When you write interim data to a table stored in HDFS in a cluster, the metadata for the interim files is stored in the SDX namespace. If you terminate the cluster, Altus SDX does not delete the metadata for these tables from the SDX namespace.
The metadata for the HDFS tables remains in the SDX namespace and can cause data conflicts and errors for other clusters. For example, a job in another cluster that uses the same SDX namespace might try to read data in the tables and encounter errors because the HDFS locations do not exist or are not valid.
To avoid errors with orphaned metadata in an SDX namespace, delete all tables created in HDFS before you terminate a cluster.
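One way to carry out this cleanup is sketched below. The database name sales_db and the table name temp_orders_c1a2b3 are hypothetical; substitute the interim tables that your jobs created.

```sql
-- List the tables in the database to find interim tables.
SHOW TABLES IN sales_db;

-- Check the Location field. A path that starts with hdfs:// means the
-- table data is stored in the cluster and is lost at termination.
DESCRIBE FORMATTED sales_db.temp_orders_c1a2b3;

-- Remove the interim table so that its metadata does not remain
-- in the SDX namespace after the cluster is terminated.
DROP TABLE IF EXISTS sales_db.temp_orders_c1a2b3;
```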
- Do not delete an SDX namespace that is being used by an Altus cluster.
When you delete a configured SDX namespace, Altus deletes the SDX namespace in Altus and the associated SDX Sentry administrator group. Altus does not delete the Hive metastore or the Sentry database that the configured SDX namespace points to. If you delete a configured SDX namespace, Altus clusters lose access to the Hive metastore and Sentry database that the SDX namespace points to. Before you delete an SDX namespace, verify that the namespace is not used by any Altus cluster.
Other CDH clusters that share the same Hive metastore and Sentry database, such as clusters created using Director, are not affected when the configured SDX namespace is deleted and can still access the metadata.
- When you use S3Guard for clusters that store data in AWS S3, use the same S3Guard instance for all clusters that share an SDX namespace.
Using the same S3Guard instance for clusters that share the same SDX namespace ensures that the clusters see the metadata stored in AWS S3 in a consistent manner.
You can set clusters to use S3Guard by enabling S3Guard consistency in the Altus environment that you use to create them. When you create clusters using the same Altus environment with S3Guard consistency enabled, the clusters you create use the same S3Guard instance.
If you create clusters using different Altus environments with S3Guard consistency enabled, set the environments to use the same S3Guard instance if you want the clusters to share the same SDX namespace.