Many enterprises have heterogeneous data platforms and technology stacks across different business units or data domains. For decades, they have struggled with the scale, speed, and correctness required to derive timely, meaningful, and actionable insights from vast and diverse big data environments. Despite various architectural patterns and paradigms, they still end up with perpetual “data puddles” and silos in many non-interoperable data formats. Constant data duplication, complex Extract, Transform & Load (ETL) pipelines, and sprawling infrastructure lead to prohibitively expensive solutions, adversely impacting the Time to Value, Time to Market, overall Total Cost of Ownership (TCO), and Return on Investment (ROI) for the business.
Cloudera’s open data lakehouse, powered by Apache Iceberg, solves the real-world big data challenges mentioned above by providing a unified, curated, shareable, and interoperable data lake that is accessible by a wide array of Iceberg-compatible compute engines and tools.
The Apache Iceberg REST Catalog takes this accessibility to the next level, simplifying Iceberg table data sharing and consumption between heterogeneous data producers and consumers via an open standard RESTful API specification.
The Cloudera open data lakehouse, powered by Apache Iceberg and the REST Catalog, now provides the ability to share data with non-Cloudera engines in a secure manner.
With Cloudera’s open data lakehouse, you can improve data practitioner productivity and launch new AI and data applications much faster with the following key features:
Data sharing is the capability to share data managed in Cloudera, specifically Iceberg tables, with external users (clients) who are outside of the Cloudera environment. You can share Iceberg table data with your clients, who can then access the data using third-party engines that support the Iceberg REST Catalog, such as Amazon Athena, Trino, Databricks, or Snowflake.
This blog describes how Cloudera shares data with an Amazon Athena notebook. Cloudera uses a Hive Metastore (HMS) REST Catalog service implemented based on the Iceberg REST Catalog API specification. This service can be made available to your clients by using the OAuth authentication mechanism defined by the Knox token management system and by using Apache Ranger policies to define the data shares for the clients. Amazon Athena will use the Iceberg REST Catalog Open API to execute queries against the data stored in Cloudera Iceberg tables.
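As a rough illustration of what an external client does under the hood, the sketch below builds the two requests an Iceberg REST client typically makes: an OAuth2 client-credentials token request and a load-table call. The endpoint paths follow the open Iceberg REST Catalog specification; the helper names are hypothetical, the base URI and credential are placeholders, and the token endpoint actually exposed by your Knox gateway may differ.

```python
from urllib.parse import quote, urlencode


def token_request(base_uri: str, credential: str) -> tuple[str, str]:
    """Return (url, form_body) for an OAuth2 client-credentials token request.

    `credential` uses the same "<client-id>:<client-secret>" form as the
    Spark catalog `credential` property.
    """
    client_id, _, client_secret = credential.partition(":")
    if not client_id or not client_secret:
        raise ValueError("credential must look like '<client-id>:<client-secret>'")
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
    })
    # /v1/oauth/tokens is the token path defined in the Iceberg REST spec;
    # that your gateway serves tokens at this path is an assumption.
    return f"{base_uri.rstrip('/')}/v1/oauth/tokens", body


def load_table_url(base_uri: str, namespace: str, table: str) -> str:
    """Return the URL a client would GET to load a table's current metadata."""
    return (
        f"{base_uri.rstrip('/')}/v1/namespaces/"
        f"{quote(namespace, safe='')}/tables/{quote(table, safe='')}"
    )
```

In practice the Spark Iceberg client performs these calls for you; the sketch is only meant to make the flow concrete.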
The following components in Cloudera on cloud should be installed and configured:
The following AWS prerequisites:
In this example, you will see how to use Amazon Athena to access data that is being created and updated in Iceberg tables using Cloudera.
Refer to the user documentation for installation and configuration of Cloudera Public Cloud.
Open HUE and execute the following to create a database and tables.
CREATE DATABASE IF NOT EXISTS airlines_data;
DROP TABLE IF EXISTS airlines_data.carriers;
CREATE TABLE airlines_data.carriers (
carrier_code STRING,
carrier_description STRING)
STORED BY ICEBERG
TBLPROPERTIES ('format-version'='2');
DROP TABLE IF EXISTS airlines_data.airports;
CREATE TABLE airlines_data.airports (
airport_id INT,
airport_name STRING,
city STRING,
country STRING,
iata STRING)
STORED BY ICEBERG
TBLPROPERTIES ('format-version'='2');
In HUE execute the following to load data into each Iceberg table.
INSERT INTO airlines_data.carriers (carrier_code, carrier_description)
VALUES
("UA", "United Air Lines Inc."),
("AA", "American Airlines Inc.")
;
INSERT INTO airlines_data.airports (airport_id, airport_name, city, country, iata)
VALUES
(1, 'Hartsfield-Jackson Atlanta International Airport', 'Atlanta', 'USA', 'ATL'),
(2, 'Los Angeles International Airport', 'Los Angeles', 'USA', 'LAX'),
(3, 'Heathrow Airport', 'London', 'UK', 'LHR'),
(4, 'Tokyo Haneda Airport', 'Tokyo', 'Japan', 'HND'),
(5, 'Shanghai Pudong International Airport', 'Shanghai', 'China', 'PVG')
;
In HUE execute the following query. You will see the 2 carrier records in the table.
SELECT * FROM airlines_data.carriers;
Create a policy that gives the “rest-demo” role read access to the Carriers table but no access to the Airports table.
In Ranger go to Settings > Roles to validate that your Role is available and has been assigned group(s).
In this case, I’m using a role named “UnitedAirlinesRole” to share data.
Add a Policy in Ranger > Hadoop SQL.
Create a new policy with the following settings, and be sure to save it.
6. Create an Amazon Athena notebook with the “Spark_primary” Workgroup
a. Provide a name for your notebook
b. Additional Apache Spark properties: these enable use of the Cloudera Iceberg REST Catalog. Select the “Edit in JSON” button. Copy the following and replace <cloudera-knox-gateway-node>, <cloudera-env-name>, <client-id>, and <client-secret> with the appropriate values. See the REST Catalog Setup blog to determine which values to use.
{
"spark.sql.catalog.demo": "org.apache.iceberg.spark.SparkCatalog",
"spark.sql.catalog.demo.default-namespace": "airlines",
"spark.sql.catalog.demo.type": "rest",
"spark.sql.catalog.demo.uri": "https://<cloudera-knox-gateway-node>/<cloudera-env-name>/cdp-share-access/hms-api/icecli",
"spark.sql.catalog.demo.credential": "<client-id>:<client-secret>",
"spark.sql.defaultCatalog": "demo",
"spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
}
c. Click the “Create” button to create the new notebook.
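Because a leftover placeholder in the Spark properties tends to surface only as a connection error later, it can help to fill in the values programmatically before pasting. The following is a minimal sketch, not part of the Cloudera or Athena tooling: `render_spark_properties` is a hypothetical helper, only the two properties that contain placeholders are shown, and the sample values are made up.

```python
import json
import re

# The two properties from the JSON above that contain placeholders.
TEMPLATE = {
    "spark.sql.catalog.demo.uri": (
        "https://<cloudera-knox-gateway-node>/<cloudera-env-name>"
        "/cdp-share-access/hms-api/icecli"
    ),
    "spark.sql.catalog.demo.credential": "<client-id>:<client-secret>",
}


def render_spark_properties(template: dict, values: dict) -> dict:
    """Substitute <placeholder> tokens and fail loudly if any remain."""
    rendered = {}
    for key, text in template.items():
        for name, value in values.items():
            text = text.replace(f"<{name}>", value)
        leftover = re.findall(r"<[^<>]+>", text)
        if leftover:
            raise ValueError(f"{key} still contains placeholders: {leftover}")
        rendered[key] = text
    return rendered


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    props = render_spark_properties(TEMPLATE, {
        "cloudera-knox-gateway-node": "gateway.example.com",
        "cloudera-env-name": "my-env",
        "client-id": "my-client-id",
        "client-secret": "my-client-secret",
    })
    print(json.dumps(props, indent=2))
```

Merge the rendered entries with the remaining properties before pasting them into the “Edit in JSON” box, and avoid committing real client secrets to source control.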
Run the following commands one at a time to see what is available from the Cloudera REST Catalog. You will be able to:
spark.sql("show databases").show()
spark.sql("use airlines_data")
spark.sql("show tables").show()
spark.sql("SELECT * FROM airlines_data.carriers").show()
In HUE execute the following to add a row to the Carriers table.
INSERT INTO airlines_data.carriers
VALUES("DL", "Delta Air Lines Inc.");
In HUE, execute the following query to confirm that the new row was added to the Carriers table.
SELECT * FROM airlines_data.carriers;
Execute the following query - you should see 3 rows returned. This shows that the REST Catalog will automatically handle any metadata pointer changes, guaranteeing that you will get the most recent data.
spark.sql("SELECT * FROM airlines_data.carriers").show()
Execute the following query. This query should fail, as expected, and will not return any data from the Airports table. The reason for this is that the Ranger Policy is being enforced and denies access to this table.
spark.sql("SELECT * FROM airlines_data.airports").show()
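If you would rather check the share in a single notebook cell than read a stack trace, the allowed and denied queries can be wrapped in a small helper. This is a sketch under stated assumptions: `can_read` and `share_report` are hypothetical names, `spark` is the session the notebook already provides, and any exception is treated as "no access", so a network or configuration error would also show up as False.

```python
def can_read(spark, table: str) -> bool:
    """Return True if a SELECT on `table` succeeds, False if it raises."""
    try:
        spark.sql(f"SELECT * FROM {table} LIMIT 1").collect()
        return True
    except Exception as exc:  # the exact exception type depends on the engine
        print(f"{table}: query failed ({type(exc).__name__})")
        return False


def share_report(spark, tables) -> dict:
    """Map each table name to whether the current client can read it."""
    return {table: can_read(spark, table) for table in tables}
```

In the notebook you might call `share_report(spark, ["airlines_data.carriers", "airlines_data.airports"])`. With the Ranger policy described earlier, the Carriers entry should come back readable and the Airports entry should not.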
In this post, we explored how to set up a data share between Cloudera and Amazon Athena. We used Amazon Athena to connect via the Iceberg REST Catalog to query data created and maintained in Cloudera.
Key features of the Cloudera open data lakehouse include:
Amazon Athena is a serverless, interactive analytics service that provides a simplified and flexible way to analyze petabytes of data where it lives. Amazon Athena also makes it easy to interactively run data analytics using Apache Spark without having to plan for, configure, or manage resources. When you run Apache Spark applications on Athena, you submit Spark code for processing and receive the results directly. Use the simplified notebook experience in the Amazon Athena console to develop Apache Spark applications using Python, or use the Athena notebook APIs. The Iceberg REST Catalog integration with Amazon Athena allows organizations to leverage the scalability and processing power of Apache Spark for data processing, analytics, and machine learning workloads on large datasets stored in Cloudera Iceberg tables.
For enterprises facing challenges with their diverse data platforms, who might be struggling with issues related to scale, speed, and data correctness, this solution can provide significant value. This solution can reduce data duplication issues, simplify complex ETL pipelines, and reduce costs, while improving business outcomes.
To learn more about Cloudera and how to get started, refer to Getting Started. Check out Cloudera’s open data lakehouse to get more information about the capabilities available, or visit Cloudera.com for details on everything Cloudera has to offer. For Amazon Athena, refer to Getting Started with Amazon Athena.