In this tutorial you’ll learn how to analyze datasets using Data Analytics Studio (DAS) on Cloudera Data Platform (CDP) - Public Cloud. We will also visualize data using a common Business Intelligence (BI) tool - Tableau.
Compose: Worksheets are used to write SQL statements. You can execute SQL statements, view results and get visual explain for queries. You can also save worksheets for re-running later or download results as a CSV file.
Database: Provides detailed information about the structure and metadata of tables within databases.
Before we create a virtual warehouse, we need to make sure the environment is activated and running.
If you see next to the environment name, no need to activate it because it's already been activated and running.
Otherwise, click on to activate the environment. This will create the default database catalog; environment_name-default.
In the Virtual Warehouse section, click on to create a virtual warehouse:
- Name: usermarketing-demo
- Type: HIVE
- Virtual Warehouse Catalog: usermarketing-default
- Virtual Warehouse Size: XSMALL
- AutoSuspend Timeout: 300 seconds
- Nodes: Min: 2, Max: 4
- WaitTime: 30 seconds
- Click CREATE
Now that we have DAS opened, click on Compose, copy-paste the following DDL into the Worksheet, make one modification, then execute:
CREATE EXTERNAL TABLE IF NOT EXISTS usedcars ( id INT, year INT, manufacturer STRING, model STRING, price DECIMAL(12,2), odometer INT, color STRING, transmission STRING, cylinders STRING, fuel STRING, condition STRING, title_status STRING, region STRING, url STRING, description STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE -- Modify <storage-location> with Hive Metastore attribute: -- hive.metastore.warehouse.external.dir LOCATION '<storage-location>/usedcars' tblproperties ("skip.header.line.count"="1")
NOTE: The modification needed above is to specify the storage location for the CSV file. Download dataset and place it in the storage location.
- Hive Metastore attribute hive.metastore.warehouse.external.dir is s3a://usermarketing-cdp-demo/warehouse/tablespace/external/hive
- modify <storage-location> with s3a://usermarketing-cdp-demo/warehouse/tablespace/external/hive
- After I downloaded/decompressed dataset, I used AWS CLI to copy the file using command:
- aws s3 cp usedcars.csv s3://usermarketing-cdp-demo/warehouse/tablespace/external/hive/usedcars/usedcars.csv
Let's run a simple query to get the average price of cars, grouped by the model year. Copy-Paste the following query into the Worksheet and execute:
SELECT year, CAST(AVG(price) AS DECIMAL(12,2)) AS avg_price, count(*) AS num_cars FROM usedcars WHERE year > 1900 AND year <= 2021 GROUP BY year ORDER BY year DESC ;
Two things are needed to visualize data using Tableau.
- JDBC Driver (*.jar)
- JDBC URL
Both of these may be obtained by our virtual warehouse. Starting from CDP home page > Data Warehouse:
- Click on Overview
- Search for your Virtual Warehouse
- Click on
- select Copy JDBC URL or Download JDBC Jar
We will use Tableau for our BI tool. According to Tableau's website, the JDBC driver must be stored in specific location:
- Windows: C:\Program Files\Tableau\Drivers
- Mac: ~/Library/Tableau/Drivers
- Linux: /opt/tableau/tableau_driver/jdbc
To begin visualizing your data, Open Tableau:
- Select Other Databases (JDBC)
- URL: Past the URL you copied from your virtual warehouse
- Dialect: SQL92
- Username/Password: use credentials used to connect to DAS
- Sign In
In the Data Source tab:
- select default schema
- select and drag table usedcars to the top-right
Select Sheet 1 tab:
- select and drag Year into Columns
- select and drag Price into Rows
- modify measure for Price to use Average
Finally, for our finishing touches:
- click on Show Me and select area charts
- right-click on Avg. Price > Edit Axis > Uncheck Include zero