Your browser is out of date

Update your browser to view this website correctly. Update my browser now

×

Cloudera Tutorials

Optimize your time with detailed tutorials that clearly explain the best way to deploy, use, and manage Cloudera products. Login or register below to access all Cloudera tutorials.

 

Introduction

 

In this tutorial you’ll learn how to analyze datasets using Data Analytics Studio (DAS) on Cloudera Data Platform (CDP) - Public Cloud. We will also visualize data using a common Business Intelligence (BI) tool - Tableau.

 

Prerequisites

 

  • Have access to Cloudera Data Platform (CDP) Public Cloud with a Data Lake running.
  • Basic AWS CLI skills
  • (optionally) Installed Tableau. For this tutorial, we're using Tableau Desktop version 2020.2.0 for Windows.

 

Watch Video

 

The video below provides a brief overview of what is covered in this tutorial:

 

 

Become Familiar with Data Analytics Studio (DAS)

 

There are 3 tabs to interact with in Data Analytics Studio:

Queries: Allows you to view, search and run previously executed queries. Clicking on query provides additional information such as query details, visual explain, configurations and timeline.

das-option-queries

 

Compose: Worksheets are used to write SQL statements. You can execute SQL statements, view results and get visual explain for queries. You can also save worksheets for re-running later or download results as a CSV file.

das-option-compose

 

Database: Provides detailed information about the structure and metadata of tables within databases.

das-option-database

 

Create Virtual Warehouse

 

Before we create a virtual warehouse, we need to make sure the environment is activated and running.

If you see next to the environment name, no need to activate it because it's already been activated and running.

Otherwise, click on to activate the environment. This will create the default database catalog; environment_name-default.

In the Virtual Warehouse section, click on to create a virtual warehouse:

  1. Name: usermarketing-demo
  2. Type: HIVE
  3. Virtual Warehouse Catalog: usermarketing-default
  4. Virtual Warehouse Size: XSMALL
  5. AutoSuspend Timeout: 300 seconds
  6. Nodes: Min: 2, Max: 4
  7. WaitTime: 30 seconds
  8. Click CREATE
create-virtual-warehouse

 

Create and Populate Table

 

Open DAS from your virtual warehouse.

Beginning from CDP home page > Data Warehouse:

  1. Click on Overview
  2. Search for your Virtual Warehouse
  3. Click on 
  4. Open DAS
virtual-warehouse-options-das

 

Now that we have DAS opened, click on Compose, copy-paste the following DDL into the Worksheet, make one modification, then execute:

CREATE EXTERNAL TABLE IF NOT EXISTS usedcars (
  id            INT,
  year          INT,
  manufacturer  STRING,
  model         STRING,
  price         DECIMAL(12,2),
  odometer      INT,
  color         STRING,
  transmission  STRING,
  cylinders     STRING,
  fuel          STRING,
  condition     STRING,
  title_status  STRING,
  region        STRING,
  url           STRING,
  description   STRING
)

ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
-- Modify <storage-location> with Hive Metastore attribute:
--                      hive.metastore.warehouse.external.dir
LOCATION '<storage-location>/usedcars'
tblproperties ("skip.header.line.count"="1")

NOTE: The modification needed above is to specify the storage location for the CSV file. Download dataset and place it in the storage location.

For example:

  • Hive Metastore attribute hive.metastore.warehouse.external.dir is s3a://usermarketing-cdp-demo/warehouse/tablespace/external/hive
  • modify <storage-location> with s3a://usermarketing-cdp-demo/warehouse/tablespace/external/hive
  • After I downloaded/decompressed dataset, I used AWS CLI to copy the file using command:
    • aws s3 cp usedcars.csv s3://usermarketing-cdp-demo/warehouse/tablespace/external/hive/usedcars/usedcars.csv
aws-cli-output
das-create-table

 

Let's run a simple query to get the average price of cars, grouped by the model year. Copy-Paste the following query into the Worksheet and execute:

SELECT year, CAST(AVG(price) AS DECIMAL(12,2)) AS avg_price, count(*) AS num_cars
  FROM usedcars
  WHERE year > 1900
    AND year <= 2021
  GROUP BY year
  ORDER BY year DESC
;

 

das-query-cars

 

Visualize Data

 

Two things are needed to visualize data using Tableau.

  • JDBC Driver (*.jar)
  • JDBC URL

Both of these may be obtained by our virtual warehouse. Starting from CDP home page > Data Warehouse:

  1. Click on Overview
  2. Search for your Virtual Warehouse
  3. Click on 
  4. select Copy JDBC URL or Download JDBC Jar
virtual-warehouse-options-jdbc

 

We will use Tableau for our BI tool. According to Tableau's website, the JDBC driver must be stored in specific location:

  • Windows: C:\Program Files\Tableau\Drivers
  • Mac: ~/Library/Tableau/Drivers
  • Linux: /opt/tableau/tableau_driver/jdbc

 

To begin visualizing your data, Open Tableau:

  1. Select Other Databases (JDBC)
  2. URL: Past the URL you copied from your virtual warehouse
  3. Dialect: SQL92
  4. Username/Password: use credentials used to connect to DAS
  5. Sign In
tableau-connect

 

In the Data Source tab:

  1. select default schema
  2. select and drag table usedcars to the top-right
tableau-data-source

 

Select Sheet 1 tab:

  1. select and drag Year into Columns
  2. select and drag Price into Rows
  3. modify measure for Price to use Average
tableau-visualize-data

 

Finally, for our finishing touches:

  1. click on Show Me and select area charts
  2. right-click on Avg. Price > Edit Axis > Uncheck Include zero
tableau-visualize-final

 

Summary

 

Congratulations on completing the tutorial.

As you have seen, it is easy to analyze datasets using Data Analytics Studio (DAS) on Cloudera Data Platform (CDP). I'm sure you will agree that connecting a BI tool, in this case Tableau, was easy and painless.

 

Further Reading

 

Videos

Other

 

Your form submission has failed.

This may have been caused by one of the following:

  • Your request timed out
  • A plugin/browser extension blocked the submission. If you have an ad blocking plugin please disable it and close this message to reload the page.