X

Cloudera Tutorials

Optimize your time with detailed tutorials that clearly explain the best way to deploy, use, and manage Cloudera products. Login or register below to access all Cloudera tutorials.

By registering or submitting your data, you acknowledge, understand, and agree to Cloudera's Terms and Conditions, including our Privacy Statement.
By checking this box, you consent to receive marketing and promotional communications about Cloudera’s products and services and/or related offerings from us, or sent on our behalf, in accordance with our Privacy Statement. You may withdraw your consent by using the unsubscribe or opt-out link in our communications.

Cloudera acquires Octopai's platform to enhance metadata management capabilities

Read the press release

 

Introduction

 

Data Lifecycle - Data Reporting. We will analyze datasets using Data Warehouse on Cloudera Data Platform (CDP) - Public Cloud.

We will be looking at a synthetic data set for an electric car company. It contains information on a specific experimental motor used in different car models. It was determined that faulty materials were used in a batch of electric motors. We will do some data discovery and create an easy-to-read dashboard.

 

 

Prerequisites

 

 

 

Watch Video

 

The video below provides a brief overview of what is covered in this tutorial:

 

 

Download Assets

 

There are two (2) options in getting assets for this tutorial:

  1. Download a ZIP file

It contains only necessary files used in this tutorial. Unzip tutorial-files.zip and remember its location.

  1. Clone our GitHub repository

It provides assets used in this and other tutorials; organized by tutorial title.

 

Using AWS CLI, copy the following data files to your S3 bucket, defined by your environment’s storage.location.base attribute:

experimental-motor-enriched.csv

Note: You may need to ask your environment's administrator to get property value for storage.location.base.

For example, property storage.location.base has value s3a://usermarketing-cdp-demo; we will copy the files using the command:

aws s3 cp experimental-motor-enriched.csv s3://usermarketing-cdp-demo/experimental-motor-enriched.csv

 

output-awscli-copyfiles

 

Create a Virtual Warehouse

 

Before you can create a virtual warehouse, you need to make sure your environment is activated and running.

Beginning from the CDP Home Page, select Data Warehouse.

 

cdp-dw

 

If you see  next to your environment name, you’re all set - the Database Catalog has been created using the naming convention, environment-default.

Otherwise, click on  to activate the environment.

In the Virtual Warehouse section, click on  to create a virtual warehouse:

  1. Name: hive-virtual-warehouse
  2. Type: HIVE
  3. Database Catalog: environment-default
  4. Size: xsmall
  5. Install Data Visualization
  6. CREATE

 

virtual-warehouse-create

 

Create Table and Views

 

We will use HUE to create our table and views. The views will help us later when we create visual reports using Cloudera Data Visualization.

Beginning from the CDP Home Page, select Data Warehouse

  • Click on Overview
  • Search for your Virtual Warehouse, hive-virtual-warehouse
  • Click on 
  • Open Hue

 

virtual-warehouse-openhue

We’ve included file create-table.sql in download assets. It contains all SQL statements needed for this tutorial. For your convenience, we’ve also provided it below.

 

IMPORTANT: You must update <S3BUCKET> with the location of dataset and remove comment tag.

Enter and run the following SQL statements in the HUE editor:

CREATE DATABASE IF NOT EXISTS factory;
CREATE DATABASE IF NOT EXISTS reports;

DROP TABLE IF EXISTS factory.experimental_motor_enriched;
CREATE EXTERNAL TABLE factory.experimental_motor_enriched
  (  serial_no         STRING
   , vin               STRING
   , model             STRING
   , zip               INTEGER
   , customer_id       INTEGER
   , username          STRING
   , name              STRING
   , gender            CHAR(1)
   , email             STRING
   , occupation        STRING
   , birthdate         DATE
   , address           STRING
   , salary            FLOAT
   , sale_date         DATE
   , saleprice         FLOAT
   , latitude          STRING
   , longitude         STRING
   , factory_no        INTEGER
   , machine_no        INTEGER
   , part_no           STRING
   , local_timestamp   DOUBLE
   , status            STRING
  )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (  'separatorChar' = ','
                      , 'quoteChar'  = '"')
-- IMPORTANT: Update <S3BUCKET> with the location of dataset and remove comment tag.
-- LOCATION '<S3BUCKET>'
tblproperties('skip.header.line.count' = '1');



-- LIST OF ALL BETA ENGINES
CREATE VIEW reports.beta_engines AS
  SELECT model, latitude, longitude
    FROM factory.experimental_motor_enriched;



-- RECALL NEEDED:
-- Parts created BETWEEN October 22 and October 24 were found
-- to be defective and need to be recalled.
CREATE VIEW reports.beta_engine_recall AS
  SELECT sale_date, model, vin, name, address, latitude, longitude
    FROM factory.experimental_motor_enriched
    WHERE sale_date BETWEEN CAST('2020-10-22' AS DATE) AND CAST('2020-10-24' AS DATE)
    ORDER BY sale_date DESC;

 

hue-create-table-views

 

Create Dashboard using Cloudera Data Visualization

 

Beginning from the CDP Home Page, select Data Warehouse

  • Click on Overview
  • Search for your Virtual Warehouse, hive-virtual-warehouse
  • Click on 
  • Open Data Visualization

 

NOTE: You may need to logon using your CDP workload userid/password.

 

virtual-warehouse-openviz

 

Create Datasets

 

Starting from Cloudera Data Visualization home page, select DATA.

  1. Select Default Hive VW connection

  2. Select Datasets tab

  3. Select NEW DATASET

 

Create dataset(s) from the following tables:

Dataset Title: report_beta_engines

Dataset Source: From Table

Select Database: reports

Select Table: beta_engines

 

Dataset Title: report_engine_recall

Dataset Source: From Table

Select Database: reports

Select Table: beta_engine_recall

 

viz-create-dataset

 

Create Dashboard

 

  1. Select VISUALS
  2. Select NEW DASHBOARD

 

viz-new-dashboard

 

Let’s use Beta Engine Report for the title of this dashboard.

In the Dashboard Designer, select Visuals tab. We are going to create several visuals using data connection Default Hive VW.

 

viz-create-new-visual

 

Create five (5) visuals using dataset report_beta_engines:

Title: Location of all Sold Engines
Build Tab:

Graph Type: Interactive Map
Geo: latitude, longitude

Important: You need to have Mapbox token to use Interactive Map. If you receive a Mapbox Token Error, you may create your own from mapbox.com and setup as follows:

Settings Tab:

Map Server > Mapbox > Mapbox token for visual: token provided by Mapbox

 

viz-all-sold-engines

 

Title: TOTAL
Build Tab:

Graph Type: KPI
Measure: count(model)

 

The next three (3) visuals can be created just as you have been. Since they are similar to the previous, we will clone it and modify it as needed.
Click on  and select Clone.

 

Title: MODEL C
Build Tab:

Graph Type: KPI
Measure: count(model)
Filters: model in (‘Model C’)

 

Title: MODEL D
Build Tab:

Graph Type: KPI
Measure: count(model)
Filters: model in (‘Model D’)

 

Title: MODEL R
Build Tab:

Graph Type: KPI
Measure: count(model)
Filters: model in (‘Model R’)

 

viz-totals-all-engines

 

Create five (5) visuals using dataset report_engine_recall:

Title: Location of Engines to be Recalled
Build Tab:

Graph Type: Interactive Map
Geo: latitude, longitude

Important: You need to have Mapbox token to use Interactive Map. If you receive a Mapbox Token Error, you may create your own from mapbox.com and setup as follows:

Settings Tab:

Map Server > Mapbox > Mapbox token for visual: token provided by Mapbox

 

viz-engine-recall

 

Title: TOTAL RECALL
Build Tab:

Graph Type: KPI
Measure: #Record Count

 

The next three (3) visuals can be created just as you have been. Since they are similar to the previous, we will clone it and modify it as needed.
Click on  and select Clone.

 

Title: MODEL C
Build Tab:

Graph Type: KPI
Measure: #Record Count
Filters: model in (‘Model C’)

 

Title: MODEL D
Build Tab:

Graph Type: KPI
Measure: #Record Count
Filters: model in (‘Model D’)

 

Title: MODEL R
Build Tab:

Graph Type: KPI
Measure: #Record Count
Filters: model in (‘Model R’)

 

viz-totals-recalled-engines

 

We are done creating our dashboard. To save it, click on SAVE.

It should look like the following:

 

viz-dashboard-final

 

Summary

 

Congratulations on completing the tutorial.

You explored a small subset of services that Cloudera Data Platform (CDP) has to offer:

  • Cloudera Data Warehouse (Hive and Hue)
  • Cloudera Data Visualization

 

As you have experienced, it was easy to analyze datasets and create beautiful reports using Cloudera Data Visualization.

 

Your form submission has failed.

This may have been caused by one of the following:

  • Your request timed out
  • A plugin/browser extension blocked the submission. If you have an ad blocking plugin please disable it and close this message to reload the page.