X

Cloudera Tutorials

Optimize your time with detailed tutorials that clearly explain the best way to deploy, use, and manage Cloudera products. Login or register below to access all Cloudera tutorials.

By registering or submitting your data, you acknowledge, understand, and agree to Cloudera's Terms and Conditions, including our Privacy Statement.
By checking this box, you consent to receive marketing and promotional communications about Cloudera’s products and services and/or related offerings from us, or sent on our behalf, in accordance with our Privacy Statement. You may withdraw your consent by using the unsubscribe or opt-out link in our communications.

Cloudera acquires Octopai's platform to enhance metadata management capabilities

Read the press release

 

Introduction

 

In this tutorial we’ll analyze datasets using Data Warehouse on Cloudera Data Platform (CDP) - Public Cloud. We will also visualize our data using Cloudera Data Visualization.

We will be looking at a synthetic data set for a mock company that has various high-tech factories across the world. It was recently noticed that there's been a downturn in revenue for the current fiscal year. We've been tasked to dig into the data to try and find any potential causes as to why this may be.

 

 

Prerequisites

 

  • Have access to Cloudera Data Platform (CDP) Public Cloud with predefined Cloudera Engineering (CDE) Virtual Cluster
  • Have created a CDP workload User
  • Ensure proper CDE role access
    • DEUser: access virtual cluster and run jobs
  • Basic AWS CLI skills

 

 

Watch Video

 

The video below provides an overview of what will be covered in this tutorial:

Download Assets

You have two (2) options to get the assets needed for this tutorial:

  1. Download a ZIP file

It contains only necessary files used in this tutorial. Unzip tutorial-files.zip and remember its location.

  1. Clone our GitHub repository

It provides assets used in this and other tutorials; organized by tutorial title.

 

Using AWS CLI, copy the following data files to your S3 bucket, defined by your environment’s storage.location.base attribute:

daily_employee_logs_export.csv
employee_metadata_export.csv
employee_sicktime_export.csv
employee_vacation_export.csv
factory_ambient_export.csv
factory_revenue_export.csv
machine_throughput_export.csv
machine_uptime_export.csv

Note: You may need to ask your environment's administrator to get property value for storage.location.base.

For example, property storage.location.base has value s3a://usermarketing-cdp-demo, therefore we will copy the files using the command:

aws s3 cp . s3://usermarketing-cdp-demo/ --recursive --exclude "*" --include "*.csv"

 

aws-cli-copy-files

 

Create an Impala Virtual Warehouse

 

Before we create a virtual warehouse, we need to make sure your environment is activated and running.

Beginning from CDP Home Page, select Data Warehouse.

If you see  next to the environment name, no need to activate it because it's already been activated and running.

Otherwise, click on  to activate the environment. This will create a default database catalog named <environment_name>-default.

In the Virtual Warehouse section, click on  to create a virtual warehouse:

  1. Name: Mock-Company
  2. Type: IMPALA
  3. Virtual Warehouse Catalog: environment_name-default
  4. Virtual Warehouse Size: XSMALL
  5. Install Data Visualization
  6. CREATE

 

virtual-warehouse-create

 

Introduction to Hue

 

Hue is a web-based interactive query editor that enables you to interact with databases and data warehouses.

 

Most commonly used interface is Editor:

  1. Selecting SQL icon  will display all existing database objects.
    You can get more detailed information by selecting on the object (i.e. database, table, view).
  2. Selecting Documents  will display all save queries
  3. Type a query in the editor panel and click the run icon  to run the query.

 

hue-intro-databases

 

Run Spark job using Cloudera Data Engineering

 

You must already have a Cloudera Data Engineering (CDE) Virtual Cluster created. Refer to Getting Started with Cloudera Data Engineering on CDP to learn how.

Included in Download Assets is file ingest_CDE.py. This PySpark job will ingest daily logs for machine efficiency, ambient weather conditions and employee data. It will create two (2) Impala databases, HR and FACTORY with its corresponding tables.

NOTE: Before running the job, we need to modify one (1) variable in ingest_CDE.py. Update variable s3BucketName using storage.location.base attribute; defined by your environment.

Let’s run this file using Cloudera Data Engineering command line interface (CLI). 

Run Spark Job:

cde spark submit --user <user> --vcluster-endpoint <virtual cluster endpoint> --conf "spark.pyspark.python=python3" ingest_CDE.py

Check Job Status:

cde run describe --vcluster-endpoint <virtual cluster endpoint> --id #, where # is the job id

Review the Output:

cde run logs --vcluster-endpoint <virtual cluster endpoint> --type "driver/stdout" --id #, where # is the job id

 

NOTE: Replace <virtual cluster endpoint> with JOBS API URL provided in Cluster Details.

 

cde-spark-job-output

 

Data Insights

 

Our task is to determine why revenue for the company has decreased in the current year. For this we’ll use Hue for quick analysis of our data.

 

 

Open Hue

 

Beginning from CDP home page > Data Warehouse:

  1. Click on Overview
  2. Search for your Virtual Warehouse
  3. Click on 
  4. Open Hue

 

virtual-warehouse-open-hue

 

Run Queries for Insights

Query 1 - Factory Performance

 

-- Query 1 - Factory Performance
-- Factory Unit Production
DROP VIEW IF EXISTS factory.v_units_prod;

CREATE VIEW factory.v_units_prod AS
    SELECT factory_id, machine_id, ROUND(AVG(daily_units_produced),0) AS avg_units_produced
      FROM factory.machine_throughput
      GROUP BY factory_id, machine_ID
      ORDER BY factory_id, machine_ID LIMIT 100;

SELECT * FROM factory.v_units_prod;


-- Factory Revenue
DROP VIEW IF EXISTS factory.v_revenue;

CREATE VIEW factory.v_revenue AS
    SELECT factory_id, sum(revenue) AS total_revenue
      FROM (  SELECT factory_id, (avg_units_produced * revenue_per_unit) AS revenue
                FROM factory.v_units_prod mt JOIN factory.machine_revenue mr
                  ON mt.machine_id = mr.machine_id
           ) AS v_rev
    GROUP by factory_id
    ORDER BY factory_id ASC LIMIT 100;

SELECT * FROM factory.v_revenue;

 

hue-query1-factory-performance

 

We have confirmed that there has been a decrease in both revenue and unit production. The culprit is factory 2.

Let’s run some queries to see if we can dig deeper and find some reasons as to why factory 2 could be showing a lower revenue.

 

 

Query 2 - Check Machine Uptime for Factory 2

 

SELECT machine_id, ROUND(AVG(hours_operational),0) AS avg_hours_operational
    FROM factory.machine_uptime
    WHERE factory_id = 2
    GROUP BY machine_id
    ORDER BY machine_id ASC;

 

hue-query2

 

The machine uptime is roughly that same, which tells us we do not have a specific machine issue. Instead, this is a factory wide issue - more exploration is needed.

Since we have HR data, we can look there to see if there's any correlation in employee data that may explain the decreased revenue in factory 2.

 

 

Query 3 - Number of Hours Worked per Factory

SELECT factory_id, AVG(time_worked)
    FROM hr.timesheet
    GROUP BY factory_id
    ORDER BY factory_id ASC;

 

hue-query3

 

Factory 2 has the lowest number of hours worked. Let’s find out why.

 

 

Query 4 - Employee Leave Time Distribution Per Factory

DROP VIEW IF EXISTS hr.v_employee_leave_time;

CREATE VIEW hr.v_employee_leave_time AS
    SELECT factory_id, leave_type, COUNT(leave_type)/COUNT(DISTINCT employee_id) AS avg_time
      FROM hr.leave_time
      GROUP BY factory_id, leave_type
      ORDER BY factory_id ASC, leave_type DESC LIMIT 100;

SELECT * FROM hr.v_employee_leave_time;

 

hue-query4

 

Employees in factory 2 have a higher number of sick days. It seems that unlike the other factories, factory 2 employees take a disproportionate amount of sick time and less vacation time. Since this is unique to factory 2, this seems worthy of reporting as a potential cause in the downturn of revenue.

 

 

Generate Report using Cloudera Data Visualization

Open Viz

 

Beginning from CDP home page > Data Warehouse:

  1. Click on Overview
  2. Search for your Virtual Warehouse
  3. Click on 
  4. Open Viz

 

 

virtual-warehouse-open-viz

 

NOTE: You may need to logon using your CDP workload userid/password.

 

 

Create Dataset(s) from Data

 

Starting from Cloudera Data Visualization home page, select DATA.

  1. Select Default Impala VW connection
  2. Select Datasets tab
  3. Select NEW DATASET

 

Create datasets for the following tables:

Dataset Title: Factory Revenue
Dataset Source: From Table
Select Database: factory
Select Table: v_revenue

 

Dataset Title: Factory Unit Production
Dataset Source: From Table
Select Database: factory
Select Table: v_units_prod

 

Dataset Title: Employee Sick Time
Dataset Source: From Table
Select Database: hr
Select Table: v_employee_leave_time

 

viz-create-dataset

NOTE: You may need to edit the dataset and confirm that the fields are marked correctly (i.e. Dimensions, Measures).

 

Create Visuals

 

  1. Select VISUALS
  2. Select NEW DASHBOARD

 

viz-new-dashboard

 

Let’s use Revenue Downturn Investigation for the title of this visual.

In Dashboard Designer, select Visuals tab. We are going to create three (3) new visuals using data connection Default Impala VW.

 

viz-create-new-visual

 

Create visual for table Factory Revenue:

Title: Revenue Per Factory
Build Tab:

Graph Type: Bars
X Axis: factory_id, (ascending order)
Y Axis: total_revenue
Refresh Visual

Settings > Marks > Label measure values for bars: ON
Settings > Axes > Dynamically size graph to window size: ON

 

viz-visual-factory-revenue

 

Create visual for table Factory Unit Production:

Title: Unit Production Per Factory
Build Tab:

Graph Type: Grouped Bars
X Axis: factory_id, machine_id
Y Axis: avg_units_produced
Colors: machine_id
Refresh Visual

Click Explore Options icon  on the visual and select Settings. Choose Skinny Bars, dynamic sizing.

 

viz-visual-factory-unit-production

 

Create visual for table Employee Sick Time:

Title: Employee Leave Time Distribution Per Factory
Build Tab:

Graph Type: Grouped Bars
X Axis: factory_id(ascending order) Leave_type (descending order)
Y Axis: avg_time
Refresh Visual

 

viz-visual-employee-sick-time

 

We are done creating our dashboard. To save it, click on Save.

It should look like the following:

 

viz-visual-final

 

Summary

 

Congratulations on completing the tutorial.

In this tutorial, you got to experiment with several services that Cloudera Data Platform (CDP) has to offer: 

  • Cloudera Data Warehouse (Impala, Hue and Data Visualization)
  • Cloudera Data Engineering

 

As you have seen, it was easy to analyze datasets and create beautiful reports using Cloudera Data Visualization.

This tutorial covered a very small portion of what Cloudera Data Warehouse (CDW), Cloudera Data Engineering (CDE) and other Cloudera Data Platform (CDP) experiences can do. You are encouraged to continue exploring and broaden your knowledge.

 

Your form submission has failed.

This may have been caused by one of the following:

  • Your request timed out
  • A plugin/browser extension blocked the submission. If you have an ad blocking plugin please disable it and close this message to reload the page.