Cloudera Tutorials

Introduction

Data Lifecycle - data enrichment. This tutorial will walk you through running a simple PySpark job to enrich your data using an existing data warehouse. We will use Cloudera Data Engineering on Cloudera on cloud.

Prerequisites

Have access to Cloudera on cloud
Have access to a virtual warehouse for your environment. If you need to create one, refer to From 0 to Query with Cloudera Data Warehouse.
Have created a Cloudera workload User
Ensure proper Data Engineering role access
- DEAdmin: enable Data Engineering and create virtual clusters
- DEUser: access virtual cluster and run jobs
Basic AWS CLI skills

Outline

Watch video
Download assets
Set up Cloudera Data Engineering
Create and run jobs
Review job output
Summary
Further reading

Watch video

The video below provides a brief overview of what is covered in this tutorial:

Download assets

There are two options for getting the assets for this tutorial:

Download a ZIP file

It contains only the files needed for this tutorial. Unzip tutorial-files.zip and remember its location.

Clone our GitHub repository

It provides assets used in this and other tutorials, organized by tutorial title.

Using the AWS CLI, copy the following files to the S3 bucket defined by your environment’s storage.location.base attribute:

car_installs.csv
car_sales.csv
customer_data.csv
experimental_motors.csv
postal_codes.csv

Note: You may need to ask your environment's administrator for the value of the storage.location.base property.

For example, if storage.location.base has the value s3a://usermarketing-cdp-demo, copy the files using the command below. The AWS CLI expects the s3:// scheme, so the s3a:// prefix from the property is written as s3:// in the command:

aws s3 cp . s3://usermarketing-cdp-demo --recursive --exclude "*" --include "*.csv"
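If you script this step, the translation from the storage.location.base value to the AWS CLI command can be sketched as follows. This is a minimal illustration, not part of the tutorial assets: the function name build_upload_command is hypothetical, and it only assembles the same command shown above without running it.

```python
import shlex
from urllib.parse import urlparse


def build_upload_command(storage_location_base, local_dir="."):
    """Build the `aws s3 cp` argument list for uploading the CSV files.

    Hypothetical helper: storage_location_base is the value of the
    environment's storage.location.base property (s3a:// or s3://).
    """
    parsed = urlparse(storage_location_base)
    if parsed.scheme not in ("s3", "s3a"):
        raise ValueError(
            f"expected an s3:// or s3a:// URI, got {storage_location_base!r}"
        )
    # The AWS CLI uses s3://, while the environment property uses s3a://.
    target = f"s3://{parsed.netloc}{parsed.path}".rstrip("/")
    return [
        "aws", "s3", "cp", local_dir, target,
        "--recursive", "--exclude", "*", "--include", "*.csv",
    ]


cmd = build_upload_command("s3a://usermarketing-cdp-demo")
# shlex.join quotes the wildcard arguments so the shell does not expand them.
print(shlex.join(cmd))
```

Passing the list to subprocess.run(cmd, check=True) would execute the upload, assuming the AWS CLI is installed and configured with credentials for the bucket.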

Set up Cloudera Data Engineering

Enable Cloudera Data Engineering service

If you don’t already have a Data Engineering service enabled for your environment, let’s enable one.

Starting from Cloudera platform Home Page, select Data Engineering:

Click the button to enable a new Cloudera Data Engineering service
Name: data-engineering
Environment: <your environment name>
Workload Type: General - Small
Make other changes (optional)
Enable

Create Data Engineering virtual cluster

If you don’t already have a Data Engineering virtual cluster created, let’s create one.

Starting from Cloudera Data Engineering > Overview:

Click the button to create a cluster
Cluster Name: data-engineering
CDE Service: <your environment name>
Autoscale Max Capacity: CPU: 4, Memory: 4 GB
Create

Create and run jobs

We will be using the GUI to run our jobs. If you would like to use the CLI, take a look at Using CLI-API to Automate Access to Cloudera Data Engineering.

In your virtual cluster, open the jobs view.

We will create and run two jobs:

Pre-SetupDW

As a prerequisite, this PySpark job creates a data warehouse with mock sales, factory and customer data.

IMPORTANT: Before running the job, you need to modify one variable in Pre-SetupDW.py. Set the s3BucketName variable to the value of the storage.location.base attribute defined by your environment.
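The contents of Pre-SetupDW.py are not reproduced in this tutorial, so the following is only a sketch of what the edit might look like, assuming the script builds its input paths from s3BucketName and that the CSV files sit at the root of the bucket as in the upload step. The csv_path helper and the CSV_FILES list are hypothetical names introduced for illustration.

```python
# Set this to your environment's storage.location.base value.
# The value below is the example used earlier in this tutorial.
s3BucketName = "s3a://usermarketing-cdp-demo"

# The CSV files uploaded in the "Download assets" step.
CSV_FILES = [
    "car_installs.csv",
    "car_sales.csv",
    "customer_data.csv",
    "experimental_motors.csv",
    "postal_codes.csv",
]


def csv_path(bucket, filename):
    """Hypothetical helper: join the bucket URI and a file name."""
    return f"{bucket.rstrip('/')}/{filename}"


# Print the full URI each input file would be read from.
for name in CSV_FILES:
    print(csv_path(s3BucketName, name))
```

Printing the resulting paths before running the job is a quick way to confirm that the variable matches where the files were actually uploaded.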

EnrichData_ETL

This PySpark job brings in data from Cloudera Data Warehouse (CDW), filters out non-representative data, and then joins the sales, factory, and customer data to create a new enriched table, which it stores back in CDW.
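EnrichData_ETL.py itself is not reproduced here. To make the filter-then-join shape concrete, here is a minimal sketch in plain Python on in-memory records. Everything specific in it is an assumption: the field names (customer_id, model, sale_price, factory_no), the filter rule, and the sample rows are made up for illustration, and the real job operates on Spark tables rather than Python lists.

```python
def enrich(sales, customers, factories, min_price=1.0):
    """Filter out non-representative sales, then join customer and factory data.

    All field names and the filter rule are hypothetical.
    """
    # Build lookup tables keyed on the join columns.
    customers_by_id = {c["customer_id"]: c for c in customers}
    factories_by_model = {f["model"]: f for f in factories}

    enriched = []
    for sale in sales:
        # Assumed filter rule: drop rows with a missing or implausibly low price.
        price = sale.get("sale_price")
        if price is None or price < min_price:
            continue
        customer = customers_by_id.get(sale["customer_id"])
        factory = factories_by_model.get(sale["model"])
        # Inner-join semantics: keep only sales that match both lookups.
        if customer is None or factory is None:
            continue
        enriched.append({**sale, **customer, **factory})
    return enriched


# Made-up sample rows, for illustration only.
sales = [
    {"customer_id": 1, "model": "A", "sale_price": 25000.0},
    {"customer_id": 2, "model": "B", "sale_price": 0.0},      # dropped by the filter
    {"customer_id": 3, "model": "A", "sale_price": 31000.0},  # no matching customer
]
customers = [{"customer_id": 1, "name": "Ada"}, {"customer_id": 2, "name": "Bo"}]
factories = [{"model": "A", "factory_no": 7}, {"model": "B", "factory_no": 9}]

rows = enrich(sales, customers, factories)
print(rows)
```

Filtering before joining keeps the joined result small, which is the same reason a Spark job would typically apply its filters ahead of the joins.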

In the Jobs section, select Create Job to create a new job:

Name: Pre-SetupDW
Upload File: Pre-SetupDW.py (provided in download assets)
Select Python 3
Turn off Schedule
Create and Run

Give this job a minute to complete, then create the next job:

Name: EnrichData_ETL
Upload File: EnrichData_ETL.py (provided in download assets)
Select Python 3
Turn off Schedule
Create and Run

Review job output

Let’s take a look at the job output generated.

First, let’s take a look at the output for Pre-SetupDW:

Select Job Runs tab.

Select the Run ID number for your Job name
Select Logs
Select stdout

The results should look like this:

Next, let’s take a look at the job output for EnrichData_ETL:

Select Job Runs tab.

Select the Run ID number for your Job name
Select Logs
Select stdout

The results should look like this:

Summary

Congratulations on completing the tutorial.

As you've now experienced, Cloudera Data Engineering Experience provides an easy way for developers to run workloads.