Cloudera Tutorials





Data Lifecycle - data enrichment. This tutorial will walk you through running a simple PySpark job to enrich your data using an existing data warehouse. We will use Cloudera Data Engineering (CDE) on Cloudera Data Platform - Public Cloud (CDP-PC).





Prerequisites


  • Have access to Cloudera Data Platform (CDP) Public Cloud
  • Have access to a virtual warehouse for your environment. If you need to create one, refer to From 0 to Query with Cloudera Data Warehouse
  • Have created a CDP workload user
  • Ensure proper CDE role access
    • DEAdmin: enable CDE and create virtual clusters
    • DEUser: access virtual cluster and run jobs
  • Basic AWS CLI skills






Download Assets


There are two options for getting the assets for this tutorial:

  1. Download a ZIP file

It contains only the files needed for this tutorial. Unzip tutorial-files.zip and remember its location.

  2. Clone our GitHub repository

It provides assets used in this and other tutorials, organized by tutorial title.


Using the AWS CLI, copy the CSV files from the downloaded assets to the S3 bucket defined by your environment’s storage.location.base attribute.


Note: You may need to ask your environment's administrator for the value of the storage.location.base property.


For example, if storage.location.base has the value s3a://usermarketing-cdp-demo, run the following command from the directory containing the assets. The AWS CLI expects the s3:// scheme rather than s3a://:

aws s3 cp . s3://usermarketing-cdp-demo --recursive --exclude "*" --include "*.csv"
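
If you would rather derive that command from your own storage.location.base value than edit it by hand, the following is a minimal sketch. The helper name build_copy_command is hypothetical and not part of the tutorial assets; it assumes the attribute value is an s3a:// or s3:// URI and only builds the command string, it does not run it.

```python
import shlex
from urllib.parse import urlparse


def build_copy_command(storage_location_base: str, local_dir: str = ".") -> str:
    """Build an `aws s3 cp` command string for a storage.location.base value.

    Hypothetical helper for illustration; it does not execute anything.
    """
    parsed = urlparse(storage_location_base)
    if parsed.scheme not in ("s3", "s3a") or not parsed.netloc:
        raise ValueError(
            f"expected an s3:// or s3a:// URI, got {storage_location_base!r}"
        )
    # The AWS CLI only understands s3://, so an s3a:// value is rewritten.
    target = f"s3://{parsed.netloc}{parsed.path}".rstrip("/")
    args = [
        "aws", "s3", "cp", local_dir, target,
        "--recursive", "--exclude", "*", "--include", "*.csv",
    ]
    # Quote each argument so the wildcards are not expanded by the shell.
    return " ".join(shlex.quote(arg) for arg in args)
```

Calling build_copy_command("s3a://usermarketing-cdp-demo") yields the same command as above, with the wildcard patterns in single quotes instead of double quotes, which the shell treats equivalently here.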




Set Up Cloudera Data Engineering (CDE)


Enable CDE Service


If you don’t already have the Cloudera Data Engineering (CDE) service enabled for your environment, enable it now.

Starting from Cloudera Data Platform (CDP) Home Page, select Data Engineering:




  1. Click on the icon to enable a new Cloudera Data Engineering (CDE) service
  2. Name: data-engineering
  3. Environment: <your environment name>
  4. Workload Type: General - Small
  5. Make other changes (optional)
  6. Enable




Create Data Engineering Virtual Cluster


If you don’t already have a CDE virtual cluster created, let’s create one.

Starting from Cloudera Data Engineering > Overview:

  1. Click on the icon to create a cluster
  2. Cluster Name: data-engineering
  3. CDE Service: <your environment name>
  4. Autoscale Max Capacity: CPU: 4, Memory: 4 GB
  5. Create



Create and Run Jobs


We will be using the GUI to run our jobs. If you would like to use the CLI, take a look at Using CLI-API to Automate Access to Cloudera Data Engineering.

In your virtual cluster, view jobs by selecting the jobs icon.




We will create and run two (2) jobs:

  • Pre-SetupDW

As a prerequisite, this PySpark job creates a data warehouse with mock sales, factory, and customer data.

IMPORTANT: Before running the job, you need to modify one variable in Pre-SetupDW.py. Set the variable s3BucketName to the value of your environment's storage.location.base attribute.

  • EnrichData_ETL

This job brings in data from Cloudera Data Warehouse (CDW), filters out non-representative data, and then joins the sales, factory, and customer data to create a new enriched table, which it stores back in CDW.
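
As a rough illustration of the s3BucketName change required for Pre-SetupDW, the edit might look like the sketch below. Only the variable name comes from this tutorial; the surrounding contents of Pre-SetupDW.py are not reproduced here, the value shown is the example bucket used earlier, and the check function looks_like_storage_base is a hypothetical addition to catch a missing scheme before the job is uploaded.

```python
# Assumption: the value is the example storage.location.base from earlier in
# this tutorial. Replace it with the value for your own environment.
s3BucketName = "s3a://usermarketing-cdp-demo"


def looks_like_storage_base(value: str) -> bool:
    """Return True if value resembles an s3a:// or s3:// bucket URI.

    Hypothetical sanity check, not part of the tutorial assets.
    """
    for scheme in ("s3a://", "s3://"):
        if value.startswith(scheme) and len(value) > len(scheme):
            return True
    return False


if not looks_like_storage_base(s3BucketName):
    raise ValueError(f"s3BucketName does not look like a bucket URI: {s3BucketName!r}")
```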


In the Jobs section, select Create Job to create a new job:

  1. Name: Pre-SetupDW
  2. Upload File: Pre-SetupDW.py (provided in download assets)
  3. Select Python 3
  4. Turn off Schedule
  5. Create and Run


Give this job a minute to complete, then create the next job:

  1. Name: EnrichData_ETL
  2. Upload File: EnrichData_ETL.py (provided in download assets)
  3. Select Python 3
  4. Turn off Schedule
  5. Create and Run
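
While the job runs, it may help to see the filter-then-join pattern that EnrichData_ETL is described as performing. The sketch below is not the contents of EnrichData_ETL.py. The column names (quantity, factory_id, customer_id), the filter condition, and the function names are all assumptions made for illustration. The functions only use standard PySpark DataFrame methods (filter, join, write.mode, saveAsTable), so they expect DataFrames supplied by the caller.

```python
def enrich(sales, factory, customer, min_quantity=1):
    """Filter the sales DataFrame, then join factory and customer data onto it.

    All column names here are hypothetical; the real job may use others.
    """
    # Assumed stand-in for "non-representative data": rows below a minimum
    # quantity. filter() accepts a SQL expression string.
    representative = sales.filter(f"quantity >= {int(min_quantity)}")
    # Inner joins keep only sales rows with a matching factory and customer.
    return (
        representative
        .join(factory, on="factory_id", how="inner")
        .join(customer, on="customer_id", how="inner")
    )


def save(enriched, table_name):
    """Store the enriched DataFrame as a table, replacing any existing one."""
    enriched.write.mode("overwrite").saveAsTable(table_name)
```

In a job, the three inputs would typically come from spark.table(...) calls on an existing SparkSession, with table names matching whatever Pre-SetupDW created in your warehouse.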




Review Job Output


Let’s take a look at the job output generated.


First, let’s take a look at the output for Pre-SetupDW:

Select the Job Runs tab.

  1. Select the Run ID number for your Job name
  2. Select Logs
  3. Select stdout


The results should look like this:




Next, let’s take a look at the job output for EnrichData_ETL:

Select the Job Runs tab.

  1. Select the Run ID number for your Job name
  2. Select Logs
  3. Select stdout


The results should look like this:






Congratulations on completing the tutorial.

As you've now experienced, Cloudera Data Engineering Experience (CDE) provides an easy way for developers to run workloads.

