Data Lifecycle - data enrichment. This tutorial will walk you through running a simple PySpark job to enrich your data using an existing data warehouse. We will use Cloudera Data Engineering (CDE) on Cloudera on cloud.
There are two (2) options for getting the assets used in this tutorial:
Option 1 contains only the files used in this tutorial. Unzip tutorial-files.zip and remember its location.
Option 2 provides assets used in this and other tutorials, organized by tutorial title.
Using the AWS CLI, copy the following files to the S3 bucket defined by your environment's storage.location.base attribute:
car_installs.csv
car_sales.csv
customer_data.csv
experimental_motors.csv
postal_codes.csv
Note: You may need to ask your environment's administrator for the value of the storage.location.base property.
For example, if storage.location.base has the value s3a://usermarketing-cdp-demo, copy the files using the command:
aws s3 cp . s3://usermarketing-cdp-demo --recursive --exclude "*" --include "*.csv"
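Note that the bucket URI in the command above is the storage.location.base value with its s3a:// scheme replaced by s3://, which is what the AWS CLI expects. A minimal sketch of that conversion (the helper name is ours, not part of the tutorial):

```python
def s3_cli_uri(storage_location_base: str) -> str:
    """Convert a storage.location.base value (s3a://...) to the s3:// form
    used by the AWS CLI. Returns the input unchanged if it has no s3a:// prefix."""
    if storage_location_base.startswith("s3a://"):
        return "s3://" + storage_location_base[len("s3a://"):]
    return storage_location_base

# Build the copy command for the example bucket above.
bucket = s3_cli_uri("s3a://usermarketing-cdp-demo")
cmd = f'aws s3 cp . {bucket} --recursive --exclude "*" --include "*.csv"'
print(cmd)
```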
If you don’t already have a CDE virtual cluster created, let’s create one.
Starting from Cloudera Data Engineering > Overview:
Enter data-engineering as the Virtual Cluster Name and set CPU: 4, Memory: 4 GB.
We will be using the GUI to run our jobs. If you would like to use the CLI, take a look at Using CLI-API to Automate Access to Cloudera Data Engineering.
In your virtual cluster, open the jobs view.
We will create and run two (2) jobs:
As a prerequisite, this PySpark job creates a data warehouse with mock sales, factory, and customer data.
IMPORTANT: Before running the job, you need to modify one (1) variable in Pre-SetupDW.py. Update the s3BucketName variable definition to use the storage.location.base attribute defined by your environment.
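For instance, if your environment's storage.location.base is s3a://usermarketing-cdp-demo, the edit would look like the sketch below (the surrounding code in Pre-SetupDW.py varies; the path layout here is illustrative only):

```python
# Point s3BucketName at your environment's storage.location.base
# (example value shown -- replace with your own).
s3BucketName = "s3a://usermarketing-cdp-demo"

# The job can then read the uploaded CSVs from paths built off the bucket.
car_sales_path = s3BucketName + "/car_sales.csv"
print(car_sales_path)
```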
Bring in data from Cloudera Data Warehouse (CDW), filter out non-representative data, then join the sales, factory, and customer data together to create a new enriched table, and store it back in CDW.
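Conceptually, the enrichment is a filter followed by a join on a shared key. The same logic, sketched with plain Python rows for illustration (the column names, the join key, and the filter condition are assumptions, not the tutorial's exact schema):

```python
# Toy rows standing in for the CDW tables (schemas are illustrative).
car_sales = [
    {"customer_id": 1, "model": "Model D", "saleprice": 30000},
    {"customer_id": 2, "model": "Model D", "saleprice": -1},  # non-representative row
]
customer_data = [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 2, "name": "Grace"},
]

# Filter out non-representative data (here: non-positive sale prices),
# then join sales rows to customer rows on customer_id.
valid_sales = [row for row in car_sales if row["saleprice"] > 0]
customers = {c["customer_id"]: c for c in customer_data}
enriched = [{**sale, **customers[sale["customer_id"]]} for sale in valid_sales]
print(enriched)
```

In the actual PySpark job, the same steps map onto DataFrame `filter` and `join` operations, with the result written back to CDW as a new table.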
In the Jobs section, select Create Job to create a new job:
Pre-SetupDW
Give the job a minute to complete, then create the next job:
EnrichData_ETL
Next, let’s take a look at the job output for EnrichData_ETL:
Select Job Runs tab.
The results should look like this: