Cloudera Tutorials



This tutorial will walk you through running a simple Apache Spark ETL job using Cloudera Data Engineering (CDE) on Cloudera Data Platform - Public Cloud (CDP-PC).





Prerequisites

  • Access to Cloudera Data Platform (CDP) Public Cloud with a Data Lake running
  • Basic AWS CLI skills
  • Proper CDE role access:
    • DEAdmin: enable CDE and create virtual clusters
    • DEUser: access virtual clusters and run jobs



Watch Video


The video below provides a brief overview of what is covered in this tutorial:



Download Assets


Download and unzip the tutorial files, and remember the location where you extracted them.

Using the AWS CLI, copy the file access-log.txt to your S3 bucket at s3a://<storage.location>/tutorial-data/data-engineering, where <storage.location> is your environment’s value for the storage.location.base property. Note that the AWS CLI itself uses the s3:// scheme rather than s3a://. In this example, storage.location.base has the value s3a://usermarketing-cdp-demo, so the command is:


aws s3 cp access-log.txt s3://usermarketing-cdp-demo/tutorial-data/data-engineering/access-log.txt
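CDP reports the storage location with the Hadoop s3a:// scheme, while the AWS CLI expects s3://. A quick sketch (plain Python, illustrative only) of deriving the CLI destination from your storage.location.base value:

```python
# Derive the AWS CLI destination from the CDP storage.location.base value.
# The base value below matches the tutorial example; substitute your own.
base = "s3a://usermarketing-cdp-demo"

# The AWS CLI uses the s3:// scheme, not Hadoop's s3a://.
dest = base.replace("s3a://", "s3://", 1) + "/tutorial-data/data-engineering/access-log.txt"
print(dest)  # s3://usermarketing-cdp-demo/tutorial-data/data-engineering/access-log.txt
```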




Enable Cloudera Data Engineering (CDE)


If you don’t already have the Cloudera Data Engineering (CDE) service enabled for your environment, enable it now.

Starting from Cloudera Data Platform (CDP) Home Page, select Data Engineering:




  1. Click the plus icon to enable a new Cloudera Data Engineering (CDE) service
  2. Provide the environment name: usermarketing
  3. Workload Type: General - Small
  4. Set Auto-Scale Range: Min 1, Max 20




Create Data Engineering Virtual Cluster


  1. Click the plus icon to create a cluster
  2. Cluster name: usermarketing-cde-demo
  3. CDE Service: usermarketing
  4. Auto-Scale Range: CPU Max 4, Memory Max 4 GB
  5. Create




Create and Schedule a Job


You can schedule a job to run periodically or run it just once; we will look at both methods.

Click the View Jobs icon.




In the Jobs section, select Create Job and fill out the details for the access-logs-ETL job:

  1. Name: access-logs-ETL
  2. Upload File: access-logs-ETL.py, from the tutorial files provided
  3. Select Python 3
  4. Turn off Schedule
  5. Create and Run
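For orientation, an ETL job over Apache access logs typically begins by parsing each raw line into structured fields with a regular expression. The pattern and field names below are illustrative assumptions, not the actual contents of the provided access-logs-ETL.py:

```python
import re
from typing import Optional

# Common Log Format: host, identity, user, timestamp, request, status, size.
# This pattern and the field names are illustrative assumptions.
LOG_PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)'
)

def parse_line(line: str) -> Optional[dict]:
    """Parse one access-log line into a dict, or None if it doesn't match."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    host, ident, user, ts, request, status, size = m.groups()
    return {
        "host": host,
        "timestamp": ts,
        "request": request,
        "status": int(status),
        "bytes": 0 if size == "-" else int(size),
    }

sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
print(parse_line(sample)["status"])  # 200
```

In a Spark job, the same parsing logic would be applied across the whole dataset, for example by reading the file as text and mapping the function over each row.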




Let’s take a look at the job output generated. In the Job Runs section, select the Run ID for the Job you are interested in. In this case, let’s select Run ID 11 associated with Job access-logs-ETL.




Here you can see all of the output from the Spark job that just ran. This Spark job prints user-friendly segments of the data being processed so the data engineer can validate that the process is working correctly.

Select Logs > stdout




Let’s take a deeper look at the different stages of the job. The Spark job has been split into multiple stages, and you can zoom into each stage to see its utilization details. These details help the data engineer validate that the job is working correctly and using the right amount of resources.

You are encouraged to explore all the stages of the job.

Select Analysis.




Once you are satisfied with the application and its output, you can schedule it to run periodically at a set time interval.

In the Jobs section, select the three-dot icon next to the job you’d like to schedule.

Select Add Schedule.




  1. Select Edit
  2. Set schedule: Every hour at 0 minute(s) past the hour 
  3. Add Schedule
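The schedule above fires at minute 0 of every hour, which corresponds to the cron expression `0 * * * *`. As a quick sanity check of when the next run lands, assuming that cadence:

```python
from datetime import datetime, timedelta

def next_hourly_run(now: datetime) -> datetime:
    """Next run for an 'every hour at 0 minutes' schedule (cron: 0 * * * *)."""
    top_of_hour = now.replace(minute=0, second=0, microsecond=0)
    return top_of_hour + timedelta(hours=1)

print(next_hourly_run(datetime(2024, 1, 1, 10, 30)))  # 2024-01-01 11:00:00
```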






Congratulations on completing the tutorial.

As you've now experienced, Cloudera Data Engineering (CDE) provides an easy way for developers to run workloads and to schedule them to run periodically.



