Cloudera Tutorials

Introduction

This tutorial will walk you through running a simple Apache Spark ETL job using Cloudera Data Engineering on Cloudera on cloud.

Prerequisites

Access to Cloudera on cloud with a running data lake
Basic AWS CLI skills
Appropriate Data Engineering role access:
- DEAdmin: enable Data Engineering and create virtual clusters
- DEUser: access virtual clusters and run jobs

Outline

Watch video
Download assets
Enable Cloudera Data Engineering
Create Data Engineering virtual cluster
Create and schedule a job
Summary
Further reading

Watch video

The video below provides a brief overview of what is covered in this tutorial:

Download assets

Download and unzip the tutorial files; note the location where you extracted them.

Using the AWS CLI, copy the file access-log.txt to your S3 bucket at <storage.location>/tutorial-data/data-engineering, where <storage.location> is your environment’s value for the property storage.location.base. In this example, storage.location.base has the value s3a://usermarketing-cdp-demo. The AWS CLI expects the s3:// scheme rather than s3a://, so the command is:

aws s3 cp access-log.txt s3://usermarketing-cdp-demo/tutorial-data/data-engineering/access-log.txt
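If your storage.location.base differs from the example, the destination for the copy command can be derived mechanically. The following is a minimal sketch, not part of the tutorial assets: the helper name to_aws_cli_uri is hypothetical, and it assumes the only change needed is swapping the s3a:// (or s3n://) scheme for the s3:// scheme that the AWS CLI accepts.

```python
def to_aws_cli_uri(storage_location_base: str, relative_path: str) -> str:
    """Build an s3:// URI for the AWS CLI from a storage.location.base value.

    Hypothetical helper for illustration only.
    """
    base = storage_location_base.rstrip("/")
    # Hadoop-style schemes are rewritten to the scheme the AWS CLI understands.
    for scheme in ("s3a://", "s3n://"):
        if base.startswith(scheme):
            base = "s3://" + base[len(scheme):]
            break
    return f"{base}/{relative_path.lstrip('/')}"


destination = to_aws_cli_uri(
    "s3a://usermarketing-cdp-demo",
    "tutorial-data/data-engineering/access-log.txt",
)
# Print the command to run in a shell; nothing is uploaded by this script.
print(f"aws s3 cp access-log.txt {destination}")
```

Replace the first argument with your own storage.location.base value and run the printed command from the directory where you extracted the tutorial files.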

Enable Cloudera Data Engineering

If you don’t already have the Cloudera Data Engineering service enabled for your environment, enable it now.

Starting from the Cloudera Home Page, select Data Engineering:

Click the icon to enable a new Cloudera Data Engineering service
Provide the environment name: usermarketing
Workload Type: General - Small
Set Auto-Scale Range: Min 1, Max 20

Create Data Engineering virtual cluster

Click the icon to create a cluster
Cluster name: usermarketing-cde-demo
Data Engineering Service: usermarketing
Auto-Scale Range: CPU Max 4, Memory Max 4 GB
Click Create

Create and schedule a job

You can schedule a job to run periodically or run it just once. We will look at both methods.

Click the View Jobs icon.

In the Jobs section, select Create Job to create a job named access-logs-ETL, and fill out the job details:

Name: access-logs-ETL
Upload File: access-logs-ETL.py, from tutorial files provided
Select Python 3
Turn off Schedule
Create and Run

Let’s look at the output the job generated. In the Job Runs section, select the Run ID for the job you are interested in. In this case, select Run ID 11, which is associated with the job access-logs-ETL.

Here you can see all of the output from the Spark job that has just run. The job prints some user-friendly segments of the data being processed so that the data engineer can validate that the process is working correctly.

Select Logs > stdout
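To give a sense of the kind of per-record parsing an access-log ETL job performs before printing sample rows, here is a minimal sketch in plain Python. It is not the contents of access-logs-ETL.py, which runs on Spark and is not reproduced here. The log layout (Common Log Format), the pattern, and the name parse_line are all assumptions made for illustration; the actual format of access-log.txt may differ.

```python
import re
from typing import Optional

# Assumed layout (Common Log Format), for example:
# 127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line: str) -> Optional[dict]:
    """Extract fields from one log line, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Convert numeric fields; a size of "-" means no response body was sent.
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


sample = '127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
# Printing a parsed sample mirrors the validation output you would look for in stdout.
print(parse_line(sample))
```

Lines that do not match the assumed pattern return None, so malformed records can be counted or filtered out rather than failing the whole run.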

Let’s take a deeper look at the different stages of the job. You can see that the Spark job has been split into multiple stages. You can zoom into each stage to get its utilization details. These details help the data engineer validate that the job is working correctly and using the right amount of resources.

You are encouraged to explore all the stages of the job.

Select Analysis.

Once you are satisfied with the application and its output, you can schedule it to run periodically based on a time interval.

In the Jobs section, select the icon next to the job you’d like to schedule.

Select Add Schedule.

Select Edit
Set schedule: Every hour at 0 minute(s) past the hour
Add Schedule
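The schedule "Every hour at 0 minute(s) past the hour" corresponds to the standard five-field cron expression 0 * * * *. As a rough sketch of what that means in practice, the hypothetical helper below computes the next run time after a given moment; it is an illustration of the hourly cadence only, not how Cloudera Data Engineering computes schedules internally.

```python
from datetime import datetime, timedelta

# Standard cron notation for "minute 0 of every hour".
HOURLY_CRON = "0 * * * *"


def next_hourly_run(now: datetime) -> datetime:
    """Return the first top-of-the-hour instant strictly after `now`."""
    top_of_hour = now.replace(minute=0, second=0, microsecond=0)
    return top_of_hour + timedelta(hours=1)


# A job scheduled this way and checked at 10:15 would next run at 11:00.
print(next_hourly_run(datetime(2024, 1, 1, 10, 15)))
```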

Summary

Congratulations on completing the tutorial.

As you've now experienced, Cloudera Data Engineering provides an easy way for developers to run workloads and to schedule them to run periodically.