Cloudera Tutorials





Learn how to interact with Cloudera Data Engineering (CDE) on Cloudera Data Platform - Public Cloud (CDP-PC) using the command line interface (CLI) and RESTful APIs.









Download Assets


You have two (2) options to get the assets needed for this tutorial:

  1. Download a ZIP file

It contains only the files needed for this tutorial. Unzip tutorial-files.zip and remember its location.

  2. Clone our GitHub repository

It provides the assets used in this and other tutorials, organized by tutorial title.


Create folder: $HOME/.cde

mkdir -p $HOME/.cde

Move config.yaml to .cde folder:

mv config.yaml $HOME/.cde/


Using the AWS CLI, copy the data files PPP-Over-150k-ALL.csv and PPP-Sub-150k-TX.csv to the S3 location s3a://<storage.location>/tutorial-data/data-engineering, where <storage.location> is your environment's value for the property storage.location.base.

Note: You may need to ask your environment's administrator for the value of storage.location.base.


In this example, the property storage.location.base has the value s3a://usermarketing-cdp-demo. The AWS CLI expects the s3:// scheme rather than s3a://, so the commands are:

aws s3 cp PPP-Over-150k-ALL.csv s3://usermarketing-cdp-demo/tutorial-data/data-engineering/PPP-Over-150k-ALL.csv


aws s3 cp PPP-Sub-150k-TX.csv s3://usermarketing-cdp-demo/tutorial-data/data-engineering/PPP-Sub-150k-TX.csv
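The s3a:// prefix in storage.location.base and the s3:// prefix used by the AWS CLI refer to the same bucket. As a minimal sketch, the helper below derives the copy destination from the property value. The function name s3_dest is hypothetical and is not part of the AWS CLI or any Cloudera tool.

```shell
# Hypothetical helper: build the AWS CLI destination for a tutorial data file.
# $1 = value of storage.location.base (e.g. s3a://usermarketing-cdp-demo)
# $2 = data file name
s3_dest() {
  local base="${1%/}"      # drop a trailing slash, if any
  base="${base#s3a://}"    # strip the s3a:// scheme
  base="${base#s3://}"     # also accept a value that already uses s3://
  printf 's3://%s/tutorial-data/data-engineering/%s\n' "$base" "$2"
}

# Possible usage:
#   aws s3 cp PPP-Sub-150k-TX.csv "$(s3_dest s3a://usermarketing-cdp-demo PPP-Sub-150k-TX.csv)"
```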


Note: The datasets are publicly available from the U.S. Department of the Treasury - Paycheck Protection Program (PPP).




Setup CLI


Download CLI Tool:

From the CDP Home page, select Data Engineering



Navigate to Environments (usermarketing) > Virtual Cluster (usermarketing-cde-demo) > Cluster Details




Download CLI TOOL based on your operating system.



Note: The CDE client must be stored where it can be found via $PATH (e.g., /usr/bin) and must have execute privileges.
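Both conditions in the note can be checked before going further. The sketch below is one way to do so; check_cli is a hypothetical helper name, and the default command name cde assumes the downloaded client was saved under that name.

```shell
# Hypothetical helper: confirm a command is on $PATH and executable.
# $1 = command name (defaults to "cde", assuming the client was saved as "cde")
check_cli() {
  local name="${1:-cde}" path
  # command -v prints the resolved path of an external command
  path="$(command -v "$name")" || {
    echo "$name: not found on \$PATH" >&2
    return 1
  }
  [ -x "$path" ] || {
    echo "$path: missing execute permission (try chmod +x)" >&2
    return 1
  }
  echo "$path"
}

# Possible usage:
#   check_cli cde
```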



Configure CLI Client


Modify config.yaml, received in Download Assets, as follows:

  1. Replace <cdp-workload-user> with your CDP workload user name
  2. Replace <CDE-virtual-cluster-endpoint> with JOBS API URL provided in Cluster Details
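The two replacements above can also be scripted instead of made by hand. The following is a minimal sketch using sed; fill_cde_config is a hypothetical helper, and it assumes that neither value contains the characters | or &, which sed would treat specially here.

```shell
# Hypothetical helper: print config.yaml with both placeholders filled in.
# $1 = path to config.yaml
# $2 = CDP workload user name
# $3 = JOBS API URL from Cluster Details
fill_cde_config() {
  # "|" is used as the sed delimiter because the URL contains "/"
  sed -e "s|<cdp-workload-user>|$2|" \
      -e "s|<CDE-virtual-cluster-endpoint>|$3|" \
      "$1"
}

# Possible usage (writes the result to the location the CLI reads from):
#   fill_cde_config config.yaml "<your user>" "<your JOBS API URL>" > "$HOME/.cde/config.yaml"
```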



The edited file should look something like:



Run Jobs using CLI


You can run a job immediately (ad hoc), which is useful for testing your application. Alternatively, you can define a resource, which stores a collection of Python files or applications required for a job; this approach is better suited to jobs that run periodically.


Run a Job (ad-hoc)


Run Spark Job:

cde spark submit --conf "spark.pyspark.python=python3" Data_Extraction_Sub_150k.py

Check Job Status:

cde run describe --id #

where # is the job run ID.

Review the Output:

cde run logs --type "driver/stdout" --id #

where # is the job run ID.
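Checking the status by hand means re-running cde run describe until the run finishes. A minimal sketch of automating that check follows. It rests on two assumptions that should be verified against your CLI version: that cde run describe prints JSON with a top-level status field, and that the final states are named succeeded, failed and killed. The helper names run_status and wait_for_run are hypothetical.

```shell
# Assumption: `cde run describe` prints JSON with a top-level "status" field.
run_status() {
  cde run describe --id "$1" | jq -r '.status'
}

# Poll a run until it reaches a final state, then print that state.
# $1 = run id, $2 = seconds between polls (default 30), $3 = max polls (default 120)
# Assumption: final states are "succeeded", "failed" and "killed".
wait_for_run() {
  local run_id="$1" interval="${2:-30}" max_polls="${3:-120}" polls=0 status
  while [ "$polls" -lt "$max_polls" ]; do
    status="$(run_status "$run_id")"
    case "$status" in
      succeeded) echo "$status"; return 0 ;;
      failed|killed) echo "$status"; return 1 ;;
    esac
    polls=$((polls + 1))
    sleep "$interval"
  done
  echo "gave up after $max_polls polls" >&2
  return 2
}

# Possible usage:
#   wait_for_run 4 && cde run logs --type "driver/stdout" --id 4
```

The poll limit keeps the loop from running forever if the status never matches one of the assumed names.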




Create a Resource


Create resource:

cde resource create --name "cde_ETL"

Upload file(s) to resource:

cde resource upload --local-path "Data_Extraction*.py" --name "cde_ETL"

Verify resource:

cde resource describe --name "cde_ETL"




Create and Schedule Jobs


Let's schedule two (2) jobs. Both jobs depend on an Apache Hive table, so we'll stagger them: one runs on the hour and the other at fifteen minutes past the hour.

Schedule the jobs:

cde job create --name "Over_150K_ETL" \
          --type spark \
          --conf "spark.pyspark.python=python3" \
          --application-file "Data_Extraction_Over_150k.py" \
          --cron-expression "0 */1 * * *" \
          --schedule-enabled "true" \
          --schedule-start "2020-08-18" \
          --schedule-end "2021-08-18" \
          --mount-1-resource "cde_ETL"
cde job create --name "Sub_150K_ETL" \
          --type spark \
          --conf "spark.pyspark.python=python3" \
          --application-file "Data_Extraction_Sub_150k.py" \
          --cron-expression "15 */1 * * *" \
          --schedule-enabled "true" \
          --schedule-start "2020-08-18" \
          --schedule-end "2021-08-18" \
          --mount-1-resource "cde_ETL"
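The cron expressions above have five fields: minute, hour, day of month, month and day of week. The two jobs differ only in the minute field (0 and 15), while */1 in the hour field means every hour. As a small sketch, the hypothetical helper below builds such an expression for any minute offset, which may help when staggering further jobs.

```shell
# Hypothetical helper: cron expression that fires once an hour at a given minute.
# $1 = minute of the hour (0-59)
hourly_cron() {
  case "$1" in
    ''|*[!0-9]*)
      echo "minute must be a number from 0 to 59" >&2
      return 1
      ;;
  esac
  [ "$1" -le 59 ] || {
    echo "minute must be a number from 0 to 59" >&2
    return 1
  }
  # Fields: minute, hour (every hour), day of month, month, day of week
  printf '%s */1 * * *\n' "$1"
}

# Possible usage:
#   --cron-expression "$(hourly_cron 15)"
```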


Confirm scheduling:

cde job list --filter 'name[like]%ETL%'




View Job Runs:

cde run list --filter 'job[like]%ETL%'




Review the Output:

cde run logs --type "driver/stdout" --id #

where # is the job run ID.






Cloudera Data Engineering uses JSON Web Tokens (JWT) for API authentication. To interact with a virtual cluster using the API, you must obtain and define the access token for that cluster.

We will define two (2) convenient environment variables:



export CDE_JOB_URL="<jobs_api_url>"

<jobs_api_url> is the JOBS API URL found in Cluster Details


The access token, CDE_TOKEN, is requested from a URL built from the hostname of the <grafana_charts> link plus a fixed path, and is then parsed out of the JSON response. The following command does all of this in one step. Refer to Getting a Cloudera Data Engineering API access token for details.

export CDE_TOKEN=$(curl -u <workload_user> $(echo '<grafana_charts>' | cut -d'/' -f1-3 | awk '{print $1"/gateway/authtkn/knoxtoken/api/v1/token"}') | jq -r '.access_token')


<workload_user> is your CDP workload user name
<grafana_charts> is a link found in Cluster Details


NOTE: When the token expires, just re-run this command.
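To make the one-line token command easier to follow, the sketch below isolates the part that builds the token URL. It uses the same cut and awk steps as the command above; only the function name token_url is hypothetical.

```shell
# Derive the token endpoint from the Grafana charts link.
# $1 = Grafana charts URL copied from Cluster Details
token_url() {
  # Splitting a URL on "/" gives "https:", "" and the hostname as fields 1-3,
  # so cut keeps only "https://<hostname>"; awk then appends the fixed path.
  printf '%s\n' "$1" | cut -d'/' -f1-3 |
    awk '{print $1"/gateway/authtkn/knoxtoken/api/v1/token"}'
}

# Possible usage, equivalent to the command above:
#   export CDE_TOKEN=$(curl -u <workload_user> "$(token_url '<grafana_charts>')" | jq -r '.access_token')
```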




Now we have everything we need to make REST API calls. You can test and view API documentation directly in the virtual cluster by selecting the API DOC link found in Cluster Details.




The command needed to make any REST API call is:

curl -H "Authorization: Bearer ${CDE_TOKEN}" -X <request_method> "${CDE_JOB_URL}/<api_command>" <api_options> | jq .


<request_method> is DELETE, GET, PATCH, POST or PUT; depending on your request
<api_command> is the command you’d like to execute from API DOC
<api_options> are the required options for requested command
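Since every call repeats the same header and base URL, the pattern above can be wrapped in a small function. This is only a sketch: cde_api is a hypothetical name, and it assumes CDE_TOKEN and CDE_JOB_URL have already been exported as described earlier.

```shell
# Hypothetical wrapper around the generic curl call.
# $1 = request method (DELETE, GET, PATCH, POST or PUT)
# $2 = API command, without a leading slash (e.g. resources/cde_REPORTS)
# remaining arguments = extra curl options for the request
cde_api() {
  if [ "$#" -lt 2 ]; then
    echo "usage: cde_api <request_method> <api_command> [api_options...]" >&2
    return 1
  fi
  local method="$1" command="$2"
  shift 2
  curl -H "Authorization: Bearer ${CDE_TOKEN}" -X "$method" \
    "${CDE_JOB_URL}/${command}" "$@"
}

# Possible usage:
#   cde_api GET resources/cde_REPORTS | jq .
```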



Run Jobs using REST API


Let’s create a resource, cde_REPORTS, which will hold a Python program Create_Reports.py.


curl -H "Authorization: Bearer ${CDE_TOKEN}" -X POST "${CDE_JOB_URL}/resources" -H "Content-Type: application/json" -d "{ \"name\": \"cde_REPORTS\"}"


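Next, upload Create_Reports.py to the resource. Replace /home/gdeleon/tmp with the folder that holds your copy of the file:

curl -H "Authorization: Bearer ${CDE_TOKEN}" -X PUT "${CDE_JOB_URL}/resources/cde_REPORTS/Create_Reports.py" -F 'file=@/home/gdeleon/tmp/Create_Reports.py'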


Let’s verify:

curl -H "Authorization: Bearer ${CDE_TOKEN}" -X GET "${CDE_JOB_URL}/resources/cde_REPORTS" | jq .




Let’s schedule a job, Create_Report, to run at thirty minutes past every hour:

curl -H "Authorization: Bearer ${CDE_TOKEN}" -X POST "${CDE_JOB_URL}/jobs" -H "accept: application/json" -H "Content-Type: application/json" -d "{ \"name\": \"Create_Report\", \"type\": \"spark\", \"retentionPolicy\": \"keep_indefinitely\", \"mounts\": [ { \"dirPrefix\": \"/\", \"resourceName\": \"cde_REPORTS\" } ], \"spark\": { \"file\": \"Create_Reports.py\", \"conf\": { \"spark.pyspark.python\": \"python3\" } }, \"schedule\": { \"enabled\": true, \"user\": \"gdeleon\", \"cronExpression\": \"30 */1 * * *\", \"start\": \"2020-08-18\", \"end\": \"2021-08-18\" } }"
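The escaped quotes make the job definition above hard to read. One alternative, sketched below, is to generate the same JSON from a here-document and pass it to curl on standard input. The field names and values are copied from the command above. The helper name create_report_payload is hypothetical, and the schedule user is made a parameter on the assumption that you will want your own workload user there instead of gdeleon. The start and end dates are kept from the example and would need to cover the period in which you want the job to run.

```shell
# Hypothetical helper: print the Create_Report job definition as readable JSON.
# $1 = schedule user (your CDP workload user name)
create_report_payload() {
  cat <<EOF
{
  "name": "Create_Report",
  "type": "spark",
  "retentionPolicy": "keep_indefinitely",
  "mounts": [ { "dirPrefix": "/", "resourceName": "cde_REPORTS" } ],
  "spark": {
    "file": "Create_Reports.py",
    "conf": { "spark.pyspark.python": "python3" }
  },
  "schedule": {
    "enabled": true,
    "user": "$1",
    "cronExpression": "30 */1 * * *",
    "start": "2020-08-18",
    "end": "2021-08-18"
  }
}
EOF
}

# Possible usage ("-d @-" tells curl to read the request body from standard input):
#   create_report_payload gdeleon |
#     curl -H "Authorization: Bearer ${CDE_TOKEN}" -X POST "${CDE_JOB_URL}/jobs" \
#          -H "accept: application/json" -H "Content-Type: application/json" -d @-
```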


Let’s take a look at the most recent job execution:

curl -H "Authorization: Bearer ${CDE_TOKEN}" -X GET "${CDE_JOB_URL}/jobs?latestjob=true&filter=name%5Beq%5DCreate_Report&limit=20&offset=0&orderby=name&orderasc=true" | jq .


Let’s review the job output:

curl -H "Authorization: Bearer ${CDE_TOKEN}" -X GET "${CDE_JOB_URL}/job-runs/<JOB_ID>/logs?type=driver%2Fstdout"






Congratulations on completing the tutorial.

You have learned to interact with Cloudera Data Engineering (CDE) using both the command line interface (CLI) and RESTful APIs. You are encouraged to incorporate what you’ve learned into your favorite continuous integration (CI) tool.

