Cloudera Tutorials

Introduction

 

See for yourself how easy it is to use Cloudera Machine Learning (CML) on Cloudera Data Platform Public Cloud (CDP-PC).

In this tutorial, we will use a publicly available jet engine dataset that simulates how jet engines degrade over time. We will create a predictive model to estimate a jet engine's remaining useful life (RUL) and perform a cost-benefit analysis.

 

Prerequisites

 

  • Have access to Cloudera Data Platform (CDP) Public Cloud
  • Have created a CDP workload User
  • Ensure proper CML role access
    • MLUser - ability to run workloads
    • MLAdmin - ability to create and delete workspaces

 

 

Download Assets

 

Download the tutorial files and note where you save them. There is no need to unzip the archive.

For now, this is all we need to do. We will use these files later in the tutorial.

Note: The jet engine dataset is publicly available from Kaggle; a few alterations were made for this tutorial.

 

 

Setup Machine Learning Environment

 

Create Workspace

 

If you don’t already have a machine learning workspace provisioned for you, let’s create it.

Select Machine Learning from Cloudera Data Platform (CDP) home page:

 

cdp-home-ml

 

In the ML Workspaces section, select Provision Workspace:

 

cml-workspaces-provision-button

 

Two pieces of information are needed to provision a workspace: Workspace Name and Environment. For example:

  1. Workspace Name: cml-tutorial
  2. Environment: use your environment name
  3. Select Provision Workspace

 

cml-workspaces-provision-form

 

Create Project

 

Beginning from the ML Workspaces section:

  1. Open your workspace by selecting its name: cml-tutorial

  2. Select New Project

 

cml-workspaces-open-workspace

 

Complete the New Project form using:

  1. Project Name: Jet Engine RUL

  2. Project Visibility: Public

  3. Initial Setup: Local
    Upload or drag and drop the tutorial-files.zip you downloaded earlier

  4. Create Project

 

cml-new-project-form

 

Create Model

 

Now that we have a working environment, we'll create a session to begin building our regression model.

 

First, let's briefly go over the jet engine data files we'll be working with.
There are four (4) scenarios, each consisting of three (3) files containing engine sensor data for about 100-300 engines:

training_FD00x.csv

We will use this data to train and create our regression model to estimate the jet engine's remaining useful life (RUL). Sensor data and RUL are included in the same file.

test_FD00x.csv
RUL_FD00x.txt

The sensor test data and RUL are in separate files. We will use this data to test our model and run a few calculations to validate accuracy and associated cost savings.
The cost savings are based on correctly predicting Cycle_Alert_Threshhold, which represents the number of flights remaining before an engine requires maintenance.

In each filename, x is the scenario number (1, 2, 3 or 4).
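To make the role of Cycle_Alert_Threshhold concrete, here is a minimal sketch of how RUL values could be turned into "needs maintenance" labels. The helper name `needs_maintenance`, the sample RUL values, and the choice of "at or below the threshold" are assumptions for illustration; the tutorial's script may label engines differently.

```python
# Hypothetical helper: the tutorial's actual labeling code is not shown here.
def needs_maintenance(rul_values, cycle_alert_threshold):
    """Return 1 for each engine whose RUL is at or below the threshold, else 0.

    Assumption: an engine is flagged when its remaining useful life
    (in flights) is less than or equal to the alert threshold.
    """
    return [1 if rul <= cycle_alert_threshold else 0 for rul in rul_values]


# Made-up RUL values for five engines (not taken from the dataset).
true_rul = [112, 38, 50, 7, 41]

labels_40 = needs_maintenance(true_rul, 40)
labels_50 = needs_maintenance(true_rul, 50)

print(labels_40)  # [0, 1, 0, 1, 0]
print(labels_50)  # [0, 1, 1, 1, 1]
```

Raising the threshold from 40 to 50 flags more engines, which is why the same model can score differently for the two settings.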

 

Let's continue with the fun stuff - code!

Beginning from the Projects section, select the project name, Jet Engine RUL.

Select New Session and complete the session form:

  1. Session Name: Untitled Session

  2. Editor: Workbench

  3. Kernel: Python 3

  4. Engine Image: Default

  5. Resource Profile: Default (1 vCPU / 2 GiB Memory)

  6. Start Session

 

cml-new-session-form

 

Let’s open a terminal window by selecting >_ Terminal Access, then type:

sh cdsw-build.sh

This will install the dependent libraries needed for the project (sklearn, pandas, numpy and xgboost). Once it completes, close the terminal window.

 

cml-session-terminal

 

NOTE: You only need to install dependent libraries once - this step can be skipped in future sessions.
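If you want to confirm that the libraries are in place before skipping the build step in a later session, a quick check like the following can help. This is a sketch, not part of the tutorial files; the function name `missing_modules` is made up, and the module list simply mirrors the libraries named above.

```python
import importlib.util

# Import names of the libraries the build script is said to install.
REQUIRED_MODULES = ["sklearn", "pandas", "numpy", "xgboost"]


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing libraries, run the build script again:", ", ".join(missing))
else:
    print("All required libraries are installed.")
```

Run it from the session's workbench; if anything is reported missing, run the build script again from the terminal.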

 

 

Data Cleansing

 

Before we create our model, we should take some time and understand the data provided. Sensor data that do not influence our model should be removed.
This is an iterative and time-consuming process - patience is a virtue!

 

File utils.py has two functions to assist in this process:

trainDataPlot() - visualize/plot sensor data column compared to RUL

trainDataCorrelation() - see correlation matrix/heatmap of sensor data, including RUL
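The idea behind a correlation check can be sketched in plain Python. The example below is not the code in utils.py; the function `pearson`, the sensor names, and the sample values are assumptions for illustration. It computes the Pearson correlation between each sensor column and RUL, and treats a constant column as having no usable correlation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences, or None if undefined."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # a constant column carries no information about RUL
    return cov / math.sqrt(var_x * var_y)


# Made-up readings for one engine over five cycles.
rul = [4, 3, 2, 1, 0]
sensors = {
    "sensor_a": [10.0, 11.0, 12.0, 13.0, 14.0],  # rises as RUL falls
    "sensor_b": [5.0, 5.0, 5.0, 5.0, 5.0],       # never changes
}

correlations = {name: pearson(values, rul) for name, values in sensors.items()}
for name, value in correlations.items():
    print(name, value)
```

In this toy data, sensor_a is strongly (negatively) correlated with RUL and is worth keeping, while sensor_b is constant and is a candidate for removal.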

 

Data cleansing has already been done for you - columns that do not influence our model have been commented out. You are encouraged to go through the exercise and verify.

 

Now that our data is cleansed, let's create our model:
From the list of files, select Jet_Engine-Modeling.py and click the run button to run the entire program.

 

run-program

 

The output (above) shows an accuracy of 93.00%, AUC of 0.89 and savings of $4.85M.
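For readers new to these metrics, here is a small sketch of how accuracy and AUC can be computed for a binary "needs maintenance" classifier. The functions and the sample labels and scores are made up for illustration; the tutorial script likely relies on library routines instead, and its savings calculation is not reproduced here.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive example
    receives a higher score than a randomly chosen negative one
    (ties count as half).
    """
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels (1 = needs maintenance) and model scores.
y_true = [1, 0, 1, 0, 0]
scores = [0.9, 0.2, 0.4, 0.6, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print(f"accuracy = {accuracy(y_true, y_pred):.2f}")
print(f"AUC      = {auc(y_true, scores):.2f}")
```

Accuracy depends on the chosen cutoff (0.5 here), whereas AUC summarizes how well the scores rank engines across all cutoffs.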

Now that we've verified one scenario, let's see how this model performs on the rest of the provided data. Let's run some experiments.

 

 

Run Experiments

 

Now that we have our model, let's run several experiments using other provided scenarios. Let's see how well we classify the engines that require maintenance, using Cycle_Alert_Threshhold of 40 and 50.

 

Beginning from the Projects section, select the project name, Jet Engine RUL.

Next, in the Experiments section, select Run Experiment and complete the Run New Experiment form:

  1. Script: Jet-Engine-Modeling.py

  2. Arguments: FD001 40

  3. Engine Kernel: Python 3

  4. Engine Profile: Default or smallest resource available

  5. Start Run

 

experiments-run-new

 

Run a few more experiments. Repeat the same steps, using the following arguments:

Arguments: FD001 50

Arguments: FD002 40

Arguments: FD002 50

Arguments: FD003 40

Arguments: FD003 50

Arguments: FD004 40

Arguments: FD004 50
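The eight argument strings above are simply every combination of scenario and threshold. As a sketch, they can be generated as shown below; the `parse_arguments` function is a guess at how a script might read them and is not taken from the tutorial's code.

```python
from itertools import product

scenarios = ["FD001", "FD002", "FD003", "FD004"]
thresholds = [40, 50]

# One "Arguments" string per experiment, in the order listed above.
argument_strings = [f"{s} {t}" for s, t in product(scenarios, thresholds)]


def parse_arguments(argv):
    """Hypothetical parser: split experiment arguments into (scenario, threshold)."""
    scenario, threshold = argv[0], int(argv[1])
    return scenario, threshold


for arguments in argument_strings:
    print(arguments, "->", parse_arguments(arguments.split()))
```

Listing the combinations this way makes it easy to confirm that no scenario or threshold was skipped when reviewing the experiment results.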

 

 

Let's review our experiments - make sure to select all three (3) metrics: AUC, Accuracy and Savings.

Wow! Accuracy ranges from 87% to 96%, and look at the money saved.

 

 

experiments-results

 

Summary

 

Congratulations on completing the tutorial.

Overall, our predictive model performed well in all scenarios - some performed better than others. Although data cleansing was done for you, what would happen to our model if we included more sensor data?

As you have seen, it is easy to use Cloudera Machine Learning (CML) on Cloudera Data Platform. This is only the beginning - there is so much more to learn.

 
