
Cloudera Tutorials


 

Introduction

 

Data Lifecycle - Predictive Analytics. Using synthetic datasets for an electric car company, we will predict the amount of inventory that may be required for a specific part based on historical part consumption. We will create a pipeline that, whenever new data is collected, automatically predicts the inventory needed for the next seven (7), fourteen (14) and twenty-one (21) days.

 

 

Prerequisites

 

 

 


 

Download Assets

 

There are two (2) options for getting the assets for this tutorial:

  1. Download a ZIP file

It contains all necessary files used in this tutorial. Remember its location. No need to unzip the file.

  2. Clone GitHub repository

It provides assets used in this and other tutorials; organized by tutorial title.

 

 

Create ML Workspace

 

If your environment doesn’t already have a Machine Learning Workspace provisioned, let’s provision it.

Select Machine Learning from Cloudera home page:

 

cdp-main-menu.

In the ML Workspaces section, select Provision Workspace.

Provisioning an ML workspace requires only two pieces of information: the Workspace Name and the Environment name. For example:

  1. Workspace Name: Electric-Car-ML
  2. Environment: <your environment name>
  3. Select Provision Workspace

 

cml-workspace-provision

 

Create Project

 

Beginning from the ML Workspaces section:

  1. Open your workspace by selecting its name: Electric-Car-ML
  2. Select New Project

Complete the New Project form using:

  1. Project Name: Electric-Car-Inventory
  2. Project Description:
    This project houses machine learning code that creates models that predict the amount of inventory for any given part that should be produced based on historical part consumption and future car production rate. All data within this project is synthetic.
  3. Initial Setup: Local Files
    Upload or drag and drop the tutorial-files.zip file you downloaded earlier
  4. Select Create Project

 

cml-new-project

 

Create and Test Model

 

Now that we have a working machine learning environment, let’s create a session. We will open a Workbench editor with a Python 3 kernel to create and test our model.

 

Beginning from the Projects section, select your project name, Electric-Car-Inventory.

Select New Session and complete the session form:

  1. Session Name: Build Model
  2. Editor: Workbench
  3. Kernel: Python 3
  4. Engine Image: <use default>
  5. Resource Profile: <use default, 1 vCPU / 2 GiB Memory>
  6. Select Start Session

 

cml-new-session

At the bottom of the session, you will find a command line prompt. Issue the following command to install the dependent libraries for this project (xgboost, sklearn and Faker):

!sh ./cdsw-build.sh

NOTE: You only need to install dependent libraries once - this step can be skipped on future sessions.
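
To decide whether the install step can be skipped in a later session, you could check from the same prompt whether the libraries are already importable. The following is a minimal sketch; the import names are an assumption (xgboost, scikit-learn and Faker are normally imported as `xgboost`, `sklearn` and `faker`), and `missing_modules` is a hypothetical helper, not part of the tutorial assets.

```python
import importlib.util

# Assumed import names for the three libraries named in this tutorial.
REQUIRED_MODULES = ["xgboost", "sklearn", "faker"]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Run the build script; missing:", ", ".join(missing))
else:
    print("All dependent libraries are already installed.")
```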

 

cml-install-dependencies

Select the file Part_Modeling.py and click the run button to run the entire program.

Using the datasets provided, a linear regression model will be created and tested.
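
To illustrate what "created and tested" means here, the sketch below fits a one-feature linear regression on made-up data and measures its error on held-out observations. It is not the code in Part_Modeling.py: the data, the single feature (cars built per day) and the helper `fit_line` are all hypothetical, and only the Python standard library is used.

```python
import random

rng = random.Random(0)

# Synthetic history (hypothetical numbers): cars built per day and parts used.
cars = list(range(20, 80))
parts = [3.0 * c + 5.0 + rng.gauss(0, 2.0) for c in cars]

# Hold out every fifth observation for testing; train on the rest.
pairs = list(zip(cars, parts))
train = [pair for i, pair in enumerate(pairs) if i % 5 != 0]
test = [pair for i, pair in enumerate(pairs) if i % 5 == 0]


def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(train)

# Mean absolute error on the held-out observations.
mae = sum(abs(slope * x + intercept - y) for x, y in test) / len(test)
print(f"slope={slope:.2f} intercept={intercept:.2f} test MAE={mae:.2f}")
```

Testing on observations the model never saw during fitting is what makes the error figure a fair estimate of how it will behave on new data.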

Now that we’ve created our model, we no longer need this session - select Stop to terminate the session.

 

cml-create-test-model

 

Create Jobs with Dependencies

 

 

We will create a machine learning pipeline that consists of four (4) jobs. When the main job, Data Collection, is run, the three dependent jobs will run automatically, each predicting the number of parts needed over its own forecast window (7, 14 or 21 days).

Beginning from the Jobs section, select New Job and fill out the form as follows:

  1. Name: Data Collection
  2. Script: data_generator.py
  3. Arguments: <leave blank>
  4. Engine Kernel: Python 3
  5. Schedule: Manual
  6. Select Create Job

Repeat using:

  1. Name: EV Part Forecast 7 Days
  2. Script: Part_Modeling.py
  3. Arguments: 12 15 18 a42CLDR 7
  4. Engine Kernel: Python 3
  5. Schedule: Dependent, Data Collection
  6. Select Create Job

Repeat using:

  1. Name: EV Part Forecast 14 Days
  2. Script: Part_Modeling.py
  3. Arguments: 12 15 18 a42CLDR 14
  4. Engine Kernel: Python 3
  5. Schedule: Dependent, Data Collection
  6. Select Create Job

Repeat using:

  1. Name: EV Part Forecast 21 Days
  2. Script: Part_Modeling.py
  3. Arguments: 12 15 18 a42CLDR 21
  4. Engine Kernel: Python 3
  5. Schedule: Dependent, Data Collection
  6. Select Create Job
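
The Arguments field passes space-separated values to the script. The sketch below shows one way a script could read them. The argument order is an assumption inferred from the field names in the Example Input used in the Deploy Model section (Model C sales, Model D sales, Model R sales, part number, forecast days); `ForecastArgs` and `parse_args` are hypothetical names, not taken from Part_Modeling.py.

```python
from typing import NamedTuple


class ForecastArgs(NamedTuple):
    model_c_sales: int
    model_d_sales: int
    model_r_sales: int
    part_no: str
    time_delta: int  # forecast window in days


def parse_args(argv):
    """Parse five positional arguments in the assumed order."""
    if len(argv) != 5:
        raise ValueError("expected 5 arguments: C D R part_no time_delta")
    c, d, r, part_no, days = argv
    return ForecastArgs(int(c), int(d), int(r), part_no, int(days))


# In a job script the list would come from sys.argv[1:]; here the
# tutorial's 7-day argument string is split by hand for illustration.
print(parse_args("12 15 18 a42CLDR 7".split()))
```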

 

cml-create-job-dependencies

Your machine learning pipeline should look like this:

 

cml-job-pipeline

Now that you’ve created your pipeline, whenever you manually run the Data Collection job, the other three forecasting jobs will run automatically. To see the prediction results for each job, select the job’s name. Take a look at the History and select the Run you’re interested in.
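
The dependency behaviour can be pictured with a small simulation. This is only an illustration of the idea, not how the scheduler is implemented: the job names come from this tutorial, while the dictionary layout and the `run_order` helper are hypothetical.

```python
# Each job maps to the job it depends on (None means it is run manually).
DEPENDS_ON = {
    "Data Collection": None,
    "EV Part Forecast 7 Days": "Data Collection",
    "EV Part Forecast 14 Days": "Data Collection",
    "EV Part Forecast 21 Days": "Data Collection",
}


def run_order(trigger, depends_on):
    """Return the triggered job followed by every job that depends on it."""
    order = [trigger]
    # The list grows while it is iterated, so dependents of dependents
    # would also be picked up if the pipeline had more levels.
    for job in order:
        order.extend(name for name, parent in depends_on.items() if parent == job)
    return order


for job in run_order("Data Collection", DEPENDS_ON):
    print("run:", job)
```

Running a forecast job on its own triggers nothing else, because no other job lists it as a dependency.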

 

 

Deploy Model

 

Let’s deploy three (3) models so that they can be called programmatically from other applications.

Beginning from the Models section, select New Model and fill out the form as follows:

  1. Name: EV Part Prediction Model 7 Days
  2. Description: One week prediction
  3. Disable Authentication
  4. File: model-wrapper.py
  5. Function: PredictFunc
  6. Example Input: 
    {  "model_C_sales": "40"
    ,  "model_D_sales": "82"
    ,  "model_R_sales":"34"
    ,  "part_no": "a42CLDR"
    ,  "time_delta": "7"}
  7. Kernel: Python 3
  8. Select Deploy Model

Repeat using:

  1. Name: EV Part Prediction Model 14 Days
  2. Description: Two week prediction
  3. Disable Authentication
  4. File: model-wrapper.py
  5. Function: PredictFunc
  6. Example Input: 
    {  "model_C_sales": "40"
    ,  "model_D_sales": "82"
    ,  "model_R_sales":"34"
    ,  "part_no": "a42CLDR"
    ,  "time_delta": "14"}
  7. Kernel: Python 3
  8. Select Deploy Model

Repeat using:

  1. Name: EV Part Prediction Model 21 Days
  2. Description: Three week prediction
  3. Disable Authentication
  4. File: model-wrapper.py
  5. Function: PredictFunc
  6. Example Input: 
    {  "model_C_sales": "40"
    ,  "model_D_sales": "82"
    ,  "model_R_sales":"34"
    ,  "part_no": "a42CLDR"
    ,  "time_delta": "21"}
  7. Kernel: Python 3
  8. Select Deploy Model
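
For orientation, here is a sketch of the general shape such a wrapper function could take, assuming the deployed function receives the Example Input as a Python dictionary and returns something JSON-serializable. It is a stand-in, not the contents of model-wrapper.py: the `PARTS_PER_CAR` coefficients are placeholders, and a real wrapper would call the trained model instead.

```python
# Placeholder coefficients (hypothetical): parts consumed per car sold per day.
PARTS_PER_CAR = {
    "model_C_sales": 0.5,
    "model_D_sales": 0.25,
    "model_R_sales": 1.0,
}


def PredictFunc(args):
    """Take the request dictionary and return a JSON-serializable result."""
    # The Example Input sends every value as a string, so convert first.
    days = int(args["time_delta"])
    daily_parts = sum(
        rate * float(args[key]) for key, rate in PARTS_PER_CAR.items()
    )
    return {
        "part_no": args["part_no"],
        "time_delta": days,
        "predicted_parts": round(daily_parts * days, 2),
    }


example = {
    "model_C_sales": "40",
    "model_D_sales": "82",
    "model_R_sales": "34",
    "part_no": "a42CLDR",
    "time_delta": "7",
}
print(PredictFunc(example))
```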

 

cml-model-deployment

Your models should look like this:

 

cml-created-models

You can now run your model programmatically from almost anywhere - shell, Python, R. Select a model name to test the API.
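
As a sketch of a Python client, the code below builds the HTTP request without sending it. The URL and access key are placeholders, and the `accessKey`/`request` body layout is an assumption; copy the real values and sample call from the model's overview page. `build_request` is a hypothetical helper.

```python
import json
import urllib.request

MODEL_URL = "https://modelservice.example.com/model"  # placeholder URL
ACCESS_KEY = "<your-model-access-key>"  # placeholder access key


def build_request(features, url=MODEL_URL, access_key=ACCESS_KEY):
    """Build a JSON POST request carrying the access key and model input."""
    body = json.dumps({"accessKey": access_key, "request": features})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request(
    {
        "model_C_sales": "40",
        "model_D_sales": "82",
        "model_R_sales": "34",
        "part_no": "a42CLDR",
        "time_delta": "7",
    }
)

# Sending the request needs a reachable deployment, so it is left commented:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```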

 

cml-model-prediction-result

 

Summary

 

Congratulations on completing the tutorial.

As you have seen, it is easy to use Cloudera Machine Learning (CML) on Cloudera Data Platform. This is only the beginning - there is so much more to learn.

 
