Data Lifecycle - Predictive Analytics. Using synthetic datasets for an electric car company, we will predict the inventory required for a specific part based on historical part consumption. We will create a pipeline that, whenever new data is collected, automatically predicts the inventory needed for the next seven (7), fourteen (14) and twenty-one (21) days.
If your environment doesn’t already have a Machine Learning Workspace provisioned, let’s provision it.
Select Machine Learning from the Cloudera Data Platform (CDP) home page:
In the ML Workspaces section, select Provision Workspace.
Only two pieces of information are needed to provision an ML workspace: the Workspace Name and the Environment name. For example:
Electric-Car-ML
Beginning from the ML Workspaces section:
Electric-Car-ML
Complete the New Project form using:
Electric-Car-Inventory
This project houses machine learning code that creates models that predict the amount of inventory for any given part that should be produced based on historical part consumption and future car production rate. All data within this project is synthetic.
Now that we have a working machine learning environment, let’s create a session. We will create a workbench editor with a Python 3 interpreter to create and test our model.
Beginning from the Projects section, select your project name, Electric-Car-Inventory.
Select New Session and complete the session form:
Build Model
At the bottom of the session, we will find a command line prompt. Issue the following command to install the libraries this project depends on (xgboost, sklearn and Faker):
!sh ./cdsw-build.sh
NOTE: You only need to install dependent libraries once - this step can be skipped on future sessions.
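If you are unsure whether the libraries are already installed in your session, a quick check like the one below can help. This is a minimal sketch, not part of the project: the helper name missing_packages is hypothetical, and the import names (xgboost, sklearn, faker) are assumed from the libraries listed above.

```python
import importlib.util

# Import names assumed for the three libraries mentioned above.
REQUIRED = ["xgboost", "sklearn", "faker"]


def missing_packages(names):
    """Return the names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Not installed yet:", ", ".join(missing))
else:
    print("All dependent libraries are available.")
```

If anything is reported as not installed, run the build script again before continuing.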
Select the file Part_Modeling.py and run the entire program.
Using the datasets provided, a linear regression model will be created and tested.
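To make the idea of creating and testing a linear regression model concrete, here is a minimal sketch in plain Python. It does not reproduce Part_Modeling.py: the function names, the single input feature (cars built) and the numbers are all made up for illustration, and the data is deliberately noise-free so the fit is easy to follow.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) pair to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Made-up history: cars built per day and units of one part consumed.
cars_built = [10, 12, 15, 18, 20, 22, 25, 30]
parts_used = [21, 25, 31, 37, 41, 45, 51, 61]

# Hold out the last two observations to test the fitted model.
train_x, test_x = cars_built[:6], cars_built[6:]
train_y, test_y = parts_used[:6], parts_used[6:]

model = fit_line(train_x, train_y)
mae = sum(abs(predict(model, x) - y) for x, y in zip(test_x, test_y)) / len(test_x)
print("slope, intercept:", model)
print("mean absolute error on held-out data:", mae)
```

The same pattern (fit on part of the history, measure the error on the rest) applies when a library such as sklearn is used instead of the hand-written fit.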
Now that we’ve created our model, we no longer need this session - select Stop to terminate the session.
We will create a machine learning pipeline that consists of four (4) jobs. When the main job, Data Collection, is run, the three dependent jobs will run automatically, each predicting the number of parts needed over its own forecast period (7, 14 or 21 days).
Beginning from the Jobs section, select New Job and fill out the form as follows:
Data Collection
data_generator.py
Repeat using:
EV Part Forecast 7 Days
Part_Modeling.py
12 15 18 a42CLDR 7
Repeat using:
EV Part Forecast 14 Days
Part_Modeling.py
12 15 18 a42CLDR 14
Repeat using:
EV Part Forecast 21 Days
Part_Modeling.py
12 15 18 a42CLDR 21
Your machine learning pipeline should look like this:
Now that you’ve created your pipeline, whenever you manually run the Data Collection job, the other three forecasting jobs will run automatically. To see the prediction results for each job, select the job’s name. Take a look at the History and select the Run you’re interested in.
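Each forecasting job passes five arguments to Part_Modeling.py, for example 12 15 18 a42CLDR 7. The sketch below shows one way a script could read such arguments. It is an assumption, not the project's code: the helper parse_job_args is hypothetical, and the mapping of positions to field names (three sales figures, a part number and a forecast period in days) is inferred from the field names used in the model input later in this tutorial.

```python
def parse_job_args(argv):
    """Map five positional arguments to named fields (assumed order)."""
    if len(argv) != 5:
        raise ValueError(
            "expected 5 arguments: C_sales D_sales R_sales part_no time_delta"
        )
    c_sales, d_sales, r_sales, part_no, days = argv
    return {
        "model_C_sales": int(c_sales),
        "model_D_sales": int(d_sales),
        "model_R_sales": int(r_sales),
        "part_no": part_no,
        "time_delta": int(days),
    }


# In a real script the list would come from sys.argv[1:];
# here we use the arguments of the 7-day job directly.
example = parse_job_args("12 15 18 a42CLDR 7".split())
print(example)
```

Only the last argument differs between the three forecasting jobs, which is why the same script can serve all of them.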
Let’s create three (3) models so that they can be called programmatically from other applications.
Beginning from the Models section, select New Model and fill out the form as follows:
EV Part Prediction Model 7 Days
One week prediction
{
  "model_C_sales": "40",
  "model_D_sales": "82",
  "model_R_sales": "34",
  "part_no": "a42CLDR",
  "time_delta": "7"
}
Repeat using:
EV Part Prediction Model 14 Days
Two week prediction
{
  "model_C_sales": "40",
  "model_D_sales": "82",
  "model_R_sales": "34",
  "part_no": "a42CLDR",
  "time_delta": "14"
}
Repeat using:
EV Part Prediction Model 21 Days
Three week prediction
{
  "model_C_sales": "40",
  "model_D_sales": "82",
  "model_R_sales": "34",
  "part_no": "a42CLDR",
  "time_delta": "21"
}
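The example inputs above arrive at the model as a dictionary of strings. The sketch below shows what a function receiving such input could look like. It is illustrative only: the function name predict_parts is hypothetical, and the arithmetic is a placeholder (one part per car, per day), not the model that Part_Modeling.py trains.

```python
def predict_parts(args):
    """Convert the string inputs to numbers and return a forecast.

    The calculation is a placeholder assumption: one unit of the part
    per car sold, multiplied by the number of days in the forecast.
    """
    cars_per_day = (
        int(args["model_C_sales"])
        + int(args["model_D_sales"])
        + int(args["model_R_sales"])
    )
    days = int(args["time_delta"])
    return {
        "part_no": args["part_no"],
        "time_delta": days,
        "predicted_units": cars_per_day * days,
    }


one_week = predict_parts({
    "model_C_sales": "40",
    "model_D_sales": "82",
    "model_R_sales": "34",
    "part_no": "a42CLDR",
    "time_delta": "7",
})
print(one_week)
```

Note that every value in the example input is a string, so the function has to convert the numeric fields before using them.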
Your models should look like this:
You can now run your model programmatically from almost anywhere - shell, Python, R. Select a model name to test the API.
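As a rough idea of what a call from Python could look like, here is a minimal sketch using only the standard library. Treat the details as assumptions: the request body shape (an accessKey plus a request object), the function names, and any URL or key you pass in are placeholders, so compare them with the sample code shown on your model's page before relying on them.

```python
import json
import urllib.request


def build_model_request(access_key, features):
    """Encode the request body (assumed shape: accessKey + request)."""
    body = {"accessKey": access_key, "request": features}
    return json.dumps(body).encode("utf-8")


def call_model(url, access_key, features, timeout=30):
    """POST the features to the model endpoint and decode the JSON reply."""
    req = urllib.request.Request(
        url,
        data=build_model_request(access_key, features),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


features = {
    "model_C_sales": "40",
    "model_D_sales": "82",
    "model_R_sales": "34",
    "part_no": "a42CLDR",
    "time_delta": "7",
}
# Placeholder key; the real URL and access key come from your model's page.
payload = build_model_request("YOUR_ACCESS_KEY", features)
print(payload.decode("utf-8"))
```

Calling call_model with your model's URL and access key sends the same payload over HTTPS and returns the decoded response.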