Cloudera Tutorials

Optimize your time with detailed tutorials that clearly explain the best way to deploy, use, and manage Cloudera products. Login or register below to access all Cloudera tutorials.

Cloudera named a leader in 2022 Gartner® Magic Quadrant™ for Cloud Database Management Systems Get the report


In this tutorial, you will learn about the linear regression model, using Cloudera Machine Learning (CML), an experience on Cloudera Data Platform (CDP). If you are aspired to become a data scientist, linear regression is the first algorithm you need to master. Linear regression is very simple to understand, and it is a very powerful algorithm that is used today by many firms to help with decision making.

Despite it being a very simple algorithm to start with, the optimization of the model is very important to gain higher accuracy. In this tutorial, you will learn important concepts of regression analysis using Python. By the end of this tutorial you’ll be able to understand how to build, visualize, and evaluate models on your own. We will be focusing on simple linear regression and multiple linear regression models in this tutorial.



  • To better understand this tutorial, you should have a basic knowledge of statistics and linear algebra.


    You should have Python3 session setup in CML before you start implementing the code.



  • Understanding concepts behind linear regression

  • Implementation of linear regression using scikit-learn

  • Advanced section: mathematical approach

  • Conclusion

1.Understanding concepts behind linear regression


Linear regression is a technique for estimating linear relationships between various features and a continuous target variable. Regression means estimating a continuous real-value output. For example, if you have data that contains selling prices of houses in your city, you can estimate the selling price of your house based on that data and understand the market. Regression analysis is a subfield of Supervised Learning. Some of the questions that regression can answer  if you are dealing with housing data are as follows:

  • How much more can I sell my house for with an additional bedroom and bathroom?

  • Do houses located near malls sell for more or less than others?

  • What is the impact of lot size on housing prices?

Let’s understand simple linear regression using an example: Suppose you have the data for employees’ years of experience and their corresponding salaries. In order to perform analysis on this data, you need to make sure you collect the right data, clean the data, and transform the data. It is very important to perform exploratory data analysis to observe the distribution of the data and analyze the patterns. Fig 1 shows the plotted employee data, in which the x axis corresponds to years of experience, and the y axis corresponds to salary. Each point (x,y) denotes a training example. Let’s consider ‘t’ to be the total number of examples in the dataset. 


You are going to fit a line in the form of y = β0 + β1x  to your data. Here we call x the independent variable or input variable; y is called the dependent variable or output variable. β1 represents the slope of the line; β0 represents the intercept of the line. The goal of the algorithm is to find a line that best fits the set of points to minimize the error (the error is the distance from the line to the points). The line moves according to the parameter values (β0,β1). 

Fig2 : Error calculated as the vertical distance between regression line to the actual values

Key terminology to understand when building a linear model:

Hypothesis function refers to the equation h(x)= β0 + β1x which maps input x (years of experience) to target y (salary). Now the question is how do we find the best parameters to fit the line with minimum error? The concepts below help us to answer this question.

Cost Function: The cost function determines how to fit the best line to the data. For hypothesis function h(x) =  β0 + β1x choose β0,β1 so that h(x) is close to y for training examples (x,y). This represents the optimization problem to solve. Regression uses ordinary least squares methodology which is the summation of the squared difference between actual value(y) and the predicted value(y') which can be denoted as ∑(y - y')² for all training examples t in the dataset  to find the best possible value of the regression coefficients β0, β1. You might question why we are squaring the error instead of the absolute value: because squared error has nice mathematical properties that make it easier to differentiate and compute gradient descent, and the least squares method is easy to analyze and interpret. The cost function for linear regression is represented as:

1/(2t) ∑([h(x) - y']² for all training examples(t) 

Here t represents the number of training examples in the dataset, h(x) represents the hypothesis function defined earlier ( β0 + β1x), and y' represents predicted value. The average is taken for the cost function for easy math calculations.

Gradient descent is one of the methods that can be used to reduce the error, which helps by taking steps in the direction of a negative gradient. Gradient descent is an optimization algorithm that tweaks its parameters iteratively. In machine learning, gradient descent is used to update parameters in a model. Let us relate gradient descent with a real-life analogy for better understanding. Think of a valley you would like to descend, you first will take a step and look for the slope of the valley, whether it goes up or down. Once you are sure of the downward slope you will follow that and repeat the step again and again until you have descended completely (or reached the minima).Gradient descent does exactly the same.

Learning rate: Gradient descent is vector valued function hence it has both magnitude and dimension. The algorithm multiples the gradient by a number(learning rate) to determine the next point to take a step. For example having a gradient with a magnitude 5.3 and learning rate 0.01 then the gradient descent algorithm will pick the next point 0.053 away from the previous point. Choosing the learning rate is an important step as big learning rate can skip the minima and very small learning rate can take infinite time to reach the minima as shown in the below figure.



2. Implementation of linear regression using scikit-learn

Cloudera Machine Learning (CML) is a secure enterprise data science platform that enables data scientists to accelerate their workflow from exploration to production by providing them with their own analytics pipelines. CML allows data scientists to utilize already existing skills and tools, such as Python, R, and Scala, to run computations in Hadoop clusters. If you are new to CML, feel free to check out Tour of Data Science Work Bench to start using it and to set up your environment.  

CML allows you to run your code as a session or a job. A session is a way to interpret your code interactively, whereas a job allows you to execute your code as a batch process and can be scheduled to run recursively.

In order for us to use the Python script needed for this tutorial, select a Python 3 engine with this resource allocation configuration:

  • 1 vCPU

  • 2 GB Memory

We can use either a Jupyter Notebook as our editor or a Workbench: feel free to choose your favorite.

To finalize set-up, select the Launch Session option as shown in the below figure.


Once the session is launched, import all the necessary libraries as shown below. We are using a Boston housing dataset with the help of scikit-learn datasets that are easily available to import and load.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error, r2_score

Next, load the dataset

As our example use case, we will build a supervised learning model that predicts housing prices in Boston. Every row in the dataset has 14 attributes .

boston = load_boston()
print (boston.DESCR) # Gives the clear information about the columns


Use the below code to print out the shape of the data and features

print (boston.feature_names)

There are 506 rows and 13 attributes in this dataset. 

Next, print the first few rows to see the data using the below commands

boston_dataset = pd.DataFrame(boston.data)
boston_dataset.columns = boston.feature_names

In this section we are predicting PRICE of the house. Hence we are treating PRICE as target/output variable

boston_dataset['PRICE'] = boston.target

Understand the distribution of the data using the below command


Exploratory Data Analysis:

Next, plot the data to understand the distribution using matplotlib. Visualization helps us to understand the correlation between various features in the dataset.

Below is code to visualize the relationship between  crime rate  and the house price

import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import sklearn
import seaborn as sns
from matplotlib import rcParams
plt.scatter(boston_dataset.CRIM, boston_dataset.PRICE)
plt.xlabel("Per capita crime rate by town")
plt.ylabel("Price of the house")
plt.title("Relationship between crime rate and Price")


Observation: We can see that house prices are higher where crime rates are lower

Next, use the code below to observe relationship between  rooms per dwelling and the house price

plt.scatter(boston_dataset.RM, boston_dataset.PRICE)
plt.xlabel("Average number of rooms per dwelling")
plt.ylabel("Price of the house")
plt.title("Relationship between rooms per dwelling and Price")


Observation: We can see that house price has significant increase as the number of rooms per dwelling increases

Below is the code to visualize the relationship between pupil-teacher ratio and housing price

plt.scatter(boston_dataset.PTRATIO, boston_dataset.PRICE)
plt.xlabel("Pupil-teacher ratio by town")
plt.ylabel("Price of the house")
plt.title("Relationship between PTRATIO and Price")


It’s now your turn to try building a few more visualizations using the other features in the dataset and observe the distribution.

Now let’s perform a simple linear regression on the dataset, considering rooms per dwelling as the input variable and housing price as the target/output variable. Split the data into training and testing set using scikit-learn train_test_split function. We are using 80% of the data for training and 20% for testing. Transform the data to fit into the model using the below code.

from sklearn.linear_model import LinearRegression
X = boston_dataset.RM
X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(
    X, boston_dataset.PRICE, test_size=0.2, random_state = 5)
print (X_train.shape)
print (X_test.shape)
print (Y_train.shape)
print (Y_test.shape)
x_train_new= X_train.reshape(-1,1)
y_train_new= Y_train.reshape(-1,1)
x_test_new= X_test.reshape(-1,1)
y_test_new= Y_test.reshape(-1,1)

Instantiate linear regression object using the below code

linear_model = LinearRegression()

Next, fit the data to the linear regression model created earlier


Next, you can print the coefficient of the intercept used by the model for estimation

print ('Estimated intercept coefficient:', linear_model.intercept_)

Next, you can use testing set data to make predictions by the trained model

pred = linear_model.predict(x_test_new)

After we fit the model and make predictions used by the trained model, it’s time to observe the results of the model. In this use case we are going to calculate Mean absolute error(A measure of difference between two continuous variables. It is robust against the effect of outliers) , Mean squared error ( The average squared difference between the estimated values and the actual value. The MSE is a measure of the quality of an estimator) and R2 score(This metric describes the percentage of variance explained by the model. It ranges between 0 and 1 ). Use the below code to print the results.

print("Mean absolute error: %.2f" % np.mean(np.absolute(pred - y_test_new)))
print("Residual sum of squares (MSE): %.2f" % np.mean((pred - y_test_new) ** 2))
print("R2-score: %.2f" % r2_score(pred , y_test_new) )

As you have now understood how simple linear regression is performed, let’s perform multiple linear regression.

Implementation of multiple linear regression

Input features being used: we are using all the columns except the target variable PRICE. 

Let’s define input and output variables to fit the model. Again, we are using 80% of the data for training and 20% for testing.

X = boston_dataset.drop('PRICE', axis = 1)
X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(
X, boston_dataset.PRICE, test_size=0.2, random_state = 5)
# This creates a LinearRegression object
linear_model = LinearRegression()

Next, fit the model and predict the output for testing set

pred = linear_model.predict(X_test)

Next, we have to evaluate the results

Once the model is built it is important to evaluate the model using evaluation metrics. Similar to  simple linear regression, we are again calculating MSE,MAE and R2-score.

print("Mean absolute error: %.2f" % np.mean(np.absolute(pred - Y_test)))
print("Residual sum of squares (MSE): %.2f" % np.mean((pred - Y_test) ** 2))
print("R2-score: %.2f" % r2_score(pred , Y_test) )

You can observe significant improvement in the R2 score by applying multiple linear regression. It is rare that a dependent variable is explained by only one variable. Hence analysts make use of multiple variables to make better decisions based on the use case.

Additional evaluation metrics which can be used:

Adjusted R²:  R square assumes that every single variable explains the variation in the dependent variable. The adjusted R square tells you the percentage of variation explained by the independent variables that actually affect the dependent variable.A model that includes several predictors will return higher R square values and may seem to be a better fit. However, this result is due to it including more terms.

The adjusted R-squared compensates for the addition of variables and only increases if the new predictor enhances the model above what would be obtained by probability. Conversely, it will decrease when a predictor improves the model less than what is predicted by chance.

Root Mean Square Error (RMSE): Interpreted as how far, on average, the residuals are from zero. It nullifies the squared effect of MSE by square root and provides the result in original units as data. Again, the lower the better.

3. Advanced section: mathematical approach

In this section we will implement the cost function and gradient descent using the mathematical approach. As you have already understood how cost function and gradient descent work, let’s see how to implement in Python. We are using the simple employee dataset which consists of employee years of experience and their corresponding salaries. Download the dataset from here.

Import necessary libraries using below code:

import numpy as np
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from numpy import genfromtxt

Load the data and define input and output variables. Here we are treating years of experience as X and salary as y

my_data = genfromtxt('Salary_Data.csv', delimiter=',',skip_header=1 )
X = np.c_[np.ones(my_data.shape[0]),my_data[:,0]]
y = np.c_[my_data[:,1]]
m = y.size 

Defining the hypothesis function (assumes simplified hypothesis with β0=0 for easy visualization):

def h(beta,X): 
    return np.dot(X,beta) 

Defining the cost function: 


t represents number of training examples

(Xi, yi) denotes training example

def computeCost(val_beta,X,y): 
    return float((1./(2*m)) * np.dot((h(val_beta,X)-y).T,(h(val_beta,X)-y)))

Here the intial_beta is assigned to zero (both β0,β1)

initial_beta = np.zeros((X.shape[1],1))
print (computeCost(initial_beta,X,y))



Implementation of gradient descent to perform iterations obtaining global minima

Gradient descent algorithm works as below. We will follow the algorithm for each example in the dataset (t) and perform simultaneous update on both parameters.

In each iteration we will reset βj to βj minus the learning rate (α) times the partial derivative of β vector with respect to βj . Below algorithm is used repetitively to find the global minima.


Let’s define number of iterations and learning rate(alpha). As shown in the above figure, we follow the gradient descent algorithm using the below code to update parameters.

iterations = 500
lr= 0.01
#Initial parameters are set to 0.
iterations= 500
alpha= 0.01
def descendGradient(X, beta_start = np.zeros(2)):
    beta = beta_start
    costvec = [] #Used to plot cost as function of iteration
    betavalues = [] 
    for val in range(iterations):
        tmpbeta = beta
        #Simultaneously updating theta values
        for j in range(len(tmpbeta)):
            tmpbeta[j] = beta[j] - (alpha/m)*np.sum((h(beta,X) - y)*np.array(X[:,j]).reshape(m,1))
        beta = tmpbeta
    return beta, betavalues, costvec
initial_beta = np.zeros((X.shape[1],1))
beta, betavalues, costvec = descendGradient(X,initial_beta)

Fit the line with the parameter values obtained

def fit(xval):
    return beta[0] + beta[1]*xval
plt.plot(X[:,1],y[:,0],'rx',markersize=10,label='Training Data')
plt.plot(X[:,1],fit(X[:,1]),'b-',label = 'Hypothesis: h(x) = %0.2f + %0.2fx'%(beta[0],beta[1]))
plt.ylabel('Years of experience')


As we have already calculated the parameter values using the gradient descent algorithm, below code helps to visualize cost minimization path followed by the algorithm

from mpl_toolkits.mplot3d import axes3d, Axes3D
from matplotlib import cm
import itertools
fig = plt.figure(figsize=(20,20))
ax = fig.gca(projection='3d')
xvals = np.arange(-10,10,.5)
yvals = np.arange(-1,4,.1)
xs, ys, zs = [], [], []
for a in xvals:
    for b in yvals:
        zs.append(computeCost(np.array([[a], [b]]),X,y))
scat = ax.scatter(xs,ys,zs,c=np.abs(zs),cmap=plt.get_cmap('YlOrRd'))
plt.title('Cost Minimization Path',fontsize=30)
plt.plot([x[0] for x in betavalues],[x[1] for x in betavalues],costvec,'bo-')



Congratulations! You have learned the concepts behind building a linear regression model. You have also learned how to perform linear and multiple linear regression using scikit-learn libraries, and also reviewed an advanced section to compute cost function and gradient descent.


Your form submission has failed.

This may have been caused by one of the following:

  • Your request timed out
  • A plugin/browser extension blocked the submission. If you have an ad blocking plugin please disable it and close this message to reload the page.