In this tutorial, you will learn about the linear regression model, using Cloudera Machine Learning (CML), an experience on Cloudera Data Platform (CDP). If you are aspired to become a data scientist, linear regression is the first algorithm you need to master. Linear regression is very simple to understand, and it is a very powerful algorithm that is used today by many firms to help with decision making.
Despite it being a very simple algorithm to start with, the optimization of the model is very important to gain higher accuracy. In this tutorial, you will learn important concepts of regression analysis using Python. By the end of this tutorial you’ll be able to understand how to build, visualize, and evaluate models on your own. We will be focusing on simple linear regression and multiple linear regression models in this tutorial.
To better understand this tutorial, you should have a basic knowledge of statistics and linear algebra.
You should have Python3 session setup in CML before you start implementing the code.
Understanding concepts behind linear regression
Implementation of linear regression using scikit-learn
Advanced section: mathematical approach
1.Understanding concepts behind linear regression
Linear regression is a technique for estimating linear relationships between various features and a continuous target variable. Regression means estimating a continuous real-value output. For example, if you have data that contains selling prices of houses in your city, you can estimate the selling price of your house based on that data and understand the market. Regression analysis is a subfield of Supervised Learning. Some of the questions that regression can answer if you are dealing with housing data are as follows:
How much more can I sell my house for with an additional bedroom and bathroom?
Do houses located near malls sell for more or less than others?
What is the impact of lot size on housing prices?
Let’s understand simple linear regression using an example: Suppose you have the data for employees’ years of experience and their corresponding salaries. In order to perform analysis on this data, you need to make sure you collect the right data, clean the data, and transform the data. It is very important to perform exploratory data analysis to observe the distribution of the data and analyze the patterns. Fig 1 shows the plotted employee data, in which the x axis corresponds to years of experience, and the y axis corresponds to salary. Each point (x,y) denotes a training example. Let’s consider ‘t’ to be the total number of examples in the dataset.
You are going to fit a line in the form of y = β0 + β1x to your data. Here we call x the independent variable or input variable; y is called the dependent variable or output variable. β1 represents the slope of the line; β0 represents the intercept of the line. The goal of the algorithm is to find a line that best fits the set of points to minimize the error (the error is the distance from the line to the points). The line moves according to the parameter values (β0,β1).
Key terminology to understand when building a linear model:
Hypothesis function refers to the equation h(x)= β0 + β1x which maps input x (years of experience) to target y (salary). Now the question is how do we find the best parameters to fit the line with minimum error? The concepts below help us to answer this question.
Cost Function: The cost function determines how to fit the best line to the data. For hypothesis function h(x) = β0 + β1x choose β0,β1 so that h(x) is close to y for training examples (x,y). This represents the optimization problem to solve. Regression uses ordinary least squares methodology which is the summation of the squared difference between actual value(y) and the predicted value(y') which can be denoted as ∑(y - y')² for all training examples t in the dataset to find the best possible value of the regression coefficients β0, β1. You might question why we are squaring the error instead of the absolute value: because squared error has nice mathematical properties that make it easier to differentiate and compute gradient descent, and the least squares method is easy to analyze and interpret. The cost function for linear regression is represented as:
1/(2t) ∑([h(x) - y']² for all training examples(t)
Here t represents the number of training examples in the dataset, h(x) represents the hypothesis function defined earlier ( β0 + β1x), and y' represents predicted value. The average is taken for the cost function for easy math calculations.
Gradient descent is one of the methods that can be used to reduce the error, which helps by taking steps in the direction of a negative gradient. Gradient descent is an optimization algorithm that tweaks its parameters iteratively. In machine learning, gradient descent is used to update parameters in a model. Let us relate gradient descent with a real-life analogy for better understanding. Think of a valley you would like to descend, you first will take a step and look for the slope of the valley, whether it goes up or down. Once you are sure of the downward slope you will follow that and repeat the step again and again until you have descended completely (or reached the minima).Gradient descent does exactly the same.
Learning rate: Gradient descent is vector valued function hence it has both magnitude and dimension. The algorithm multiples the gradient by a number(learning rate) to determine the next point to take a step. For example having a gradient with a magnitude 5.3 and learning rate 0.01 then the gradient descent algorithm will pick the next point 0.053 away from the previous point. Choosing the learning rate is an important step as big learning rate can skip the minima and very small learning rate can take infinite time to reach the minima as shown in the below figure.
2. Implementation of linear regression using scikit-learn
Cloudera Machine Learning (CML) is a secure enterprise data science platform that enables data scientists to accelerate their workflow from exploration to production by providing them with their own analytics pipelines. CML allows data scientists to utilize already existing skills and tools, such as Python, R, and Scala, to run computations in Hadoop clusters. If you are new to CML, feel free to check out Tour of Data Science Work Bench to start using it and to set up your environment.
CML allows you to run your code as a session or a job. A session is a way to interpret your code interactively, whereas a job allows you to execute your code as a batch process and can be scheduled to run recursively.
In order for us to use the Python script needed for this tutorial, select a Python 3 engine with this resource allocation configuration:
2 GB Memory
We can use either a Jupyter Notebook as our editor or a Workbench: feel free to choose your favorite.
To finalize set-up, select the Launch Session option as shown in the below figure.
Once the session is launched, import all the necessary libraries as shown below. We are using a Boston housing dataset with the help of scikit-learn datasets that are easily available to import and load.
import numpy as np import pandas as pd from sklearn.model_selection import train_test_split from sklearn.datasets import load_boston from sklearn.metrics import mean_squared_error, r2_score
Next, load the dataset
As our example use case, we will build a supervised learning model that predicts housing prices in Boston. Every row in the dataset has 14 attributes .
boston = load_boston() boston.keys() print (boston.DESCR) # Gives the clear information about the columns
Use the below code to print out the shape of the data and features
boston.data.shape print (boston.feature_names)
There are 506 rows and 13 attributes in this dataset.
Next, print the first few rows to see the data using the below commands
boston_dataset = pd.DataFrame(boston.data) boston_dataset.columns = boston.feature_names boston_dataset.head()
In this section we are predicting PRICE of the house. Hence we are treating PRICE as target/output variable
boston_dataset['PRICE'] = boston.target boston_dataset.head()
Understand the distribution of the data using the below command
Exploratory Data Analysis:
Next, plot the data to understand the distribution using matplotlib. Visualization helps us to understand the correlation between various features in the dataset.
Below is code to visualize the relationship between crime rate and the house price
import numpy as np import pandas as pd import scipy.stats as stats import matplotlib.pyplot as plt import sklearn import seaborn as sns from matplotlib import rcParams plt.scatter(boston_dataset.CRIM, boston_dataset.PRICE) plt.xlabel("Per capita crime rate by town") plt.ylabel("Price of the house") plt.title("Relationship between crime rate and Price")
Observation: We can see that house prices are higher where crime rates are lower
Next, use the code below to observe relationship between rooms per dwelling and the house price
plt.scatter(boston_dataset.RM, boston_dataset.PRICE) plt.xlabel("Average number of rooms per dwelling") plt.ylabel("Price of the house") plt.title("Relationship between rooms per dwelling and Price")
Observation: We can see that house price has significant increase as the number of rooms per dwelling increases
Below is the code to visualize the relationship between pupil-teacher ratio and housing price
plt.scatter(boston_dataset.PTRATIO, boston_dataset.PRICE) plt.xlabel("Pupil-teacher ratio by town") plt.ylabel("Price of the house") plt.title("Relationship between PTRATIO and Price")
It’s now your turn to try building a few more visualizations using the other features in the dataset and observe the distribution.
Now let’s perform a simple linear regression on the dataset, considering rooms per dwelling as the input variable and housing price as the target/output variable. Split the data into training and testing set using scikit-learn train_test_split function. We are using 80% of the data for training and 20% for testing. Transform the data to fit into the model using the below code.
from sklearn.linear_model import LinearRegression X = boston_dataset.RM X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split( X, boston_dataset.PRICE, test_size=0.2, random_state = 5) print (X_train.shape) print (X_test.shape) print (Y_train.shape) print (Y_test.shape) x_train_new= X_train.reshape(-1,1) y_train_new= Y_train.reshape(-1,1) x_test_new= X_test.reshape(-1,1) y_test_new= Y_test.reshape(-1,1)
Instantiate linear regression object using the below code
linear_model = LinearRegression()
Next, fit the data to the linear regression model created earlier
Next, you can print the coefficient of the intercept used by the model for estimation
print ('Estimated intercept coefficient:', linear_model.intercept_)
Next, you can use testing set data to make predictions by the trained model
pred = linear_model.predict(x_test_new)
After we fit the model and make predictions used by the trained model, it’s time to observe the results of the model. In this use case we are going to calculate Mean absolute error(A measure of difference between two continuous variables. It is robust against the effect of outliers) , Mean squared error ( The average squared difference between the estimated values and the actual value. The MSE is a measure of the quality of an estimator) and R2 score(This metric describes the percentage of variance explained by the model. It ranges between 0 and 1 ). Use the below code to print the results.
print("Mean absolute error: %.2f" % np.mean(np.absolute(pred - y_test_new))) print("Residual sum of squares (MSE): %.2f" % np.mean((pred - y_test_new) ** 2)) print("R2-score: %.2f" % r2_score(pred , y_test_new) )
As you have now understood how simple linear regression is performed, let’s perform multiple linear regression.
Implementation of multiple linear regression
Input features being used: we are using all the columns except the target variable PRICE.
Let’s define input and output variables to fit the model. Again, we are using 80% of the data for training and 20% for testing.
X = boston_dataset.drop('PRICE', axis = 1) X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split( X, boston_dataset.PRICE, test_size=0.2, random_state = 5) # This creates a LinearRegression object linear_model = LinearRegression()
Next, fit the model and predict the output for testing set
linear_model.fit(X_train,Y_train) pred = linear_model.predict(X_test)
Next, we have to evaluate the results
Once the model is built it is important to evaluate the model using evaluation metrics. Similar to simple linear regression, we are again calculating MSE,MAE and R2-score.
print("Mean absolute error: %.2f" % np.mean(np.absolute(pred - Y_test))) print("Residual sum of squares (MSE): %.2f" % np.mean((pred - Y_test) ** 2)) print("R2-score: %.2f" % r2_score(pred , Y_test) )
You can observe significant improvement in the R2 score by applying multiple linear regression. It is rare that a dependent variable is explained by only one variable. Hence analysts make use of multiple variables to make better decisions based on the use case.
Additional evaluation metrics which can be used:
Adjusted R²: R square assumes that every single variable explains the variation in the dependent variable. The adjusted R square tells you the percentage of variation explained by the independent variables that actually affect the dependent variable.A model that includes several predictors will return higher R square values and may seem to be a better fit. However, this result is due to it including more terms.
The adjusted R-squared compensates for the addition of variables and only increases if the new predictor enhances the model above what would be obtained by probability. Conversely, it will decrease when a predictor improves the model less than what is predicted by chance.
Root Mean Square Error (RMSE): Interpreted as how far, on average, the residuals are from zero. It nullifies the squared effect of MSE by square root and provides the result in original units as data. Again, the lower the better.
3. Advanced section: mathematical approach
In this section we will implement the cost function and gradient descent using the mathematical approach. As you have already understood how cost function and gradient descent work, let’s see how to implement in Python. We are using the simple employee dataset which consists of employee years of experience and their corresponding salaries. Download the dataset from here.
Import necessary libraries using below code:
import numpy as np import matplotlib.pyplot as plt import numpy as np import pandas as pd from numpy import genfromtxt
Load the data and define input and output variables. Here we are treating years of experience as X and salary as y
my_data = genfromtxt('Salary_Data.csv', delimiter=',',skip_header=1 ) X = np.c_[np.ones(my_data.shape),my_data[:,0]] y = np.c_[my_data[:,1]] m = y.size
Defining the hypothesis function (assumes simplified hypothesis with β0=0 for easy visualization):
def h(beta,X): return np.dot(X,beta)
Defining the cost function:
t represents number of training examples
(Xi, yi) denotes training example
def computeCost(val_beta,X,y): return float((1./(2*m)) * np.dot((h(val_beta,X)-y).T,(h(val_beta,X)-y)))
Here the intial_beta is assigned to zero (both β0,β1)
initial_beta = np.zeros((X.shape,1)) print (computeCost(initial_beta,X,y))
Implementation of gradient descent to perform iterations obtaining global minima
Gradient descent algorithm works as below. We will follow the algorithm for each example in the dataset (t) and perform simultaneous update on both parameters.
In each iteration we will reset βj to βj minus the learning rate (α) times the partial derivative of β vector with respect to βj . Below algorithm is used repetitively to find the global minima.
Let’s define number of iterations and learning rate(alpha). As shown in the above figure, we follow the gradient descent algorithm using the below code to update parameters.
iterations = 500 lr= 0.01 #Initial parameters are set to 0. iterations= 500 alpha= 0.01 def descendGradient(X, beta_start = np.zeros(2)): beta = beta_start costvec =  #Used to plot cost as function of iteration betavalues =  for val in range(iterations): tmpbeta = beta costvec.append(computeCost(beta,X,y)) betavalues.append(list(beta[:,0])) #Simultaneously updating theta values for j in range(len(tmpbeta)): tmpbeta[j] = beta[j] - (alpha/m)*np.sum((h(beta,X) - y)*np.array(X[:,j]).reshape(m,1)) beta = tmpbeta return beta, betavalues, costvec initial_beta = np.zeros((X.shape,1)) beta, betavalues, costvec = descendGradient(X,initial_beta)
Fit the line with the parameter values obtained
def fit(xval): return beta + beta*xval plt.figure(figsize=(10,6)) plt.plot(X[:,1],y[:,0],'rx',markersize=10,label='Training Data') plt.plot(X[:,1],fit(X[:,1]),'b-',label = 'Hypothesis: h(x) = %0.2f + %0.2fx'%(beta,beta)) plt.grid(True) plt.ylabel('Years of experience') plt.xlabel('Salary') plt.legend()
As we have already calculated the parameter values using the gradient descent algorithm, below code helps to visualize cost minimization path followed by the algorithm
from mpl_toolkits.mplot3d import axes3d, Axes3D from matplotlib import cm import itertools fig = plt.figure(figsize=(20,20)) ax = fig.gca(projection='3d') xvals = np.arange(-10,10,.5) yvals = np.arange(-1,4,.1) xs, ys, zs = , ,  for a in xvals: for b in yvals: xs.append(a) ys.append(b) zs.append(computeCost(np.array([[a], [b]]),X,y)) scat = ax.scatter(xs,ys,zs,c=np.abs(zs),cmap=plt.get_cmap('YlOrRd')) plt.xlabel(r'$\beta_0$',fontsize=30) plt.ylabel(r'$\beta_1$',fontsize=30) plt.title('Cost Minimization Path',fontsize=30) plt.plot([x for x in betavalues],[x for x in betavalues],costvec,'bo-') plt.show()
Congratulations! You have learned the concepts behind building a linear regression model. You have also learned how to perform linear and multiple linear regression using scikit-learn libraries, and also reviewed an advanced section to compute cost function and gradient descent.