In this tutorial, you will learn about Support Vector Machine (SVM) using Cloudera Machine Learning (CML); an experience you get in Cloudera Data Platform (CDP). An SVM is a supervised algorithm that can classify different classes by finding a separator. It is a very useful algorithm when the data is not separable linearly. It can be used in a variety of applications, including intrusion detection, classification of emails, news articles, digit recognition, tumor identification, and so on. In this tutorial you will learn how to build an SVM model, how you can apply an SVM model, and about the underlying components of an SVM model. You will also get a chance to visualize and evaluate the results of an SVM model by using Python’s scikit-learn library. Finally, you will learn the basics of hyperparameter tuning while building an SVM model.
To better understand this tutorial, you should have a basic knowledge of statistics and linear algebra.
You should have a Python 3 session set up in CML before you start implementing the code.
Build SVM Model
A Support Vector Machine is a supervised machine learning model that can help classify different cases by finding a separator which is a hyperplane. In two-dimensional space, this hyperplane is a line dividing a plane into two parts, where each class lies on either one or the other side. SVMs can be applied both to regression and classification problems. An SVM has the capability to handle both continuous and discrete variables. For a dataset that is not linearly separable, as shown in Fig 1, the SVM maps the data to a high-dimensional space, which helps the model categorize the data into different classes. This process of construction of a hyperplane happens iteratively until the algorithm finds a maximum marginal hyperplane that best divides the dataset into distinct classes, where a margin is defined as the gap between two lines closest to the class points, as shown in Fig 2. The goal of the algorithm is to obtain the largest margin between two classes.
Fig1: An example of nonlinearly separable data
Fig 2: An example of an SVM algorithm finding best margins to separate two classes, A and B
The SVM algorithm takes an iterative approach to build different hyperplanes and chooses the one with the lowest classification error and maximum separation between nearest data points, as shown in Fig 3.
Fig 3: An example of an SVM algorithm iteratively drawing different lines to reduce classification error and maximize margin between two classes, A and B
Much of the benefit of SVMs comes from the fact that they are not restricted to being linear classifiers. SVMs use a technique known as the kernel, which makes them much more flexible by introducing various types of nonlinear decision boundaries. However, if in a feature space some of the sets are not linearly separable, then it is necessary to perform a mapping of the original feature space to a higher-dimensional space where the separation between groups is clear, as shown in Fig 4. In other words, it converts a nonlinearly separable problem into a separable problem by introducing another dimension to it.
Fig 4: Sample mapping describing how Fig 1 data is mapped onto a high-dimensional space by SVM to apply a decision boundary
Next, let’s consider some of the advantages and disadvantages of SVMs in general:
Effective for non linearly separable classes
Effective in cases where the number of dimensions is greater than the number of samples
Memory efficient, because SVMs use a subset of training points in the decision function (called support vectors)
Greater chance that support vectors perform very poorly for unseen examples when number of features exceeds the number of training examples, due to higher-dimensional space mapping
Poor performance when trained on noisy data (e.g., when target classes are overlapping)
Choosing right Kernel functions and regularization is crucial to avoid overfitting of data
SVMs do not directly provide probability estimates: these are calculated using the Platt scaling method, which is a way of transforming classification output into probability distribution. For example, if you’ve got the dependent variable as 0 & 1 in the training data set, you can convert it into probability using this method. For more details, check Scores and probabilities.
SVMs, due to their simplicity and explainability, can be used for image analysis tasks, such as image classification and handwritten digit recognition, and are also considered effective in the situation where fast compute requires a smaller set of resources, or in dealing with high-dimensional data such as text-mining tasks or gene expression data classification. Although neural networks have proved to be the best in achieving high accuracy, it is a matter of trade-offs: SVMs are less prone to overfitting, need less memory to store the predictive model, and have quick training and prediction time, whereas neural nets achieve state-of-the-art accuracy and can deal with huge amounts of data, but take a considerable amount of time. It is often important to consider the pros and cons of different models when applying to your use case. SVM can also be used for other types of machine learning problems, such as regression, outlier detection, and clustering.
Cloudera Machine Learning (CML) is a secure enterprise data science platform that enables data scientists to accelerate their workflow from exploration to production. CML allows data scientists to utilize already existing skills and tools, such as Python, R, and Scala, to run computations in Hadoop clusters. If you are new to CML, feel free to check out Tour of Data Science Work Bench to start using it and to set up your environment.
CML allows you to run your code as a session or a job. A session is a way to interpret your code interactively, whereas a job allows you to execute your code as a batch process and can be scheduled to run recursively.
In order for us to use the Python script needed for this tutorial, select a Python 3 engine with this resource allocation configuration:
2 GB Memory
0 GPU (It's okay if you don't have any, but it's great to know you can have them.)
We can use either a Jupyter Notebook as our editor or a Workbench: feel free to choose your favorite.
To finalize set-up, select the Launch Session option.
Step 1: Import necessary libraries required to run the program
#import all the necessary libraries from sklearn import datasets from sklearn import svm from sklearn.metrics import accuracy_score import numpy as np import matplotlib.pyplot as plt
Step 2: Import the dataset that will be used to build the model. We are using an Iris dataset that contains three classes of 50 instances; each refers to the type of iris plant. One class is linearly separable from the other two, whereas the latter two are not linearly separable from each other.
Data attributes: Sepal length in cm, sepal width in cm, petal width in cm, and petal length in cm.
Classes: Iris setosa, Iris versicolor, Iris virginica
# import data iris = datasets.load_iris()
Step 3: Exploratory data analysis
Observe the correlation between sepal length and sepal width. As you can see, applying linear regression to this kind of data is not effective and does not help to improve classification.
#function to visualize sepal data def visualize_sepal_data(): iris = datasets.load_iris() X = iris.data[:, :2] # we only take the first two features. y = iris.target plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm) plt.xlabel('Sepal length') plt.ylabel('Sepal width') plt.title('Sepal Width & Length') plt.show() visualize_sepal_data()
Fig 5: This plot describes the correlation between sepal width and sepal length. Three types of classes (Iris setosa, Iris versicolor, Iris virginica) are represented with three different colors (purple, green, yellow)
In Fig 5, you can observe a positive correlation between the sepal length and sepal width. Now let’s observe the correlation between the petal length and petal width of the flowers.
def visualize_petal_data(): iris = datasets.load_iris() X = iris.data[:, 2:] # we only take the last two features. y = iris.target plt.scatter(X[:, 0], X[:, 1], c=y) plt.xlabel('Petal length') plt.ylabel('Petal width') plt.title('Petal Width & Length') plt.show() visualize_petal_data()
Fig 6: This plot describes the correlation between petal width and petal length. Three types of classes (Iris setosa, Iris versicolor, Iris virginica) are represented with three different colors (purple, green, yellow)
In Fig 6, you can see that there is a strong positive correlation between petal length and petal width.
Step 4: Building an SVM model
Here as an example we are using first two features(sepal width and sepal length) to predict the type of class which the flower belongs to. You can also use last two features(petal length and petal width) to observe the results.
# Here as an example we are using sepal length and sepal width as features and target as species class X = iris.data[:, :2] y = iris.target #For using petal length and petal width as features uncomment the below code and use it in the place of above code #X = iris.data[:, 2:] #y = iris.target
Define a function to plot the SVM decision boundaries using various kernels.
def plotSVM(title): x_min, x_max = X[:, 0].min() -1, X[:, 0].max() + 1 y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1 h = (x_max / x_min)/100 xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h)) plt.subplot(1, 1, 1) Z = svc.predict(np.c_[xx.ravel(), yy.ravel()]) Z = Z.reshape(xx.shape) plt.contour(xx, yy, Z,alpha=0.8) plt.scatter(X[:, 0], X[:, 1], c=y) plt.xlabel("Sepal length") plt.ylabel("Sepal width") plt.xlim(xx.min(), xx.max()) plt.title(title) plt.show()
Define a function to observe how different types of kernels work on the dataset.
Kernel functions measure similarity between two data points. For example, in a text-processing task, a good kernel function would assign a high score to a pair of similar strings and a low score to a pair of dissimilar strings. Kernel parameters describe the hyperplane used to separate the data. To use a linear hyperplane, we will use the ‘linear’ parameter. For a nonlinear hyperplane, you could use either an ‘rbf’ or a ‘poly’ parameter.
Next, let’s define kernels and fit the model with different kernels.
kernels = ["linear", "rbf", "poly"] for kernel in kernels: svc = svm.SVC(kernel=kernel).fit(X, y) plotSVM("kernel=" + str(kernel))
Linear kernel is used when the data is linearly separable: that is, it can be separated using a single line. It is one of the most common kernels used in SVM. RBF and polynomial kernels are used especially for non-linearly-separable data to create a complex decision boundary based on a high-dimensional feature mapping, but they still have an efficient computation because of the kernel representation.
Fig 7: Linear kernel built on Iris dataset with features sepal width and sepal length
Fig 8: RBF kernel built on Iris dataset with features sepal width and sepal length
Fig 9: Poly kernel built on Iris dataset with features sepal width and sepal length
Step 5: Hyperparameter tuning
Let’s observe the different kernels using different gamma values.
Gamma is a parameter for nonlinear hyperplanes. The higher the gamma value, the “tighter” fit on the training dataset. The code below will help you understand how different gamma values affect the fit of the data using decision boundaries. Generally, high gamma value leads to overfitting on data.
gammas = [0.1, 1, 10, 100] for gamma in gammas: svc = svm.SVC(kernel='rbf', gamma=gamma).fit(X, y) plotSVM('gamma=' + str(gamma))
Fig 10: Decision boundaries built using gamma value 0.1
Fig 11: Decision boundaries built using gamma value 1
Fig 12: Decision boundaries built using gamma value 10
Fig 13: Decision boundaries built using gamma value 100
Observing the C value
C is the penalty parameter of the error term. It controls the trade-off between a smooth decision boundary and classifying the training points correctly.
Note: You can observe negligible differences using this tiny iris dataset, but for real-time datasets you work with, you can easily observe differences by tuning this parameter using different values.
cs = [0.1, 1, 10, 100, 1000] for c in cs: svc = svm.SVC(kernel='rbf', C=c).fit(X, y) plotSVM('C=' + str(c))
Fig 14: Decision boundaries built using C value 0.1
Fig 15: Decision boundaries built using C value 10
Fig 16: Decision boundaries built using C value 100
Fig 17: Decision boundaries built using C value 1000
Step 6: Calculate the accuracy of the predictions using code below
lin_svc = svm.SVC(kernel='linear').fit(X, y) predictions = lin_svc.predict(iris.data[:, :2]) accuracy_score(predictions, iris.target)
The accuracy score helps to find out how the model is performing for the unseen data. You can observe about 81.9% accuracy for the testing set with the linear SVM kernel. Keep an eye on the Community articles page for future SVM articles on using different kernels for different use cases to improve the model’s performance.
Congratulations! You have learned core concepts and key parameters behind Support Vector Machine models. We’ve written sample code in Python’s scikit-learn library on Cloudera Machine Learning, visualizing key hyperparameters affecting SVM model performance.