Classification with Scikit-Learn


Scikit-Learn continues to be one of the most popular machine learning libraries in Python. It provides ready-made modules for implementing most machine learning algorithms. These are very simple to use, yet extremely configurable.

Like most machine learning code, Scikit-Learn is concept-heavy and code-light. The code itself is trivially small - just a couple of lines will do the job. But it is very important to understand the core concepts before we can use it meaningfully. To get a feel for how it works, let us go through a classification problem.

The Iris Dataset


This is a small data set of just 150 records. That may not be enough for solving a real-life problem, but because of its small size it is often quoted in the academic world. It deals with the classification of a flower called Iris. There are three major species of this plant - Iris Setosa, Iris Versicolor and Iris Virginica. The Iris data set measures the petals and sepals of the flowers to provide data for a classification.

For 150 samples of flowers of these species, it provides the following data:

  • Sepal length in cm
  • Sepal width in cm
  • Petal length in cm
  • Petal width in cm
  • Species

A quick search on Google for the Iris Data Set turns up many copies of this data, referenced all over the machine learning literature. We can download the data set as a CSV file to our disk, so that we can play around with it.
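
Alternatively, recent versions of Scikit-Learn ship a copy of this data set, so we can load it without hunting for a CSV at all. A quick sketch of that route (the column names differ slightly from the CSV file used in the rest of this article):

from sklearn.datasets import load_iris

# as_frame=True returns the data as a Pandas DataFrame (available in recent versions)
iris = load_iris(as_frame=True)
print(iris.frame.head())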

Analyze the Data


Before jumping into the implementation of the machine learning model, it is always good practice to peek into the data (at least a small sample) to understand how it looks. This can be a great help in deciding which algorithm could be more suitable. So, let us work on that.

Let us start by importing the necessary Python modules:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

We need to configure Seaborn and Matplotlib so that we can see the output in the Jupyter Notebook that we use for the work. We can also set the Seaborn palette to get the colors of our choice:

sns.set_palette('husl')
%matplotlib inline

With this in place, let us load the Iris data into a Pandas DataFrame. We need to give the appropriate path to the CSV file so that it can be loaded correctly.

data = pd.read_csv('Iris.csv')

The next step is to look at the data. 150 records may be a manageable count, but in a typical problem we can expect a lot more than that. We cannot read through all the records, but peeking into the first few is quite helpful.

data.head()
   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0   1            5.1           3.5            1.4           0.2  Iris-setosa
1   2            4.9           3.0            1.4           0.2  Iris-setosa
2   3            4.7           3.2            1.3           0.2  Iris-setosa
3   4            4.6           3.1            1.5           0.2  Iris-setosa
4   5            5.0           3.6            1.4           0.2  Iris-setosa

Note that the Id column comes from the file, while the serial number on the left is the index generated by the Pandas DataFrame.

Next, we can check out the data in a wider scope. What are the data types? Do we have any missing values? With this information, we can take some implementation-level decisions.

data.info()

This generates output:


RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
Id               150 non-null int64
SepalLengthCm    150 non-null float64
SepalWidthCm     150 non-null float64
PetalLengthCm    150 non-null float64
PetalWidthCm     150 non-null float64
Species          150 non-null object
dtypes: float64(4), int64(1), object(1)
memory usage: 7.1+ KB

Next, we can look up some statistical information about the available data. What kind of values do we have? What are the mean, standard deviation, min/max values, percentiles and quartiles? These give us some more ideas about how the model should look.

data.describe()

This gives us a table:

               Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
count  150.000000     150.000000    150.000000     150.000000    150.000000
mean    75.500000       5.843333      3.054000       3.758667      1.198667
std     43.445368       0.828066      0.433594       1.764420      0.763161
min      1.000000       4.300000      2.000000       1.000000      0.100000
25%     38.250000       5.100000      2.800000       1.600000      0.300000
50%     75.500000       5.800000      3.000000       4.350000      1.300000
75%    112.750000       6.400000      3.300000       5.100000      1.800000
max    150.000000       7.900000      4.400000       6.900000      2.500000

We can also check the number of samples of each type. Do we have enough representation of each type? If a type is not sufficiently represented in the data set, the model will naturally be biased against it.

data['Species'].value_counts()

We get the output:

Iris-virginica     50
Iris-versicolor    50
Iris-setosa        50
Name: Species, dtype: int64

Note that this is an academic data set and so everything is just what it should be! In a real scenario, we will never get things laid out so well.

Plotting the Data


Numbers alone do not say enough. To get a better feel of the data and dig deeper, we can use plots - visual representations that help us develop a mental picture of the data. In a real case study, we would work on a small sample of the available data, but in this case we can use the entire data set.

Pair Plot


Pair plots are a helpful way to identify how the different input features are related to each other, and to the final outcome. Here, we get a graphical representation of each combination of two input features:

tmp = data.drop('Id', axis=1)
g = sns.pairplot(tmp, hue='Species', markers='+')
plt.show()

Note that some combinations give a clear separation of the data into different classes, while others are mixed up. This gives us a hint about which features are enough to do the job. In case of performance issues or extreme overfitting, we might have to drop certain features to make things work better; pair plots can easily point out the redundant ones.
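
For instance, if the pair plot suggested that the petal measurements alone separate the three species well, we could experiment with a reduced feature set along these lines (a sketch only - whether dropping columns actually helps has to be verified against the model's accuracy later):

# Keep only the petal measurements as input features (illustrative choice)
petal_features = data[['PetalLengthCm', 'PetalWidthCm']]
print(petal_features.head())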

Violin Plots


Another useful representation is the violin plot. This helps us identify how the data is laid out with respect to a given input feature. For example, if we plot the data by the sepal length:

g = sns.violinplot(y='Species', x='SepalLengthCm', data=data, inner='quartile')
plt.show()

The dotted lines give us an idea about the quartiles (25%, 50% and 75%).

Similarly, we can check out the violin plots for the other features:

g = sns.violinplot(y='Species', x='SepalWidthCm', data=data, inner='quartile')
plt.show()
g = sns.violinplot(y='Species', x='PetalLengthCm', data=data, inner='quartile')
plt.show()
g = sns.violinplot(y='Species', x='PetalWidthCm', data=data, inner='quartile')
plt.show()

Classification


Having developed an idea of the available data, we can now jump into building the classifier. Based on the analysis of the data and the plots we saw above, we can note that Iris-setosa is reasonably well separated from the others, while the other two species remain quite close to each other unless we consider the petal as well as the sepal data. The former suggests a good case for decision trees, and the latter suggests regression.

In this case, we can try out many different algorithms and see how they perform. Scikit-Learn allows us to configure hyperparameters for each of these algorithms - which can greatly impact the performance. But for now, we will just use the defaults and see how this works.
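
Just as an illustration of what configuring hyperparameters looks like, here is a small sketch (the values below are arbitrary examples, not tuned for this problem):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# n_neighbors and weights are hyperparameters of KNN; these values are only illustrative
knn = KNeighborsClassifier(n_neighbors=3, weights='distance')

# C and kernel are hyperparameters of the SVM; again illustrative, not tuned
svc = SVC(C=0.5, kernel='linear')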

First, we need to rearrange and split the data to get things set up. Let us start by importing the necessary modules:

from sklearn import metrics
from sklearn.model_selection import train_test_split

Now we should separate the input and output values. The sepal/petal lengths and widths are the input features; the species is the output.

X = data.drop(['Id', 'Species'], axis=1)
y = data['Species']

Next, we need to split the data into the train and test sets.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=5)

Let us now go through the individual algorithms:

  • Logistic Regression
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X_train, y_train)
print('Accuracy on the training subset: {:.3f}'.format(clf.score(X_train, y_train)))
print('Accuracy on the test subset: {:.3f}'.format(clf.score(X_test, y_test)))

This reports a training set accuracy of 0.944 and a test set accuracy of 0.933.
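
Accuracy alone does not tell us which classes get confused with which. Since we imported sklearn.metrics above, here is a quick sketch of how we could dig a little deeper into any of these classifiers, right after fitting it (the resulting numbers are not reproduced here):

y_pred = clf.predict(X_test)
print(metrics.confusion_matrix(y_test, y_pred))
print(metrics.classification_report(y_test, y_pred))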

  • Nearest Neighbor
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
print('Accuracy on the training subset: {:.3f}'.format(clf.score(X_train, y_train)))
print('Accuracy on the test subset: {:.3f}'.format(clf.score(X_test, y_test)))

This reports a training set accuracy of 0.967 and a test set accuracy of 0.967.

  • Decision Tree
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf.fit(X_train, y_train)
print('Accuracy on the training subset: {:.3f}'.format(clf.score(X_train, y_train)))
print('Accuracy on the test subset: {:.3f}'.format(clf.score(X_test, y_test)))

This reports a training set accuracy of 1.000 and a test set accuracy of 0.950.

  • Random Forest
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
print('Accuracy on the training subset: {:.3f}'.format(clf.score(X_train, y_train)))
print('Accuracy on the test subset: {:.3f}'.format(clf.score(X_test, y_test)))

This reports a training set accuracy of 0.978 and a test set accuracy of 0.967. Note that the Decision Tree model was badly overfitting the data, while the Random Forest gave much better results.
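
If we wanted to stick with a single decision tree, one common way to rein in its overfitting is to limit the depth of the tree. A minimal sketch (max_depth=3 is an arbitrary illustrative value, not tuned for this data):

# Restricting the depth keeps the tree from memorizing the training set
clf = tree.DecisionTreeClassifier(max_depth=3)
clf.fit(X_train, y_train)
print('Accuracy on the test subset: {:.3f}'.format(clf.score(X_test, y_test)))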

  • Support Vector Machine
from sklearn import svm
clf = svm.SVC()
clf.fit(X_train, y_train)
print('Accuracy on the training subset: {:.3f}'.format(clf.score(X_train, y_train)))
print('Accuracy on the test subset: {:.3f}'.format(clf.score(X_test, y_test)))

This reports a training set accuracy of 0.978 and a test set accuracy of 0.983.

  • Stochastic Gradient Descent
from sklearn.linear_model import SGDClassifier
clf = SGDClassifier()
clf.fit(X_train, y_train)
print('Accuracy on the training subset: {:.3f}'.format(clf.score(X_train, y_train)))
print('Accuracy on the test subset: {:.3f}'.format(clf.score(X_test, y_test)))

This reports a training set accuracy of 0.689 and a test set accuracy of 0.683.

This is pathetic! Also, if we run this multiple times, we notice a huge variation in the results. We can expect this because the data set is too small. SGD is very good for high-volume streaming data - it does not need all the data at once, but it does need a good amount of data.
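
SGDClassifier is also sensitive to the scale of the input features, so it usually helps to standardize them first. A sketch of doing that with a pipeline (the accuracy it produces is not part of the results quoted above):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize the features before feeding them to SGD
clf = make_pipeline(StandardScaler(), SGDClassifier())
clf.fit(X_train, y_train)
print('Accuracy on the test subset: {:.3f}'.format(clf.score(X_test, y_test)))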

  • Gaussian Processes
from sklearn.gaussian_process import GaussianProcessClassifier
clf = GaussianProcessClassifier()
clf.fit(X_train, y_train)
print('Accuracy on the training subset: {:.3f}'.format(clf.score(X_train, y_train)))
print('Accuracy on the test subset: {:.3f}'.format(clf.score(X_test, y_test)))

This reports a training set accuracy of 0.944 and a test set accuracy of 0.967.

  • Multi-Layer Perceptron
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier()
clf.fit(X_train, y_train)
print('Accuracy on the training subset: {:.3f}'.format(clf.score(X_train, y_train)))
print('Accuracy on the test subset: {:.3f}'.format(clf.score(X_test, y_test)))

This reports a training set accuracy of 0.989 and a test set accuracy of 0.950.

Please note that these are sample evaluations on a very small data set, so not every observation would hold in a real-life scenario.
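
One way to reduce the dependence on a particular train/test split - especially with such a small data set - is cross validation, which averages the accuracy over several splits. A sketch using cross_val_score (5 folds is just a common choice, and the SVC here is only an example):

from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Average the accuracy over 5 different train/test splits
scores = cross_val_score(SVC(), X, y, cv=5)
print('Cross-validated accuracy: {:.3f} +/- {:.3f}'.format(scores.mean(), scores.std()))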