scikit-learn Simple Classification

February 2, 2018

This post looks at how to build a simple classification model with the Python machine learning library scikit-learn. Building a simple classification model is fairly easy with scikit-learn, and this post explores some of the default behaviour and sign-posts some extra work that we would want to do to ensure robust predictions.

I’ve tried to strip the code to a minimum to keep things simple. The main steps are:

  • Get data – features and target
  • Create an instance of an sklearn classifier
  • Train the classifier using the data
  • Output predicted classifications

Steps for validation and optimisation have been left out.

Some Sample Data

I’m going to use one of the sample datasets that come with scikit-learn to run a simple classification. The breast cancer dataset is a good example for looking at binary classification.

# Get sample dataset from sklearn datasets
from sklearn import datasets
cancer = datasets.load_breast_cancer()

The sample datasets bundled with sklearn are useful for trying things out. If you have your own data then you can substitute it for the X and y variables in the following section.

A ‘real world’ data set is likely to need further preparation and cleaning, such as:

  • Handling categorical data
  • Cleaning data values
  • Standardising column names

Simple Classification

The following is a bare-bones example of classification with scikit-learn. I am using a random forest classifier, but you could change the code to try out other classifiers too.

The breast cancer example data are used as the X and y variables. X is by convention used to represent the ‘feature’ data for the classification model – the characteristics that describe different cases or patients. y is used to represent the target variable – in this case whether the case was found to be cancerous or not.

In scikit-learn X is a numpy array of shape (m, n), where m is the number of observations and n is the number of features in the feature data. In the breast cancer example m = 569 and n = 30. y is a 1d numpy array of shape (m,). As before, m = 569. If you are not familiar with numpy 1d and nd arrays you may want to read up on them [1], [2].
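You can confirm these shapes directly on the sample data:

```python
from sklearn import datasets

cancer = datasets.load_breast_cancer()
print(cancer.data.shape)    # 2d feature array: (m, n) = (569, 30)
print(cancer.target.shape)  # 1d target array: (m,) = (569,)
```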

This code outputs the first few actual and predicted classes so you can get an idea of how well the classifier is performing.

from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

# Use the breast cancer sample data as the feature and target arrays
cancer = datasets.load_breast_cancer()
X = cancer.data    # (m, n) numpy array of features
y = cancer.target  # (m,) numpy array of target classes

# Create an instance of the classifier we want to use
clf = RandomForestClassifier()

# Train the classifier using the data
clf.fit(X, y)

# Output predicted classifications
preds = clf.predict(X)

print(preds[:5])  # Predicted classes
print(y[:5])      # Actual classes

pred_proba = clf.predict_proba(X)
print(pred_proba[:5, 0])  # Probability of class zero
print(pred_proba[:5, 1])  # Probability of class one


Taking Things Further

This post has shown only a simple classification approach. There are a number of other things you would probably want to include to check that your classification predictions are accurate and to improve model performance.

These refinements include (but are not limited to):
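One such refinement is evaluating the model on held-out data rather than the same data it was trained on, as done above. A minimal sketch using scikit-learn’s train_test_split (the random_state values here are arbitrary, chosen only for reproducibility):

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

cancer = datasets.load_breast_cancer()
X, y = cancer.data, cancer.target

# Hold out 25% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Train only on the training split
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

# Score on data the model has not seen during training
print(accuracy_score(y_test, clf.predict(X_test)))
```

Accuracy on the test split gives a more honest picture of performance than accuracy on the training data, which for a random forest is typically close to perfect.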