This post looks at how to build a simple classification model with the Python machine learning library scikit-learn. Building a simple classification model is fairly easy with scikit-learn, and this post explores some of the default behaviour and sign-posts some extra work that we would want to do to ensure robust predictions.
I’ve tried to strip the code to a minimum to keep things simple. The main steps are:
- Get data – features and target
- Create an instance of an sklearn classifier
- Train the classifier using the data
- Output predicted classifications
Steps for validation and optimisation have been left out.
Some Sample Data
I’m going to use one of the sample datasets that come with scikit-learn to run a simple classification. The breast cancer dataset is a good example for looking at binary classification.
```python
# Get sample dataset from sklearn datasets
from sklearn import datasets

cancer = datasets.load_breast_cancer()
```
These sample datasets with sklearn are useful for trying out things. If you have your own data then you can substitute them for the X and y variables in the following section.
A ‘real world’ data set is likely to need further preparation and cleaning, such as:
- Handling categorical data
- Cleaning data values
- Standardising column names
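As a rough sketch of what that preparation might look like, here is a small pandas example. The DataFrame and its column names are entirely hypothetical, invented to illustrate the three points above; real data would need choices appropriate to its own quirks.

```python
import pandas as pd

# Hypothetical messy data illustrating the issues listed above
df = pd.DataFrame({
    "Tumour Size": [2.1, 3.4, None, 1.8],   # missing value to clean
    "Hospital": ["A", "B", "A", "C"],       # categorical column
    "Outcome ": [0, 1, 1, 0],               # stray space in the name
})

# Standardise column names: strip whitespace, lower-case, underscores
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

# Clean data values: fill missing numeric values with the column median
df["tumour_size"] = df["tumour_size"].fillna(df["tumour_size"].median())

# Handle categorical data: one-hot encode the hospital column
df = pd.get_dummies(df, columns=["hospital"])

print(df.columns.tolist())
```

How you handle each step (imputation strategy, encoding scheme) depends on the data and the model, so treat this as one possible approach rather than a recipe.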
Simple Classification
The following is a bare-bones example of classification with scikit-learn. I am using a random forest classifier, but you could change the code to try out other classifiers too.
The breast cancer example data are used as the X and y variables. X is by convention used to represent the ‘feature‘ data for the classification model – the characteristics that describe different cases or patients. y is used to represent the target variable – in this case whether the case was found to be cancerous or not.
In scikit-learn X is a numpy array of shape (m, n), where m is the number of observations and n is the number of features in the feature data. In the breast cancer example m = 569 and n = 30. y is a 1d numpy array of shape (m,). As before m = 569. If you are not familiar with numpy 1d and nd arrays you may want to read up on them [1], [2].
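The shapes described above can be checked directly on the sample dataset:

```python
from sklearn import datasets
import numpy as np

cancer = datasets.load_breast_cancer()
X = cancer.data    # feature data
y = cancer.target  # target labels

print(X.shape)       # (569, 30) – 569 observations, 30 features
print(y.shape)       # (569,) – a 1d array, one label per observation
print(np.unique(y))  # [0 1] – two classes, so a binary problem
```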
This code outputs the first few actual and predicted classes so you can get an idea for how well the classifier is performing.
```python
from sklearn.ensemble import RandomForestClassifier

X = cancer.data    # (m, n) numpy array
y = cancer.target  # (m,) numpy array

# Create an instance of the classifier we want to use
clf = RandomForestClassifier()

# Train the classifier using the data
clf.fit(X, y)

# Output predicted classifications
preds = clf.predict(X)
print(preds[:5])  # Predicted classes
print(y[:5])      # Actual classes

# Predicted class probabilities
pred_proba = clf.predict_proba(X)
print(pred_proba[:5, 0])  # Probability of zero
print(pred_proba[:5, 1])  # Probability of one
```
Taking Things Further
This post has shown only a simple classification approach. There are a number of other things you would probably want to do to check that your classification predictions are accurate and to improve model performance.
These refinements include (but are not limited to):
- Optimising the decision threshold
- Tuning hyperparameters
- Selecting the best classifier algorithm
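As a minimal sketch of where those refinements might start, the example below holds out a test set and tunes two random forest hyperparameters with cross-validated grid search. The parameter grid and split sizes here are illustrative choices, not recommendations.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

cancer = datasets.load_breast_cancer()
X, y = cancer.data, cancer.target

# Hold out a test set so accuracy is measured on unseen data,
# rather than on the same data the model was trained on
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Tune a couple of hyperparameters with 5-fold cross-validation
# (this tiny grid is purely illustrative)
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)           # best hyperparameter combination
print(search.score(X_test, y_test))  # accuracy on the held-out data
```

Optimising the decision threshold would build on `predict_proba` from the earlier example, choosing a cut-off other than the default 0.5 to trade off false positives against false negatives.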