This post looks at how to build a simple classification model with the Python machine learning library scikit-learn. Building a simple classification model is fairly easy with scikit-learn, and this post explores some of the default behaviour and sign-posts some extra work that we would want to do to ensure robust predictions.
I’ve tried to strip the code to a minimum to keep things simple. The main steps are:
- Get data – features and target
- Create an instance of an sklearn classifier
- Train the classifier using the data
- Output predicted classifications
Steps for validation and optimisation have been left out.
Some Sample Data
I’m going to use one of the sample datasets that come with scikit-learn to run a simple classification. The breast cancer dataset is a good example for looking at binary classification.
```python
# Get sample dataset from sklearn datasets
from sklearn import datasets

cancer = datasets.load_breast_cancer()
```
These sample datasets with sklearn are useful for trying out things. If you have your own data then you can substitute them for the X and y variables in the following section.
A ‘real world’ data set is likely to need further preparation and cleaning, such as:
- Handling categorical data
- Cleaning data values
- Standardising column names
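As a rough sketch of what that preparation might look like, here is a small pandas example. The DataFrame and its column names are entirely hypothetical, invented to illustrate the three points above; real data would need choices appropriate to its own quirks.

```python
import pandas as pd

# Hypothetical messy data illustrating the issues listed above
df = pd.DataFrame({
    "Tumour Size": [2.1, 3.4, None, 1.8],   # missing value to clean
    "Hospital": ["A", "B", "A", "C"],       # categorical column
    "Outcome ": [0, 1, 1, 0],               # stray space in the name
})

# Standardise column names: strip whitespace, lower-case, underscores
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

# Clean data values: fill missing numeric values with the column median
df["tumour_size"] = df["tumour_size"].fillna(df["tumour_size"].median())

# Handle categorical data: one-hot encode the hospital column
df = pd.get_dummies(df, columns=["hospital"])

print(df.columns.tolist())
```

How you handle each step (imputation strategy, encoding scheme) depends on the data and the model, so treat this as one possible approach rather than a recipe.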
Simple Classification
The following is a bare-bones example of classification with scikit-learn. I am using a random forest classifier, but you could change the code to try out other classifiers too.
The breast cancer example data are used as the X and y variables. X is by convention used to represent the ‘feature‘ data for the classification model – the characteristics that describe different cases or patients. y is used to represent the target variable – in this case whether the case was found to be cancerous or not.
In scikit-learn X is a numpy array of shape (m, n), where m is the number of observations and n is the number of features in the feature data. In the breast cancer example m = 569 and n = 30. y is a 1d numpy array of shape (m,). As before m = 569. If you are not familiar with numpy 1d and nd arrays you may want to read up on them [1], [2].
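The shapes described above can be checked directly on the sample dataset:

```python
from sklearn import datasets
import numpy as np

cancer = datasets.load_breast_cancer()
X = cancer.data    # feature data
y = cancer.target  # target labels

print(X.shape)       # (569, 30) – 569 observations, 30 features
print(y.shape)       # (569,) – a 1d array, one label per observation
print(np.unique(y))  # [0 1] – two classes, so a binary problem
```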
This code outputs the first few actual and predicted classes so you can get an idea for how well the classifier is performing.
```python
from sklearn.ensemble import RandomForestClassifier

X = cancer.data    # (m, n) numpy array
y = cancer.target  # (m,) numpy array

# Create an instance of the classifier we want to use
clf = RandomForestClassifier()

# Train the classifier using the data
clf.fit(X, y)

# Output predicted classifications
preds = clf.predict(X)
print(preds[:5])  # Predicted classes
print(y[:5])      # Actual classes

# Predicted class probabilities
pred_proba = clf.predict_proba(X)
print(pred_proba[:5, 0])  # Probability of zero
print(pred_proba[:5, 1])  # Probability of one
```
Taking Things Further
This post has shown only a simple classification approach. There are a number of other things you would probably want to do to check that your classification predictions are accurate and to improve model performance.
These refinements include (but are not limited to):
- Optimising the decision threshold
- Tuning hyperparameters
- Selecting the best classifier algorithm
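As a minimal sketch of where those refinements might start, the example below holds out a test set and tunes two random forest hyperparameters with cross-validated grid search. The parameter grid and split sizes here are illustrative choices, not recommendations.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

cancer = datasets.load_breast_cancer()
X, y = cancer.data, cancer.target

# Hold out a test set so accuracy is measured on unseen data,
# rather than on the same data the model was trained on
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Tune a couple of hyperparameters with 5-fold cross-validation
# (this tiny grid is purely illustrative)
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)           # best hyperparameter combination
print(search.score(X_test, y_test))  # accuracy on the held-out data
```

Optimising the decision threshold would build on `predict_proba` from the earlier example, choosing a cut-off other than the default 0.5 to trade off false positives against false negatives.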