Save and Load scikit-learn Models

By | June 18, 2018

Once you have trained a scikit-learn model, it is not obvious how to deploy it and use it to score unseen data. This post shows you how to save and load scikit-learn models so you can run them against unseen data.

Train Your Model

The first step is to train the model we want to deploy. In this example we will make a very simple model using the Titanic data set. Read more about training a simple model with scikit-learn.

import pandas as pd
from sklearn.linear_model import LogisticRegression
# Get data
data = pd.read_csv('http://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv')

# Prepare data for model
X = data[['Age', 'Pclass']]
y = data.Survived.values

# Create and fit model object
clf = LogisticRegression()
clf.fit(X, y)

# Check model score (mean accuracy on the training data)
print(clf.score(X, y))

Save The Model as a ‘Pickle’

A ‘pickle’ file is Python's way of saving an object, such as a data structure or a fitted model, to a file so that it can be restored later (similar to how you might save your progress in a computer game).
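To make the idea concrete, here is a minimal sketch of a pickle round trip using Python's standard pickle module. The `progress` dictionary and the file name are made up for illustration; any picklable Python object would work the same way.

```python
import os
import pickle
import tempfile

# Hypothetical data structure standing in for any Python object
progress = {"level": 3, "lives": 2, "items": ["map", "key"]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "progress.pkl")

    # Serialize the object to a file on disk
    with open(path, "wb") as f:
        pickle.dump(progress, f)

    # Read the file back into a brand new object
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored object has the same contents but is a separate copy
print(restored == progress)
print(restored is progress)
```

Only unpickle files you trust: loading a pickle can execute arbitrary code.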

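scikit-learn models are usually pickled with joblib, which is typically more efficient than the standard pickle module for objects that hold large NumPy arrays. joblib is a standalone package that is installed alongside scikit-learn; the old sklearn.externals.joblib import was deprecated in scikit-learn 0.21 and removed in 0.23, so import joblib directly.

To save a pickle file we can use ‘joblib.dump()’:

import joblib
# Output a pickle file for the model
joblib.dump(clf, 'saved_model.pkl')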

The resulting ‘saved_model.pkl’ is a file on disk made from the ‘clf’ object.
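If the file on disk gets large, joblib.dump() also accepts a compress argument. The sketch below is not part of the original workflow; it uses a made-up dictionary holding a big NumPy array in place of a fitted model, and hypothetical file names, just to show the effect on file size.

```python
import os
import tempfile

import joblib
import numpy as np

# Hypothetical large object standing in for a fitted model
big_object = {"weights": np.zeros((1000, 1000))}

with tempfile.TemporaryDirectory() as tmp:
    raw_path = os.path.join(tmp, "raw.pkl")
    small_path = os.path.join(tmp, "compressed.pkl")

    joblib.dump(big_object, raw_path)                # no compression
    joblib.dump(big_object, small_path, compress=3)  # compression level 3

    raw_size = os.path.getsize(raw_path)
    small_size = os.path.getsize(small_path)

    # Compressed files load with the same joblib.load() call
    restored = joblib.load(small_path)

print(raw_size, small_size)
```

Compression trades smaller files for slower saving and loading, so it is a judgment call whether it is worth it for a small model like the one in this post.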

Load The Pickled Model

Once we have a saved pickle file, we can use joblib.load() to load it back into Python.

# Load the pickle file
clf_load = joblib.load('saved_model.pkl') 

The loaded pickle file becomes an object like any other in the Python script, with the same methods as the original fitted model.
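As a rough illustration of using a loaded model on unseen data, the sketch below trains on a tiny made-up table (not the real Titanic data) so that it runs without a download, saves and reloads the model, and then scores two hypothetical new passengers. The column names mirror the example above; the values are invented.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Tiny made-up training set with the same columns as the example above
train = pd.DataFrame({
    "Age":      [22, 38, 26, 35, 54, 2, 27, 14],
    "Pclass":   [3, 1, 3, 1, 1, 3, 3, 2],
    "Survived": [0, 1, 1, 1, 0, 0, 1, 1],
})
clf = LogisticRegression().fit(train[["Age", "Pclass"]], train["Survived"])

# Save the fitted model, then load it back as a separate object
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "saved_model.pkl")
    joblib.dump(clf, path)
    clf_load = joblib.load(path)

# Hypothetical unseen passengers; columns must match the training columns
new_passengers = pd.DataFrame({"Age": [30, 8], "Pclass": [2, 3]})

predictions = clf_load.predict(new_passengers)          # predicted class labels
probabilities = clf_load.predict_proba(new_passengers)  # one row per passenger
print(predictions)
print(probabilities)
```

In a real deployment the loading and predicting would live in a separate script from the training, with only the saved file shared between them.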

We can do a simple check that the saved model and the loaded model yield the same performance:

# Check that the loaded model is the same as the original
assert clf.score(X, y) == clf_load.score(X, y)
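Matching scores is a fairly loose check, since two different models could happen to reach the same accuracy. A stricter option, sketched here on synthetic data (an assumption made so the snippet runs on its own), is to compare the fitted coefficients and the predicted probabilities directly.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data, used only so the sketch is self-contained
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = LogisticRegression().fit(X, y)

# Round trip the model through a file on disk
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "saved_model.pkl")
    joblib.dump(clf, path)
    clf_load = joblib.load(path)

# Compare the learned parameters and the per-row outputs
same_coef = np.array_equal(clf.coef_, clf_load.coef_)
same_proba = np.array_equal(clf.predict_proba(X), clf_load.predict_proba(X))
print(same_coef, same_proba)
```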

Bringing It Together

Here is a combination of the previous snippets, in a single script, which also has a check that the saved and loaded models are doing the same thing.

import pandas as pd
import joblib  # standalone package; sklearn.externals.joblib was removed in scikit-learn 0.23
from sklearn.linear_model import LogisticRegression
# Get data
data = pd.read_csv('http://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv')

# Prepare data for model
X = data[['Age', 'Pclass']]
y = data.Survived.values

# Create and fit model object
clf = LogisticRegression()
clf.fit(X, y)

# Check model score (mean accuracy on the training data)
print(clf.score(X, y))

# Output a pickle file for the model
joblib.dump(clf, 'saved_model.pkl') 

# Load the pickle file
clf_load = joblib.load('saved_model.pkl') 

# Check that the loaded model is the same as the original
assert clf.score(X, y) == clf_load.score(X, y)