Once you have trained a scikit-learn model, it is not obvious how to deploy it and use it to score unseen data. This post shows you how to save and load scikit-learn models so you can run them against unseen data.
Train Your Model
The first step is to train the model we want to deploy. In this example we will build a very simple model using the Titanic data set. Read more about training a simple model with scikit-learn.
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Get data
data = pd.read_csv('http://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv')

# Prepare data for model
X = data[['Age', 'Pclass']]
y = data.Survived.values

# Create and fit model object
clf = LogisticRegression()
clf.fit(X, y)

# Check model score
clf.score(X, y)
Save The Model as a ‘Pickle’
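A ‘pickle’ file is Python's way of saving a data structure to a file (similar to how you might save your progress in a computer game).
scikit-learn models are usually pickled with joblib, which is more efficient than the standard pickle module for objects that carry large NumPy arrays, as fitted models often do. Older versions of scikit-learn bundled it as sklearn.externals.joblib; that module was removed in version 0.23, so joblib should now be imported directly.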
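To make the idea concrete, here is a minimal sketch of a pickle round trip using only Python's standard library. The `progress` dictionary and the file name `progress.pkl` are made up for illustration, and the file is written to a temporary directory so nothing is left behind.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for any Python object you want to keep
progress = {"level": 3, "score": 1250, "items": ["sword", "shield"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    save_path = Path(tmp_dir) / "progress.pkl"

    # Serialise the object to a file on disk
    with open(save_path, "wb") as f:
        pickle.dump(progress, f)

    # Read the file back into a new object
    with open(save_path, "rb") as f:
        restored = pickle.load(f)

# The restored object has the same contents but is a separate copy
print(restored == progress)
print(restored is progress)
```

Only load pickle files you trust: unpickling can execute arbitrary code.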
To save a pickle file we can use ‘joblib.dump()’:
# sklearn.externals.joblib was removed in scikit-learn 0.23; import joblib directly
import joblib

# Output a pickle file for the model
joblib.dump(clf, 'saved_model.pkl')
The resulting ‘saved_model.pkl’ is a file on disk made from the ‘clf’ object.
Load The Pickled Model
Once we have a saved pickle file, we can use joblib.load() to load it back into Python.
# Load the pickle file
clf_load = joblib.load('saved_model.pkl')
The loaded pickle file becomes an object like any other in the Python script.
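Because the loaded object is an ordinary fitted estimator, it can score rows it has never seen. The sketch below illustrates this under some assumptions: the training rows and the `new_passengers` frame are invented so the example runs without a download, and the model is written to a temporary directory rather than the working directory.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Tiny made-up training set with the same columns as the model in this post
train = pd.DataFrame({
    "Age": [22, 38, 26, 35, 54, 2, 27, 14, 58, 20],
    "Pclass": [3, 1, 3, 1, 1, 3, 3, 2, 1, 3],
    "Survived": [0, 1, 1, 1, 0, 0, 1, 1, 1, 0],
})
clf = LogisticRegression()
clf.fit(train[["Age", "Pclass"]], train["Survived"])

# Save the model and load it back, as a deployed script would
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "saved_model.pkl"
    joblib.dump(clf, model_path)
    clf_load = joblib.load(model_path)

# Hypothetical unseen passengers: same columns, in the same order, as in training
new_passengers = pd.DataFrame({"Age": [30, 8], "Pclass": [1, 3]})

# Predicted class (0 or 1) and predicted probability of survival for each row
predictions = clf_load.predict(new_passengers)
probabilities = clf_load.predict_proba(new_passengers)[:, 1]
print(predictions)
print(probabilities)
```

The unseen data must have the same feature columns, in the same order, that the model was trained on.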
We can do a simple check that the saved model and the loaded model yield the same performance:
# Check that the loaded model is the same as the original
assert clf.score(X, y) == clf_load.score(X, y)
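Two different models can happen to reach the same accuracy, so a stricter check compares the learned parameters and the row-by-row predictions. This is a sketch under stated assumptions: the data is synthetic (generated with NumPy, not the Titanic set) so that it runs offline, and the model is saved to a temporary directory.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Small synthetic data set standing in for the Titanic features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = LogisticRegression().fit(X, y)

# Round trip through a pickle file
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "saved_model.pkl"
    joblib.dump(clf, model_path)
    clf_load = joblib.load(model_path)

# The learned parameters survive the round trip unchanged
assert np.array_equal(clf.coef_, clf_load.coef_)
assert np.array_equal(clf.intercept_, clf_load.intercept_)

# Every individual prediction matches, not just the aggregate score
assert np.array_equal(clf.predict(X), clf_load.predict(X))
assert np.array_equal(clf.predict_proba(X), clf_load.predict_proba(X))
```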
Bringing It Together
Here is a combination of the previous snippets, in a single script, which also has a check that the saved and loaded models are doing the same thing.
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Get data
data = pd.read_csv('http://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv')

# Prepare data for model
X = data[['Age', 'Pclass']]
y = data.Survived.values

# Create and fit model object
clf = LogisticRegression()
clf.fit(X, y)

# Check model score (printed, since a bare expression shows nothing in a script)
print(clf.score(X, y))

# Output a pickle file for the model
joblib.dump(clf, 'saved_model.pkl')

# Load the pickle file
clf_load = joblib.load('saved_model.pkl')

# Check that the loaded model is the same as the original
assert clf.score(X, y) == clf_load.score(X, y)