Python Sample Datasets for Data Science and Machine Learning

By | November 11, 2016

Here’s where to find sample datasets for playing with data in Python. If you’re testing or validating a model or analysis for data science or machine learning, it can be useful to have some sample data to play with. R has the datasets package, which makes loading sample datasets easy, but it’s not so obvious what to do in Python. This post shows you some of the options.


Load CSV files from the internet

A simple way to get sample datasets in Python is to use the pandas read_csv function to load them directly from the internet. Just pass the address of your target CSV dataset as the argument to read_csv:

import pandas as pd
data = pd.read_csv("http://www.ats.ucla.edu/stat/data/binary.csv")

You can also use this method to load the datasets found in the R datasets package: just copy the link to the CSV file you want. It’s a bit clunkier than the R package, but it does give you easy access to the data.
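As a rough sketch of the R datasets idea, the snippet below builds the address of a CSV file and hands it to read_csv. The base address is an assumption: it points at the Rdatasets mirror, which this post does not name, and the helper names rdataset_url and load_rdataset are made up for illustration.

```python
import pandas as pd

# Assumption: a public CSV mirror of the R datasets (not named in this post).
RDATASETS_BASE = "https://vincentarelbundock.github.io/Rdatasets/csv"


def rdataset_url(package, name):
    """Build the CSV address for a dataset, e.g. ("datasets", "iris")."""
    return f"{RDATASETS_BASE}/{package}/{name}.csv"


def load_rdataset(package, name):
    """Download one dataset into a pandas DataFrame (needs internet access)."""
    # Assumption: the first column holds R's row names, so use it as the index.
    return pd.read_csv(rdataset_url(package, name), index_col=0)


# Show the address that would be requested for the iris dataset.
print(rdataset_url("datasets", "iris"))
```

Calling load_rdataset("datasets", "iris") should then give you the iris data as a DataFrame, provided the mirror is reachable and still uses this layout.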

Use the seaborn library

Seaborn is primarily a plotting library for Python, but you can also use it to access sample datasets. The example below loads the iris dataset as a pandas DataFrame (the iris dataset is also available in R).

import seaborn as sns
iris = sns.load_dataset('iris')

Find out more about this method in the seaborn documentation for load_dataset.
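If you want to see what else is on offer, something like the sketch below may help. It wraps two seaborn calls, get_dataset_names and load_dataset, in small helper functions; the helper names are made up for illustration, and both calls fetch from seaborn's online data repository, so they need an internet connection.

```python
import seaborn as sns


def show_available():
    """Print the names of the sample datasets seaborn can download."""
    for name in sns.get_dataset_names():
        print(name)


def load_and_preview(name="iris"):
    """Load one sample dataset and print its first few rows."""
    df = sns.load_dataset(name)
    print(df.head())
    return df
```

Run show_available() first to pick a name, then pass it to load_and_preview().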

Use the sklearn package

Scikit-learn (sklearn) is a popular machine learning package for Python and, just like seaborn, it comes with some sample datasets ready for you to play with.

You can access the sklearn datasets like this:

from sklearn.datasets import load_iris
iris = load_iris()
data = iris.data
column_names = iris.feature_names
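The object returned by load_iris holds more than the feature matrix. Here is a small sketch for poking around in it; the attribute names are the ones scikit-learn uses for this dataset, and the slicing is only there to keep the printout short.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Feature matrix: one row per flower, one column per measurement.
print(iris.data.shape)

# Names of the measurement columns and of the classes.
print(iris.feature_names)
print(iris.target_names)

# Class labels as integer codes that index into target_names.
print(iris.target[:5])

# Start of the plain-text description that ships with the dataset.
print(iris.DESCR[:200])
```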

If you like to use pandas to work with your data, then you’ll also want to do something like this to get the sklearn data into a pandas dataframe:

import pandas as pd
df = pd.DataFrame(iris.data, columns=iris.feature_names)
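You will usually want the class labels in the same table as the features. The sketch below is one way to do that, with a "species" column name chosen purely for illustration. It also shows the as_frame option, which assumes scikit-learn 0.23 or later.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# Feature matrix as a DataFrame, with the feature names as column labels.
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Add the class labels, mapped from integer codes to species names.
df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

print(df.head())
print(df["species"].value_counts())

# Alternative (assumes scikit-learn >= 0.23): let sklearn build the frame.
# The labels arrive as integer codes in a column called "target".
frame = load_iris(as_frame=True).frame
print(frame.columns.tolist())
```

The first approach gives readable species names; the second is shorter but leaves the labels as integer codes.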

Let me know if you have any better ways to access sample datasets in Python.
