Find out where to get sample datasets for playing with data in Python. If you’re testing or validating a model or analysis for data science or machine learning, it helps to have some sample data to play with. R has the datasets package, which makes loading sample datasets easy, but it’s not so obvious what to do in Python. This post shows you some of the options.
Load CSV files from the internet
A simple way to get sample datasets in Python is to use the pandas read_csv function to load them directly from the internet. To do this, pass the address of your target CSV file as the argument to read_csv:
import pandas as pd
data = pd.read_csv("http://www.ats.ucla.edu/stat/data/binary.csv")
You can use the same approach to load the datasets found in the R datasets package: find a CSV copy of the dataset you want online, copy its link, and pass it to read_csv. It’s a bit clunkier than the R package, but it does give you easy access to the data.
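As a rough sketch of how that might look, the snippet below builds the link for a dataset from R and hands it to read_csv. It assumes the community-maintained Rdatasets mirror as the source of the CSV files (the original post does not say which site to use), and the helper names rdataset_url and load_rdataset are made up for this illustration.

```python
import pandas as pd

# Assumption: CSV copies of the R datasets are served by the Rdatasets mirror.
# This base URL is not from the original post.
RDATASETS_BASE = "https://vincentarelbundock.github.io/Rdatasets/csv"


def rdataset_url(name, package="datasets"):
    """Build the CSV link for a dataset from a given R package (hypothetical helper)."""
    return f"{RDATASETS_BASE}/{package}/{name}.csv"


def load_rdataset(name, package="datasets"):
    """Download the CSV and return it as a pandas DataFrame (needs internet access)."""
    # The first column of these files holds R's row names, so use it as the index.
    return pd.read_csv(rdataset_url(name, package), index_col=0)


# Show the link that would be passed to read_csv for the mtcars dataset.
print(rdataset_url("mtcars"))
```

With an internet connection, load_rdataset("mtcars") should then return the data as a DataFrame, and passing package="ggplot2" or similar would point at datasets from other R packages on the same mirror.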
Use the seaborn library
Seaborn is primarily a plotting library for Python, but you can also use it to access sample datasets. The example below loads the iris dataset as a pandas DataFrame (the iris dataset is also available in R). Note that load_dataset downloads the data, so it needs an internet connection.
# seaborn.apionly has been removed from newer seaborn releases, so import seaborn directly
import seaborn as sns
iris = sns.load_dataset('iris')
Find out more about this method in the seaborn documentation for load_dataset.
Use the sklearn package
The scikit-learn package (imported as sklearn) comes with a few small datasets built in. You can access them like this:
from sklearn.datasets import load_iris
iris = load_iris()
data = iris.data
column_names = iris.feature_names
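The object returned by load_iris holds more than the measurements. Here is a small sketch of how you might look at what else is in there, using the target and target_names attributes alongside the data and feature_names used above:

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Feature matrix: one row per flower, one column per measurement.
print(iris.data.shape)  # (150, 4)

# Names of the four measurement columns.
print(iris.feature_names)

# Integer class label for each row, and the species names those labels refer to.
print(iris.target[:5])
print(list(iris.target_names))
```

The integer labels in iris.target index into iris.target_names, which is handy if you want readable species names rather than numbers.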
If you like to use pandas to work with your data, then you’ll also want to do something like this to get the sklearn data into a pandas DataFrame:
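import pandas as pd
# pass the feature names as columns; the second positional argument would be treated as the row index
df = pd.DataFrame(iris.data, columns=iris.feature_names)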
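If you also want the species labels in the same DataFrame, one possible way is sketched below. The column name species is just a choice made for this example, not something sklearn defines.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# Measurements go in as the data, feature names as the column labels.
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# iris.target holds integer codes (0, 1, 2); index into target_names
# to turn them into readable species names.
df["species"] = iris.target_names[iris.target]

print(df.shape)  # (150, 5)
print(df.head())
```

Recent versions of scikit-learn (0.23 or later, as far as I know) also accept load_iris(as_frame=True), which returns the data as pandas objects directly, so it may be worth checking whether your installed version supports that.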
Let me know if you have any better ways to access sample datasets in Python.