Learn where to find sample datasets for playing with data in Python. If you’re testing or validating a model or analysis for data science or machine learning, it can be useful to have some sample data to play with. R has the datasets package, which makes loading sample datasets easy, but it’s not so obvious what to do in Python. This post shows you some of the options.
Load CSV files from the internet
A simple way to get sample datasets in Python is to use the pandas read_csv function to load them directly from the internet. To do this, just pass the address of your target CSV dataset as the argument to read_csv:
import pandas as pd
data = pd.read_csv("http://www.ats.ucla.edu/stat/data/binary.csv")
You can also use this method to load the datasets found in the R datasets package: just copy the link to the CSV file you want. It’s a bit clunkier than the R package, but it does give you easy access to the data.
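If you load the same remote file often, it can help to keep a local copy so you are not dependent on the link staying alive. The sketch below is one way to do that; the function name load_csv_cached and the cache path are hypothetical names chosen for illustration, not part of pandas.

```python
from pathlib import Path

import pandas as pd


def load_csv_cached(url, cache_path):
    """Read a CSV from cache_path if it exists; otherwise fetch url and save a copy.

    Hypothetical helper for illustration: pandas itself does no caching.
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        # A local copy is already there, so no network access is needed
        return pd.read_csv(cache_path)
    # read_csv accepts a URL just like a local file path
    df = pd.read_csv(url)
    # Save the downloaded data so later calls can reuse it
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(cache_path, index=False)
    return df
```

You would call it as load_csv_cached("http://www.ats.ucla.edu/stat/data/binary.csv", "data/binary.csv"); the first call downloads the file and later calls read the saved copy.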
Use the seaborn library
Seaborn is primarily a plotting library for Python, but you can also use it to access sample datasets. The example below loads the iris dataset as a pandas dataframe (the iris dataset is also available in R). The data is downloaded from seaborn’s online data repository, so you need an internet connection the first time you load it.
import seaborn as sns
iris = sns.load_dataset('iris')
Find out more about this method here.
Use the sklearn package
Scikit-learn is a popular machine learning package for Python and, just like the seaborn package, sklearn comes with some sample datasets ready for you to play with.
You can access the sklearn datasets like this:
from sklearn.datasets import load_iris
iris = load_iris()
data = iris.data
column_names = iris.feature_names
If you like to use pandas to work with your data, then you’ll also want to do something like this to get the sklearn data into a pandas dataframe:
import pandas as pd
df = pd.DataFrame(iris.data, columns=iris.feature_names)
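As an alternative to building the dataframe yourself, newer versions of scikit-learn (0.23 and later, as far as I know) let you pass as_frame=True to the loader so the data comes back as pandas objects. This is a minimal sketch under that assumption; the species column is something added here for readability and is not part of the sklearn dataset.

```python
from sklearn.datasets import load_iris

# as_frame=True asks sklearn to return pandas objects instead of NumPy arrays
iris = load_iris(as_frame=True)

df = iris.frame        # the four feature columns plus a 'target' column
features = iris.data   # DataFrame with only the feature columns
target = iris.target   # Series of integer class labels (0, 1, 2)

# Map the integer labels to species names (illustrative extra column)
df["species"] = df["target"].map(dict(enumerate(iris.target_names)))
```

This keeps the column names and the target labels together in one dataframe, which is handy if you want to group or plot by species.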
Let me know if you have any better ways to access sample datasets in Python.