Python Pandas for Physics
The Pandas package brings advanced data structures and data analysis tools to Python. It was initially conceived for use in the world of finance, but can it be of any use to those of us interested in physics?
For some time I’ve been trying to get to the stage that I can use python for all my scripting and data analysis, replacing programs like Matlab and Origin. I’ve found python distributions that have a similar development environment to Matlab, and the search is now on for packages that simplify the process of data analysis, manipulation and visualisation.
My needs are relatively modest: I don’t need to do much more than load, manipulate and plot small data files. I wonder if such a powerful package as pandas can also satisfy these requirements.
Get the Pandas Python Package
You might also want to consider the Stoner python package – a ‘home-grown’ analysis package made at the University of Leeds. It’s not as polished or well supported as pandas, but it was built from the ground up as a physics analysis package.
Pandas For Physics
Much of the power of pandas comes from its ‘DataFrame’ data structure. The columns of this 2D table can be labelled, and each column can hold a different data type.
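To make the idea of labelled, mixed-type columns concrete, here is a minimal sketch. The column names and values are invented purely for illustration and do not come from any real measurement.

```python
import pandas as pd

# Hypothetical measurement table: names and numbers are made up.
frame = pd.DataFrame({
    "sample": ["A", "B", "C"],            # text labels
    "temperature_K": [4.2, 77.0, 295.0],  # floating-point values
    "n_scans": [3, 5, 2],                 # integers
})

print(frame.dtypes)            # one data type per column
print(frame["temperature_K"])  # pick out a column by its label
```

Each column keeps its own type, so text labels can sit alongside numerical data in the same table.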
To explore how it might be useful, I’ve run through a few common tasks I’ve encountered in physics research. For this I’m using the Anaconda python distribution.
For the impatient: Get the gist of this post on github
Loading data (from csv)
More or less any analysis task begins with loading from a data file. I’ve previously tried to find a simple, standard way to load data into python, and pandas seems to be one of the simplest so far:
import pandas as pd
mydata = pd.read_csv('data.dat')
More on loading data with pandas.
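Data files from lab instruments are often tab-separated and start with comment lines rather than being plain comma-separated text. As a rough sketch of how read_csv can cope with that, the example below reads from an in-memory string standing in for a file on disk. The file contents and the '#' comment marker are assumptions made for illustration.

```python
import io
import pandas as pd

# Stand-in for a file on disk; the contents are invented for illustration.
raw = io.StringIO(
    "# instrument header line\n"
    "Theta\tCounts\n"
    "0\t120\n"
    "30\t98\n"
    "60\t51\n"
)

# sep sets the column delimiter; comment makes pandas ignore text after '#'.
example = pd.read_csv(raw, sep="\t", comment="#")
print(example)
```

With a real file you would pass the file name in place of the StringIO object.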
Display Columns
Once we’ve loaded the file into our python session, we can display it very easily too:
print(mydata.head())
The “.head()” just shows the first 5 lines of our data – useful if we are trying to read a large file!
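If five rows is not what you want, head() accepts the number of rows to show, and tail() does the same from the end of the table. A small sketch with a made-up ten-row frame:

```python
import pandas as pd

# Made-up frame with ten rows, Theta = 0, 10, ..., 90.
demo = pd.DataFrame({"Theta": range(0, 100, 10)})

first_three = demo.head(3)  # first three rows
last_two = demo.tail(2)     # last two rows
print(first_three)
print(last_two)
```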
Simple Statistics
It’s also a quick job to output basic statistics about the different columns of data in our file:
print(mydata.describe())
which will output statistics like mean, standard deviation, maximum and minimum, etc.
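The table returned by describe() is itself a data frame, so individual numbers can be pulled out of it by label. The sketch below uses invented data to show this, alongside the equivalent single-statistic methods.

```python
import pandas as pd

# Invented example data.
demo = pd.DataFrame({"Theta": [0.0, 30.0, 60.0, 90.0],
                     "Counts": [120, 98, 51, 7]})

summary = demo.describe()
print(summary.loc[["mean", "std"]])  # just the rows we care about

# The same kind of numbers are available one at a time.
mean_counts = demo["Counts"].mean()
max_theta = demo["Theta"].max()
print(mean_counts, max_theta)
```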
Plot Your Data
Plotting data is crucial to any data analysis package. Pandas relies on the matplotlib plotting package, and as with the other examples we’ve seen, plotting is very easy:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
mydata.plot(x='x column', y='y column', style='-', ax=ax)
We select which columns to plot using the column heading labels. This makes it very easy to keep track of exactly which data we are plotting.
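For a physics plot you will usually want axis labels with units as well. Here is a minimal sketch using invented data. The "Agg" backend line is only there so the example runs without a display and can be dropped in an interactive session.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; omit when working interactively
import matplotlib.pyplot as plt
import pandas as pd

# Invented example data.
demo = pd.DataFrame({"Theta": [0, 30, 60, 90],
                     "Counts": [120, 98, 51, 7]})

fig, ax = plt.subplots()
demo.plot(x="Theta", y="Counts", style="o-", ax=ax)  # draw on our own axes
ax.set_xlabel("Theta (degrees)")
ax.set_ylabel("Counts")
# fig.savefig("counts_vs_theta.png") would write the figure to disk.
```

Passing ax explicitly keeps the plot on the figure you created, which makes further tweaks with ordinary matplotlib calls straightforward.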
Add New Column
It’s all well and good being able to load and view existing data, but also important is being able to add new columns.
In this example I add a new column to an existing data frame, filled with the cosine of an angle (in degrees) held in an existing column. We define in advance the function to apply to each value in that column.
import math

def dcos(theta):
    # convert from degrees to radians before taking the cosine
    theta = theta*(math.pi/180)
    return math.cos(theta)

mydata['New Column1'] = mydata["Theta"].apply(dcos)
More on adding new columns to a Pandas data frame
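An alternative to applying a function value by value is to let numpy work on the whole column at once. This is a sketch of that approach rather than the method used above. The frame and the 'cos_theta' column name are made up for the example.

```python
import numpy as np
import pandas as pd

# Invented angles in degrees.
demo = pd.DataFrame({"Theta": [0.0, 60.0, 90.0, 180.0]})

# np.deg2rad and np.cos both operate on the whole column in one go.
demo["cos_theta"] = np.cos(np.deg2rad(demo["Theta"]))
print(demo)
```

For small files the difference is negligible, but the vectorised form is shorter and tends to be faster on long columns.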
Save Data to A File
As with reading from a file, writing to a file is very easy with pandas:
mydata.to_csv('Newcsv.dat')
This is really easy to do, and the data comes out nicely formatted. Note that by default the row index is written out as the first column.
More on saving data frames to a file.
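If you would rather not have the row index in the output, to_csv takes an index=False argument. The sketch below writes invented data to an in-memory buffer instead of a real file, then reads it back to check that nothing was lost.

```python
import io
import pandas as pd

# Invented example data.
demo = pd.DataFrame({"Theta": [0, 30, 60], "Counts": [120, 98, 51]})

buffer = io.StringIO()            # stands in for a file on disk
demo.to_csv(buffer, index=False)  # leave the row index out of the output
text = buffer.getvalue()
print(text)

# Read the text back in to confirm the round trip.
reloaded = pd.read_csv(io.StringIO(text))
print(reloaded.equals(demo))
```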
Pandas Tutorials
I’ve gone through a few of the simpler tasks you might need, but to really explore what Pandas could do for you, you should download it and have a go yourself. Pandas is well supported, and well documented, so you’ll be able to find plenty of examples to help you out.
Here are a few guides to help get you started:
http://pandas.pydata.org/pandas-docs/dev/basics.html
http://pandas.pydata.org/pandas-docs/dev/10min.html
http://pandas.pydata.org/pandas-docs/dev/cookbook.html#cookbook
http://www.gregreda.com/2013/10/26/working-with-pandas-dataframes/
In particular I found this one really helpful:
http://synesthesiam.com/posts/an-introduction-to-pandas.html
Image: Jay; Some rights reserved