Data tables often come in a format that makes sense to the human who created the table, but that’s difficult for analysis. To make analysis easier we can reshape the data into a more computer-friendly form. Pandas is a python data analysis library and in this post I reshape pandas data with melt.
A Reshape Example
Simple Dataframe
Firstly, let’s create a simple dataframe in pandas:
data = {'weekday': ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"], 'Person 1': [12, 6, 5, 8, 11, 6, 4], 'Person 2': [10, 6, 11, 5, 8, 9, 12], 'Person 3': [8, 5, 7, 3, 7, 11, 15]} df = pd.DataFrame(data, columns=['weekday', 'Person 1', 'Person 2', 'Person 3'])
This dataframe is similar to how we might have recorded the data by hand or in a spreadsheet, and could represent test scores of three different people on each day of the week.There is a row for each day of the week, and a separate column for each person:
weekday | Person 1 | Person 2 | Person 3 | |
---|---|---|---|---|
0 | Monday | 12 | 10 | 8 |
1 | Tuesday | 6 | 6 | 5 |
2 | Wednesday | 5 | 11 | 7 |
3 | Thursday | 8 | 5 | 3 |
4 | Friday | 11 | 8 | 7 |
5 | Saturday | 6 | 9 | 11 |
6 | Sunday | 4 | 12 | 15 |
Fine For Humans – Not So Good For Analysis
While this simple dataframe structure is fine for a human reader, if we wanted to create use ‘groupby‘ operations or begin to normalise this dataframe to we would need to do some reshaping. We could reshape the dataframe to have a column containing person information, and another column to contain that person’s score on a particular day, like this:
weekday | Person | Score |
---|
Reshape With Melt
It is of course possible to reshape a data table by hand, by copying and pasting the values from each person’s column into the new ‘person’ column. This would take a a long time even for this small dataframe, and would be prone to errrors. A much better idea is to reshape the dataframe with melt:
melted = pd.melt(df, id_vars=["weekday"], var_name="Person", value_name="Score")
Here we have set the variables (columns) that we want to leave unaffected. Variables not included in this list will become rows in a new column (which has the name given by “var_name”). The values belonging to the original rows/columns are found in a new column with a name given by “value_name”, and the output dataframe now has three rows for each day of the week – one for each of person 1, 2, and 3. This new form of the dataframe is now more ‘normalised’ and can have groupby operations applied to it.
weekday | Person | Score | |
---|---|---|---|
0 | Monday | Person 1 | 12 |
1 | Tuesday | Person 1 | 6 |
2 | Wednesday | Person 1 | 5 |
3 | Thursday | Person 1 | 8 |
4 | Friday | Person 1 | 11 |
5 | Saturday | Person 1 | 6 |
6 | Sunday | Person 1 | 4 |
7 | Monday | Person 2 | 10 |
8 | Tuesday | Person 2 | 6 |
9 | Wednesday | Person 2 | 11 |
10 | Thursday | Person 2 | 5 |
11 | Friday | Person 2 | 8 |
12 | Saturday | Person 2 | 9 |
13 | Sunday | Person 2 | 12 |
14 | Monday | Person 3 | 8 |
15 | Tuesday | Person 3 | 5 |
16 | Wednesday | Person 3 | 7 |
17 | Thursday | Person 3 | 3 |
18 | Friday | Person 3 | 7 |
19 | Saturday | Person 3 | 11 |
20 | Sunday | Person 3 | 15 |
Read more about using melt at stackoverflow
good