Pivot and annotate Pandas DataFrame

Question

Given the CSV data:

    ,fan1,fan2,foil1,foil2
    0,0.0,0.0,0.0,0.125
    1,0.0625,0.0,0.0625,0.125
    2,0.0625,0.0,0.0,0.3125

I want to turn this into a kind of annotated pivot table that can be plotted as a bar plot:

    ,Err,PairType,StimType
    0,0.0,Target,1
    1,0.0625,Target,1
    2,0.0625,Target,1
    0,0.0,Target,2
    1,0.0,Target,2
    2,0.0,Target,2
    0,0.0,RPFoil,1
    1,0.0625,RPFoil,1
    2,0.0,RPFoil,1
    0,0.125,RPFoil,2
    1,0.125,RPFoil,2
    2,0.3125,RPFoil,2

I currently accomplish this with the following code:

    import numpy as np
    import pandas as pd


    def df_plotable(model_err: pd.DataFrame):
        t_len = len(model_err.fan1)
        cols = ("Err", "PairType", "StimType")

        fan1_df = pd.DataFrame(np.array([model_err.fan1, ["Fan"]*t_len, [1]*t_len]).T, columns=cols)
        fan2_df = pd.DataFrame(np.array([model_err.fan2, ["Fan"]*t_len, [2]*t_len]).T, columns=cols)
        foil1_df = pd.DataFrame(np.array([model_err.foil1, ["Foil"]*t_len, [1]*t_len]).T, columns=cols)
        foil2_df = pd.DataFrame(np.array([model_err.foil2, ["Foil"]*t_len, [2]*t_len]).T, columns=cols)

        new_model_err = pd.concat((fan1_df, fan2_df, foil1_df, foil2_df))
        new_model_err["Err"] = new_model_err["Err"].astype(float)
        new_model_err["StimType"] = new_model_err["StimType"].astype(int)

        return new_model_err

Such that:

    # The first CSV column holds the row labels; the comma separator is the default.
    df = pd.read_csv("in.csv", index_col=0)
    df_plotable(df).to_csv("out.csv")

Is there a way to do this more cleanly?

miradulo · Accepted Answer · 2018-08-04 15:48:41Z

Don't hard-code your transformation

With your current approach, as soon as you are faced with a new PairType or StimType you will have to adjust your function accordingly to account for them. What your current code is doing is really a hard-coded version of a wide-form to long-form conversion of the column data - and Pandas has methods to allow you to do that in an automated way.

Two options would be df.melt, or a combination of df.unstack and reset_index.

Either way, after this step you'll nearly be all the way there.

    >>> model_err.melt(var_name='PairStim', value_name='Err')
       PairStim     Err
    0      fan1  0.0000
    1      fan1  0.0625
    2      fan1  0.0625
    3      fan2  0.0000
    4      fan2  0.0000
    5      fan2  0.0000
    6     foil1  0.0000
    7     foil1  0.0625
    8     foil1  0.0000
    9     foil2  0.1250
    10    foil2  0.1250
    11    foil2  0.3125
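For comparison, the unstack route mentioned above might look like the sketch below. The sample frame is rebuilt from the question's CSV, and the column name `row` is only a placeholder of mine for the original row label, not something from the post.

```python
import pandas as pd

# Sample data copied from the question's CSV.
model_err = pd.DataFrame({
    'fan1': [0.0, 0.0625, 0.0625],
    'fan2': [0.0, 0.0, 0.0],
    'foil1': [0.0, 0.0625, 0.0],
    'foil2': [0.125, 0.125, 0.3125],
})

# unstack() on a frame with plain (non-MultiIndex) columns returns a Series
# indexed by (column name, original row label); reset_index() turns both
# index levels into ordinary columns.
stacked = model_err.unstack().reset_index()

# 'row' is a placeholder name for the original row label (an assumption).
stacked.columns = ['PairStim', 'row', 'Err']
long_form = stacked.drop(columns='row')

print(long_form)
```

This should give the same two columns as the melt call, at the cost of an extra rename and drop, which is one reason melt reads more directly here.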

The only other step in automation is to split the PairStim column up into its components, and do some cleanup. Wrapping this up in your function:

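    def longform_model_error(model_err: pd.DataFrame):
        melted = model_err.melt(var_name='PairStim', value_name='Err')
        # Split e.g. 'foil2' into a name ('foil') and a trailing number ('2').
        # The raw string avoids invalid-escape warnings, and the lazy \w+?
        # leaves all trailing digits to the second group.
        melted[['PairType', 'StimType']] = (
            melted['PairStim'].str.extract(r'(\w+?)(\d+)$', expand=True)
        )
        melted['PairType'] = melted['PairType'].str.capitalize()
        melted['StimType'] = melted['StimType'].astype(int)
        return melted.drop('PairStim', axis='columns')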
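As a quick check, the sketch below runs the same idea on the sample data from the question. It is self-contained, so it restates the function (here with named regex groups). The `label_map` dictionary is an assumption inferred from the desired output in the question, which shows Target and RPFoil rather than Fan and Foil; the answer itself does not specify that mapping.

```python
import io

import pandas as pd

# Sample data copied from the question.
CSV_TEXT = """,fan1,fan2,foil1,foil2
0,0.0,0.0,0.0,0.125
1,0.0625,0.0,0.0625,0.125
2,0.0625,0.0,0.0,0.3125
"""


def longform_model_error(model_err: pd.DataFrame) -> pd.DataFrame:
    melted = model_err.melt(var_name='PairStim', value_name='Err')
    # Named groups become the column names of the extracted frame.
    parts = melted['PairStim'].str.extract(
        r'(?P<PairType>\w+?)(?P<StimType>\d+)$'
    )
    melted['PairType'] = parts['PairType'].str.capitalize()
    melted['StimType'] = parts['StimType'].astype(int)
    return melted.drop(columns='PairStim')


model_err = pd.read_csv(io.StringIO(CSV_TEXT), index_col=0)
long_form = longform_model_error(model_err)

# Hypothetical relabelling, guessed from the question's desired output.
label_map = {'Fan': 'Target', 'Foil': 'RPFoil'}
relabelled = long_form.assign(PairType=long_form['PairType'].map(label_map))

print(relabelled)
```

Keeping the relabelling as a separate dictionary means a new pair type only needs a new entry there, while the reshaping itself stays generic.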

Comments on your current approach

Consider using the .size attribute on a Series (or the .shape attribute) instead of len. This makes it more clear to me that model_err.fan1 is indeed a Pandas type.
If you did want to hard-code the transformation for some reason, creating a DataFrame for each group and concat-ing them together is not ideal - you might as well remain in NumPy-land for as long as possible before turning it into a DataFrame. You would also benefit from hard-coding the separate columns instead of separate rows, as the dtypes of the columns are homogeneous, so you'd avoid the subsequent casting. Perhaps something like
```
def longform_model_error_hardcode(model_err: pd.DataFrame):
    n_per_type = model_err.shape[0]
    pair_types = ('Fan', 'Foil')
    stim_types = (1, 2)
    n_pairs = len(pair_types)
    n_stims = len(stim_types)

    cols = {
        # Column-major flatten: fan1, fan2, foil1, foil2 stacked in order.
        'Err': model_err.values.ravel('F'),
        'PairType': np.repeat(pair_types, n_per_type*n_stims),
        # The per-stimulus pattern repeats once for each pair type.
        'StimType': np.tile(np.repeat(stim_types, n_per_type), n_pairs)
    }
    return pd.DataFrame(cols)
```
This will be faster, though if you were to go down the hard-coding avenue there isn't really much point in using Pandas at all if you're not performing other operations on the data - might as well just stick to NumPy in that case.
A nitpick: df_plotable and new_model_err don't really help me understand what the function is supposed to do, or how new_model_err differs from the input DataFrame. Try to use more descriptive names.
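On the first point above, here is a small sketch of how len, .size and .shape relate. The two-column frame is a cut-down copy of the question's data, and the variable names are mine.

```python
import pandas as pd

# Cut-down copy of the question's data.
model_err = pd.DataFrame({
    'fan1': [0.0, 0.0625, 0.0625],
    'fan2': [0.0, 0.0, 0.0],
})

n_rows_len = len(model_err.fan1)     # works on any sized container
n_rows_size = model_err.fan1.size    # Series: number of elements
n_rows_shape = model_err.shape[0]    # DataFrame: number of rows

# Caveat: on a DataFrame, .size counts every cell (rows * columns).
n_cells = model_err.size

print(n_rows_len, n_rows_size, n_rows_shape, n_cells)
```

All three row counts agree for a Series, so the choice is about readability; the caveat is that .size on the whole DataFrame is not a row count.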
