\$\begingroup\$

Given the CSV data:

    ,fan1,fan2,foil1,foil2
    0,0.0,0.0,0.0,0.125
    1,0.0625,0.0,0.0625,0.125
    2,0.0625,0.0,0.0,0.3125

I want to turn this into a kind of annotated pivot table that can be plotted as a bar plot:

    ,Err,PairType,StimType
    0,0.0,Target,1
    1,0.0625,Target,1
    2,0.0625,Target,1
    0,0.0,Target,2
    1,0.0,Target,2
    2,0.0,Target,2
    0,0.0,RPFoil,1
    1,0.0625,RPFoil,1
    2,0.0,RPFoil,1
    0,0.125,RPFoil,2
    1,0.125,RPFoil,2
    2,0.3125,RPFoil,2

I currently accomplish this with the following code:

    import numpy as np
    import pandas as pd


    def df_plotable(model_err: pd.DataFrame):
        t_len = len(model_err.fan1)
        cols = ("Err", "PairType", "StimType")

        fan1_df = pd.DataFrame(np.array([model_err.fan1, ["Fan"]*t_len, [1]*t_len]).T, columns=cols)
        fan2_df = pd.DataFrame(np.array([model_err.fan2, ["Fan"]*t_len, [2]*t_len]).T, columns=cols)
        foil1_df = pd.DataFrame(np.array([model_err.foil1, ["Foil"]*t_len, [1]*t_len]).T, columns=cols)
        foil2_df = pd.DataFrame(np.array([model_err.foil2, ["Foil"]*t_len, [2]*t_len]).T, columns=cols)

        new_model_err = pd.concat((fan1_df, fan2_df, foil1_df, foil2_df))
        new_model_err["Err"] = new_model_err["Err"].astype(float)
        new_model_err["StimType"] = new_model_err["StimType"].astype(int)

        return new_model_err

Such that:

    df = pd.read_csv("in.csv", delimiter=",", index_col=0)
    df_plotable(df).to_csv("out.csv")

Is there a way to do this more cleanly?

\$\endgroup\$

    1 Answer

    \$\begingroup\$

    Don't hard-code your transformation

    With your current approach, as soon as you are faced with a new PairType or StimType you will have to adjust your function accordingly to account for them. What your current code is doing is really a hard-coded version of a wide-form to long-form conversion of the column data - and Pandas has methods to allow you to do that in an automated way.

    Two options would be df.melt, or a combination of df.unstack and reset_index.

    Either way, after this step you'll nearly be all the way there.

        >>> model_err.melt(var_name='PairStim', value_name='Err')
           PairStim     Err
        0      fan1  0.0000
        1      fan1  0.0625
        2      fan1  0.0625
        3      fan2  0.0000
        4      fan2  0.0000
        5      fan2  0.0000
        6     foil1  0.0000
        7     foil1  0.0625
        8     foil1  0.0000
        9     foil2  0.1250
        10    foil2  0.1250
        11    foil2  0.3125
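
    For comparison, the unstack and reset_index route mentioned above might look roughly like the sketch below. It rebuilds the sample data from the question inline so that it runs on its own. The level name "Trial" is a made-up label for the original row index, not something from the question. Unlike melt with default arguments, this route keeps the original row labels as a column.

```python
import io

import pandas as pd

# Sample data copied from the question, loaded from a string instead of in.csv.
CSV_TEXT = """,fan1,fan2,foil1,foil2
0,0.0,0.0,0.0,0.125
1,0.0625,0.0,0.0625,0.125
2,0.0625,0.0,0.0,0.3125
"""
model_err = pd.read_csv(io.StringIO(CSV_TEXT), index_col=0)

# With a flat row index, DataFrame.unstack() returns a Series whose
# MultiIndex is (column name, original row label).
# "Trial" is a hypothetical name chosen for the row-label level.
long_form = (
    model_err.unstack()
    .rename_axis(["PairStim", "Trial"])
    .reset_index(name="Err")
)

print(long_form)
```

    The rows come out in the same column-by-column order as with melt, with an extra column holding the original row labels.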

    The only other step in automation is to split the PairStim column up into its components, and do some cleanup. Wrapping this up in your function:

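        def longform_model_error(model_err: pd.DataFrame):
            # Wide-form to long-form: one row per (column, observation) pair
            melted = model_err.melt(var_name='PairStim', value_name='Err')
            # Split e.g. 'foil2' into a non-digit prefix and a numeric suffix.
            # \D+ (rather than \w+) keeps multi-digit suffixes together.
            melted[['PairType', 'StimType']] = (
                melted['PairStim'].str.extract(r'(\D+)(\d+)', expand=True)
            )
            melted['PairType'] = melted['PairType'].str.capitalize()
            melted['StimType'] = melted['StimType'].astype(int)
            return melted.drop('PairStim', axis='columns')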
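
    As a quick check, here is a minimal, self-contained sketch that applies the function to the sample data from the question and then averages the error per group, which is roughly what a bar plot would display. The function is declared again here so the snippet runs on its own. It uses a raw-string pattern with \D+ for the prefix, which is my assumption about a safe way to split the column names. The means variable is only for illustration.

```python
import io

import pandas as pd


def longform_model_error(model_err: pd.DataFrame) -> pd.DataFrame:
    # Wide-form to long-form: one row per (column, observation) pair.
    melted = model_err.melt(var_name="PairStim", value_name="Err")
    # Split names such as "foil2" into a non-digit prefix and a numeric suffix.
    melted[["PairType", "StimType"]] = melted["PairStim"].str.extract(
        r"(\D+)(\d+)", expand=True
    )
    melted["PairType"] = melted["PairType"].str.capitalize()
    melted["StimType"] = melted["StimType"].astype(int)
    return melted.drop("PairStim", axis="columns")


# Sample data copied from the question, loaded from a string instead of in.csv.
CSV_TEXT = """,fan1,fan2,foil1,foil2
0,0.0,0.0,0.0,0.125
1,0.0625,0.0,0.0625,0.125
2,0.0625,0.0,0.0,0.3125
"""
model_err = pd.read_csv(io.StringIO(CSV_TEXT), index_col=0)

long_form = longform_model_error(model_err)

# Mean error for every (PairType, StimType) combination.
means = long_form.groupby(["PairType", "StimType"])["Err"].mean()
print(means)
```

    If matplotlib is installed, means.unstack("StimType").plot.bar() should draw one group of bars per PairType.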

    Comments on your current approach

    • Consider using the .size attribute of a Series (or .shape[0]) instead of len. To me, this makes it clearer that model_err.fan1 is indeed a Pandas object.

    • If you did want to hard-code the transformation for some reason, creating a DataFrame for each group and concat-ing them together is not ideal - you might as well remain in NumPy-land for as long as possible before turning it into a DataFrame. You would also benefit from hard-coding the separate columns instead of separate rows, as the dtypes of the columns are homogeneous, so you'd avoid the subsequent casting. Perhaps something like

          def longform_model_error_hardcode(model_err: pd.DataFrame):
              n_per_type = model_err.shape[0]
              pair_types = ('Fan', 'Foil')
              stim_types = (1, 2)
              n_pairs = len(pair_types)
              n_stims = len(stim_types)

              cols = {
                  # Column-major flatten: fan1, fan2, foil1, foil2
                  'Err': model_err.values.ravel('F'),
                  'PairType': np.repeat(pair_types, n_per_type*n_stims),
                  # The stim pattern repeats once per pair type
                  'StimType': np.tile(np.repeat(stim_types, n_per_type), n_pairs)
              }
              return pd.DataFrame(cols)

      This will be faster, though if you were to go down the hard-coding avenue there isn't really much point in using Pandas at all if you're not performing other operations on the data - might as well just stick to NumPy in that case.

    • Nitpicking, df_plotable and new_model_err don't really help me understand what the function is supposed to do, or how new_model_err is different from the input DataFrame. Try and use some more descriptive variable names.
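
    To check that the hard-coded variant and the melt-based variant agree, a small comparison on the sample data could look like the sketch below. Both functions are declared again so it runs on its own. It tiles the stim pattern once per pair type and uses a \D+ prefix pattern, both of which are my assumptions about the intended behaviour. Dtypes are deliberately not compared, since the two routes may produce different integer or string dtypes depending on platform and library version.

```python
import io

import numpy as np
import pandas as pd


def longform_model_error(model_err: pd.DataFrame) -> pd.DataFrame:
    # Generic route: melt, then split the column name into its two parts.
    melted = model_err.melt(var_name="PairStim", value_name="Err")
    melted[["PairType", "StimType"]] = melted["PairStim"].str.extract(
        r"(\D+)(\d+)", expand=True
    )
    melted["PairType"] = melted["PairType"].str.capitalize()
    melted["StimType"] = melted["StimType"].astype(int)
    return melted.drop("PairStim", axis="columns")


def longform_model_error_hardcode(model_err: pd.DataFrame) -> pd.DataFrame:
    # Hard-coded route: assumes the columns are exactly fan1, fan2, foil1, foil2.
    n_per_type = model_err.shape[0]
    pair_types = ("Fan", "Foil")
    stim_types = (1, 2)
    cols = {
        # Column-major flatten: fan1, fan2, foil1, foil2.
        "Err": model_err.values.ravel("F"),
        "PairType": np.repeat(pair_types, n_per_type * len(stim_types)),
        # The stim pattern repeats once per pair type.
        "StimType": np.tile(np.repeat(stim_types, n_per_type), len(pair_types)),
    }
    return pd.DataFrame(cols)


# Sample data copied from the question.
CSV_TEXT = """,fan1,fan2,foil1,foil2
0,0.0,0.0,0.0,0.125
1,0.0625,0.0,0.0625,0.125
2,0.0625,0.0,0.0,0.3125
"""
model_err = pd.read_csv(io.StringIO(CSV_TEXT), index_col=0)

generic = longform_model_error(model_err)
hardcoded = longform_model_error_hardcode(model_err)

# Raises AssertionError if values, column labels or index differ.
pd.testing.assert_frame_equal(generic, hardcoded, check_dtype=False)
```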

    \$\endgroup\$
