
I am working with and writing a good amount of code in Python using pandas dataframes. One thing I'm really struggling with is how to enforce a "schema" of sorts, or otherwise make it apparent what data fields are inside a dataframe. For example,

say I have a dataframe df with the following columns:

customer_id | order_id | order_amount | order_date | order_time

Now I have a function, get_average_order_amount_per_customer, which just takes the average of the order_amount column per customer, essentially a group-by:

def get_average_order_amount_per_customer(df):
    df = df.groupby(['customer_id']).mean()
    return pd.DataFrame(df['order_amount'])

Now when I look at this function a few weeks from now, I have no idea what is inside the dataframe other than customer_id and order_amount. I would need to go look at the preprocessing steps that use that DataFrame and hope to find the other functions that use order_id, order_date, and order_time. This sometimes means tracing the processing and usage all the way back to the file or database schemas where the data originated. What I would really love is for the dataframe to be strongly typed, or to have some schema that is visible in the code without printing it out and checking logs, so I could see what columns it has, rename them if needed, or add a field with a default value like I would with a class.

With a class, I could just make an Order object and put the fields that I want in there, and then I could check the Order class file and see what fields are available for use.
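For illustration, this is roughly what I mean (the exact field types here are just a guess):

from dataclasses import dataclass
from datetime import date, time

@dataclass
class Order:
    # every field is declared in one place, so I can see at a glance what exists
    customer_id: str
    order_id: str
    order_amount: float
    order_date: date
    order_time: time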

I can't get rid of dataframes altogether, because some of the code relies on dataframes as inputs, e.g. machine learning libraries like scikit-learn, and some of the visualization I do with the dataframes.

Using Python's typing library, I don't think I can express the schema of the columns inside a dataframe.

So is there any kind of design pattern or technique I could follow that would allow me to overcome this hurdle of ambiguity in the dataframe's contents?

    4 Answers


    I had the same problem and resorted to dataclasses. In a nutshell, something like:

    from dataclasses import dataclass, field
    import pandas as pd

    @dataclass
    class OrderDataFields:
        """ data field names aka columns """
        order_id: str = 'order_id'
        order_amount: str = 'order_amount'
        order_date: str = 'order_date'

    @dataclass
    class OrderData:
        data: pd.DataFrame
        # default_factory keeps newer Python versions from rejecting a mutable default
        columns: OrderDataFields = field(default_factory=OrderDataFields)

    Now when you are using your dataframe, put it in the class:

    order_data = OrderData(data=order_df) 

    If you want to perform column checks every time you instantiate a data object, you can define a BaseData class and inherit from it in, for example, your OrderData class:

    from dataclasses import asdict, dataclass, field

    @dataclass
    class BaseData:
        """
        Base class for all Data dataclasses.

        Expected child attributes:
            `data`    : pd.DataFrame -- data
            `columns` : object       -- dataclass with column names

        Raises
        ------
        ValueError
            If columns in `columns` and `data` dataframe don't match
        """
        data: pd.DataFrame
        columns: object

        def __post_init__(self):
            self._check_columns()

        def _check_columns(self):
            """
            Check if columns in the dataframe match the columns in the `columns` attribute

            Raises
            ------
            ValueError
                If columns don't match
            """
            data_columns = set(self.data.columns)
            columns_columns = {v for k, v in asdict(self.columns).items()}
            if data_columns != columns_columns:
                raise ValueError

    @dataclass
    class OrderData(BaseData):
        data: pd.DataFrame
        columns: OrderDataFields = field(default_factory=OrderDataFields)

    And now when you do some wrangling, you can use the dataframe and its columns:

    df = order_data.data
    c = order_data.columns

    df[c.order_amount]
    ....
    ....

    Something along those lines; adjust it for your case.
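    For example, the function from the question could then look something like this (assuming a customer_id field is also added to OrderDataFields):

    def get_average_order_amount_per_customer(order_data: OrderData) -> pd.DataFrame:
        df = order_data.data
        c = order_data.columns
        # the available columns are visible in OrderDataFields instead of being implicit
        return df.groupby(c.customer_id)[[c.order_amount]].mean()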

    There is also the pandera library: https://pandera.readthedocs.io/en/stable/
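    A minimal pandera sketch for the order dataframe could look something like this (the column names and checks are just an example):

    import pandera as pa

    order_schema = pa.DataFrameSchema({
        "customer_id": pa.Column(str),
        "order_id": pa.Column(str),
        "order_amount": pa.Column(float, pa.Check.ge(0)),
    })

    # raises a SchemaError if the dataframe does not match the declared schema
    validated_df = order_schema.validate(order_df)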

    • This is a super nice solution! – Commented Oct 7, 2021 at 18:52
    • Also, I prefer your method to pandera, as with pandera you have to remember to explicitly validate the frame every time; yours feels a bit more robust and is an actual class to pass around. As far as I can tell, pandera is basically a validation tool. – Commented Oct 7, 2021 at 19:36
    • @SimonNicholls Thanks, I have replaced the base class with a regular class using a generic type and had planned to include pandera as an additional check: hastebin.com/upibuhideq.py . It served me well until I decided to switch to C#. – Commented Oct 9, 2021 at 22:25
    • @SimonNicholls fixed docstring: hastebin.com/kamagimoju.py – Commented Oct 9, 2021 at 22:33

    My first idea would be to add type hints and descriptive docstrings to the functions responsible for loading a pandas DataFrame, e.g.:

    import pandas as pd

    def load_client_data(input_path: str) -> pd.DataFrame:
        """Loads the client DataFrame from a csv file, performs data preprocessing
        and returns it.

        The client DataFrame is formatted as follows:
            customer_id (string): represents a customer.
            order_id (string): represents an order.
            order_amount (int): represents the number of items bought.
            order_date (string): the date on which the order was made (YYYY-MM-DD).
            order_time (string): the time at which the order was made (HH:mm:ss).
        """
        client_data = pd.read_csv(input_path)
        preprocessed_client_data = do_preprocessing(client_data)
        return preprocessed_client_data

    Ideally, all functions responsible for loading the datasets would be bundled together in a module, so that at the very least you know where to look whenever you're in doubt. Good/consistent variable names for your datasets will also help you keep track of what dataset you're working with in a downstream function.

    Of course, this all adds a bit of coupling: if you decide to change the columns of a dataset, you need to remember to update the docstring, too. At the end of the day, however, it's a choice between flexibility and reliability: once your program grows in size and becomes more stable, I think it's a fair compromise.

    You'll also want to perform any operations on the dataset itself (adding new columns, parsing the date into day/month/year columns, etc.) as soon as possible, so that the docstring reflects these in-memory changes as well. If your datasets are being transformed all the way down in another function, ask yourself whether you could do this earlier. If that's not possible, at least initialize the dataframe with empty columns that expect future data, and reflect this information in the docstring.
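    For instance, something like this (the extra column names here are just placeholders):

    # reserve columns that later steps will fill in, so the documented schema
    # is complete from the moment the dataframe is loaded
    client_data = client_data.reindex(
        columns=[*client_data.columns, "order_weekday", "discount"]
    )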

    If you want to take this a step further, you can wrap all functions related to loading datasets into a DatasetManager class, which unifies the information about the datasets' signatures. You could even add a helper function to quickly view the docstring for a specific dataset: writing dataset_manager.get_info('client_data') could print out the docstring of the load_client_data function, for example.
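    A rough sketch of that idea (the class and method names just follow the suggestion above):

    class DatasetManager:
        """Bundles the dataset-loading functions and exposes their schema docstrings."""

        _loaders = {"client_data": load_client_data}

        def get_info(self, dataset_name: str) -> None:
            # print the loader's docstring, which documents the dataset's columns
            print(self._loaders[dataset_name].__doc__)

        def load(self, dataset_name: str, input_path: str) -> pd.DataFrame:
            return self._loaders[dataset_name](input_path)

    Calling DatasetManager().get_info('client_data') then prints the schema described in load_client_data's docstring.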

    Lastly, there are a couple of third-party modules that help you enforce data types in pandas DataFrames, if you're okay with that. An example is dataenforce, but as a disclaimer, I've never used it personally.

    • Well, comments are no guarantee of the structure and contents of a DataFrame. – async – Commented Aug 9, 2022 at 8:34

    You can assert on the dataframe's .dtypes, or on an individual df.column.dtype.

    Alternatively, while .astype() is normally used to convert the datatypes of columns, you can also use it to document the schema of the data, by defining which columns are available and what their types are. The dtypes can also be specified in the DataFrame constructor.
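    A small sketch of both ideas, using the columns from the question (the file name and the dtype choices are just an example):

    import pandas as pd

    # example dtypes for the columns from the question
    ORDER_DTYPES = {
        "customer_id": "string",
        "order_id": "string",
        "order_amount": "float64",
    }

    def assert_order_schema(df: pd.DataFrame) -> None:
        # fail fast if a column is missing or has an unexpected dtype
        for column, dtype in ORDER_DTYPES.items():
            assert str(df[column].dtype) == dtype, f"{column} is not {dtype}"

    # .astype() both documents and enforces the schema at load time;
    # it raises if a listed column is missing or cannot be converted
    orders = pd.read_csv("orders.csv").astype(ORDER_DTYPES)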


      You can write a function that validates the data, and decorate the functions you want.

      If you can express your test as a Python function:

      def has_columns(df, columns):
          """
          Checks whether all `columns` are in `df`

          Returns a boolean result and the missing columns
          """
          if isinstance(columns, str):
              # to prevent the later `set` call from mangling our string
              columns = {columns}
          return set(df) >= set(columns), set(columns) - set(df)

      Then you can make a decorator:

      from functools import wraps

      def has_columns_decorator(columns, df_name="df"):
          """
          Checks for the presence of `columns` in an argument to the decorated function

          Expects a function with a DataFrame as keyword argument `df_name`
          or as first argument

          Checks whether all `columns` are columns in the DataFrame

          Raises a ValueError if the check fails
          """
          def decorate(func):
              @wraps(func)
              def wrapper(*args, **kwargs):
                  if df_name in kwargs:
                      df = kwargs.pop(df_name)
                  else:
                      df, *args = args
                  check_result, missing_columns = has_columns(df, columns)
                  if not check_result:
                      raise ValueError(
                          f"Not all columns are present: {missing_columns}"
                      )
                  # pass the dataframe back under the name the function expects
                  result = func(*args, **{df_name: df}, **kwargs)
                  return result
              return wrapper
          return decorate

      You can make this validation as complex as you like: checking dtypes, checking whether values are larger than 0, etc.
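      For example, a dtype check in the same style as has_columns might look something like this (just a sketch):

      def has_dtypes(df, dtypes):
          """
          Checks that the `dtypes` mapping (column name -> dtype string) is satisfied by `df`

          Returns a boolean result and the offending columns
          """
          missing = {col for col in dtypes if col not in df}
          wrong = {
              col: str(df[col].dtype)
              for col, expected in dtypes.items()
              if col in df and str(df[col].dtype) != expected
          }
          return not (missing or wrong), {"missing": missing, "wrong_dtype": wrong}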

      The has_columns_decorator from above can then be used like this:

      @has_columns_decorator("a")
      def my_func(df):
          return df

      my_func(df)

         a  b
      0  0  a
      1  1  b
      2  2  c

      @has_columns_decorator(["a", "c"])
      def my_func2(df):
          return df

      my_func2(df)

      ---------------------------------------------------------------------------
      ValueError                                Traceback (most recent call last)
      <ipython-input-53-997c72a8b535> in <module>
      ----> 1 my_func2(df)

      <ipython-input-50-36b8ff709aa9> in wrapper(*args, **kwargs)
           28             if not check_result:
           29                 raise ValueError(
      ---> 30                     f"Not all columns are present: {missing_columns}"
           31                 )
           32

      ValueError: Not all columns are present: {'c'}

      You can make your checks as elaborate as you want. engarde is a package that already provides some of these checks for you.
