
In Python using Pandas, I am splitting a dataset's column names into 4 lists based on their suffix. For the first 3 suffixes I use a list comprehension; for the 4th one, a set operation that subtracts the 3 lists from the full list of names:

import pandas as pd

df = pd.DataFrame({
    "alcohol_by_volume": [], "barcode": [],
    "calcium_per_hundred": [], "calcium_unit": [],
    "carbohydrates_per_hundred": [], "carbohydrates_per_portion": [], "carbohydrates_unit": [],
    "cholesterol_per_hundred": [], "cholesterol_unit": [],
    "copper_cu_per_hundred": [], "copper_cu_unit": [],
    "country": [], "created_at": [],
    "energy_kcal_per_hundred": [], "energy_kcal_per_portion": [], "energy_kcal_unit": [],
    "energy_per_hundred": [], "energy_per_portion": [], "energy_unit": [],
    "fat_per_hundred": [], "fat_per_portion": [], "fat_unit": [],
    "fatty_acids_total_saturated_per_hundred": [], "fatty_acids_total_saturated_unit": [],
    "fatty_acids_total_trans_per_hundred": [], "fatty_acids_total_trans_unit": [],
    "fiber_insoluble_per_hundred": [], "fiber_insoluble_unit": [],
    "fiber_per_hundred": [], "fiber_per_portion": [],
    "fiber_soluble_per_hundred": [], "fiber_soluble_unit": [], "fiber_unit": [],
    "folate_total_per_hundred": [], "folate_total_unit": [],
    "folic_acid_per_hundred": [], "folic_acid_unit": [],
    "hundred_unit": [], "id": [], "ingredients_en": [],
    "iron_per_hundred": [], "iron_unit": [],
    "magnesium_per_hundred": [], "magnesium_unit": [],
    "manganese_mn_per_hundred": [],
})

colnames_all = df.columns.to_list()
colnames_unit = [n for n in colnames_all if n.endswith("_unit")]
colnames_per_hundred = [n for n in colnames_all if n.endswith("_per_hundred")]
colnames_per_portion = [n for n in colnames_all if n.endswith("_per_portion")]
colnames_other = list(
    set(colnames_all)
    - set(colnames_unit + colnames_per_hundred + colnames_per_portion)
)

Expected result (2 examples; the other 2 lists are similar to the 1st):

colnames_unit: ['calcium_unit', 'carbohydrates_unit', 'cholesterol_unit', 'copper_cu_unit', 'energy_kcal_unit', 'energy_unit', 'fat_unit', 'fatty_acids_total_saturated_unit', 'fatty_acids_total_trans_unit', 'fiber_insoluble_unit', 'fiber_soluble_unit', 'fiber_unit', 'folate_total_unit', 'folic_acid_unit', 'hundred_unit', 'iron_unit', 'magnesium_unit']

colnames_other: ['ingredients_en', 'country', 'id', 'created_at', 'barcode', 'alcohol_by_volume']

However, this does not look like the best way to do it. Is there a "better" way, i.e. shorter and/or more elegant/idiomatic?

Comments:

  • It's hard to review this small fragment in isolation. It would be better to present a complete function, with its unit tests (or at least some sample input to illustrate it). (Commented Jul 3, 2023 at 9:40)
  • @TobySpeight Added full code for repro. I did not define a function for this; maybe that is already part of the better way to do it? The code repetition and the set subtraction don't look good to me. (Commented Jul 3, 2023 at 17:42)

2 Answers

colnames_all = df.columns.to_list() 

I don't see a clear need for this. We could simply refer to df.columns instead.


colnames_other = list(
    set(colnames_all)
    - set(colnames_unit + colnames_per_hundred + colnames_per_portion)
)

That doesn't seem so bad, to me. Certainly the intent is clear.


colnames_unit = [n for n in colnames_all if n.endswith("_unit")] 

Consider rephrasing this (with an import re at the top of the module) as

import re

colnames_unit = [n for n in colnames_all if re.search(r'_unit$', n)]

That lets us generalize in this way:

colnames_measured = [n for n in df.columns if re.search(r'_(unit|per_hundred|per_portion)$', n)] 

To find the inverse:

colnames_other = [n for n in df.columns if not re.search(r'_(unit|per_hundred|per_portion)$', n)] 
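
The same regex can also capture the suffix, which makes it possible to bucket the columns into a single dict keyed by suffix instead of maintaining parallel lists. A minimal sketch (the group_by_suffix helper is made up for illustration, not part of the answer above):

```python
import re
from collections import defaultdict

# Sketch: group column names by a captured suffix; unmatched names go to "other".
suffix_re = re.compile(r'_(unit|per_hundred|per_portion)$')

def group_by_suffix(colnames):
    groups = defaultdict(list)  # suffix -> list of matching column names
    for name in colnames:
        m = suffix_re.search(name)
        groups[m.group(1) if m else "other"].append(name)
    return dict(groups)

groups = group_by_suffix(["fat_unit", "fat_per_hundred", "barcode"])
# groups == {"unit": ["fat_unit"], "per_hundred": ["fat_per_hundred"],
#            "other": ["barcode"]}
```

Adding a new suffix then only requires editing the regex, not adding another comprehension.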
Comment:

  • Maybe it is not clear, but I need a different list per suffix, so colnames_measured does not work for me. Also, since the suffixes are constant and known, endswith() seems fine; that is not the issue. The question is: if this were a 10,000-column list with, say, a collection of 100 suffixes, what would be the best way to address it? Thank you for your input, though. (Commented Jul 3, 2023 at 20:03)

Don't use comprehensions. Don't use lists. Don't use sets. Use Pandas string vectorisation:

colnames_all = df.columns
is_unit = colnames_all.str.endswith("_unit")
is_hundred = colnames_all.str.endswith("_per_hundred")
is_portion = colnames_all.str.endswith("_per_portion")

colnames_unit = colnames_all[is_unit]
colnames_per_hundred = colnames_all[is_hundred]
colnames_per_portion = colnames_all[is_portion]
colnames_other = colnames_all[~(is_unit | is_hundred | is_portion)]

print(colnames_other)
Index(['alcohol_by_volume', 'barcode', 'country', 'created_at', 'id', 'ingredients_en'], dtype='object') 
Comments:

  • Why are you advising against lists: is it because of readability, performance, or another reason? This SO Q&A goes through some detailed and interesting points about this. I like this method, though. I'm thinking a filtering function may be the best way to avoid code repetition. (Commented Jul 4, 2023 at 8:42)
  • It's not in the Pandas style, and (though it matters more for large input) vectorized operations will be faster. (Commented Jul 4, 2023 at 12:53)
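
The vectorised masks and the filtering-function idea from the comments can be combined. A sketch under the assumption that suffixes are passed in as a list (the split_columns helper and its name are hypothetical, not from either answer):

```python
import pandas as pd

def split_columns(columns, suffixes):
    """Hypothetical helper: map each suffix to its matching columns,
    plus an 'other' bucket for everything unmatched."""
    columns = pd.Index(columns)
    # One vectorised endswith() per suffix, no hand-written repetition.
    groups = {s: columns[columns.str.endswith(s)] for s in suffixes}
    matched = pd.Index([c for cols in groups.values() for c in cols])
    groups["other"] = columns.difference(matched)
    return groups

groups = split_columns(["fat_unit", "fat_per_hundred", "barcode"],
                       ["_unit", "_per_hundred", "_per_portion"])
list(groups["other"])  # ['barcode']
```

This scales to the 100-suffix scenario raised under the first answer: the number of suffixes only changes the input list, not the code.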
