
In Python using Pandas, I am splitting a dataset's column names into 4 lists based on their suffix. For the first 3 suffixes I use a list comprehension; for the 4th one, a set operation that subtracts the 3 lists from the full list of names:

import pandas as pd

df = pd.DataFrame({
    "alcohol_by_volume": [], "barcode": [],
    "calcium_per_hundred": [], "calcium_unit": [],
    "carbohydrates_per_hundred": [], "carbohydrates_per_portion": [], "carbohydrates_unit": [],
    "cholesterol_per_hundred": [], "cholesterol_unit": [],
    "copper_cu_per_hundred": [], "copper_cu_unit": [],
    "country": [], "created_at": [],
    "energy_kcal_per_hundred": [], "energy_kcal_per_portion": [], "energy_kcal_unit": [],
    "energy_per_hundred": [], "energy_per_portion": [], "energy_unit": [],
    "fat_per_hundred": [], "fat_per_portion": [], "fat_unit": [],
    "fatty_acids_total_saturated_per_hundred": [], "fatty_acids_total_saturated_unit": [],
    "fatty_acids_total_trans_per_hundred": [], "fatty_acids_total_trans_unit": [],
    "fiber_insoluble_per_hundred": [], "fiber_insoluble_unit": [],
    "fiber_per_hundred": [], "fiber_per_portion": [],
    "fiber_soluble_per_hundred": [], "fiber_soluble_unit": [], "fiber_unit": [],
    "folate_total_per_hundred": [], "folate_total_unit": [],
    "folic_acid_per_hundred": [], "folic_acid_unit": [],
    "hundred_unit": [], "id": [], "ingredients_en": [],
    "iron_per_hundred": [], "iron_unit": [],
    "magnesium_per_hundred": [], "magnesium_unit": [],
    "manganese_mn_per_hundred": [],
})

colnames_all = df.columns.to_list()
colnames_unit = [n for n in colnames_all if n.endswith("_unit")]
colnames_per_hundred = [n for n in colnames_all if n.endswith("_per_hundred")]
colnames_per_portion = [n for n in colnames_all if n.endswith("_per_portion")]
colnames_other = list(
    set(colnames_all)
    - set(colnames_unit + colnames_per_hundred + colnames_per_portion)
)

Expected result (2 examples; the other 2 lists are similar to the 1st):

colnames_unit: ['calcium_unit', 'carbohydrates_unit', 'cholesterol_unit', 'copper_cu_unit', 'energy_kcal_unit', 'energy_unit', 'fat_unit', 'fatty_acids_total_saturated_unit', 'fatty_acids_total_trans_unit', 'fiber_insoluble_unit', 'fiber_soluble_unit', 'fiber_unit', 'folate_total_unit', 'folic_acid_unit', 'hundred_unit', 'iron_unit', 'magnesium_unit']

colnames_other: ['ingredients_en', 'country', 'id', 'created_at', 'barcode', 'alcohol_by_volume']

However, this does not look like the best way to do it. Is there a "better" way, i.e. shorter and/or more elegant/idiomatic?

Comments:

  • It's hard to review this small fragment in isolation. It would be better to present a complete function, with its unit tests (or at least some sample input to illustrate it). (Commented Jul 3, 2023 at 9:40)
  • @TobySpeight Added full code for repro. I did not define a function for this; maybe that is already part of the better way to do it? The code repetition and the set subtraction don't look good to me. (Commented Jul 3, 2023 at 17:42)

2 Answers

colnames_all = df.columns.to_list() 

I don't see a clear need for this. We could simply refer to df.columns instead.


colnames_other = list(
    set(colnames_all)
    - set(colnames_unit + colnames_per_hundred + colnames_per_portion)
)

That doesn't seem so bad, to me. Certainly the intent is clear.


colnames_unit = [n for n in colnames_all if n.endswith("_unit")] 

Consider rephrasing this (with an import re at the top of the module) as

import re

colnames_unit = [n for n in colnames_all if re.search(r'_unit$', n)]

That lets us generalize in this way:

colnames_measured = [n for n in df.columns if re.search(r'_(unit|per_hundred|per_portion)$', n)] 

To find the inverse:

colnames_other = [n for n in df.columns if not re.search(r'_(unit|per_hundred|per_portion)$', n)] 
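
The same regex can also capture the suffix, which makes it possible to bucket the columns into a single dict keyed by suffix instead of maintaining parallel lists. A minimal sketch (the group_by_suffix helper is made up for illustration, not part of the answer above):

```python
import re
from collections import defaultdict

# Sketch: group column names by a captured suffix; unmatched names go to "other".
suffix_re = re.compile(r'_(unit|per_hundred|per_portion)$')

def group_by_suffix(colnames):
    groups = defaultdict(list)  # suffix -> list of matching column names
    for name in colnames:
        m = suffix_re.search(name)
        groups[m.group(1) if m else "other"].append(name)
    return dict(groups)

groups = group_by_suffix(["fat_unit", "fat_per_hundred", "barcode"])
# groups == {"unit": ["fat_unit"], "per_hundred": ["fat_per_hundred"],
#            "other": ["barcode"]}
```

Adding a new suffix then only requires editing the regex, not adding another comprehension.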
Comment:

  • Maybe it is not clear, but I need a different list per suffix, so colnames_measured does not work for me. Also, since the suffixes are constant and known, endswith() seems fine; that is not the issue. The question is: if this were a 10,000-column list with, say, a collection of 100 suffixes, what would be the best way to address it? Thank you for your input, though. (Commented Jul 3, 2023 at 20:03)

Don't use comprehensions. Don't use lists. Don't use sets. Use Pandas string vectorisation:

colnames_all = df.columns
is_unit = colnames_all.str.endswith("_unit")
is_hundred = colnames_all.str.endswith("_per_hundred")
is_portion = colnames_all.str.endswith("_per_portion")

colnames_unit = colnames_all[is_unit]
colnames_per_hundred = colnames_all[is_hundred]
colnames_per_portion = colnames_all[is_portion]
colnames_other = colnames_all[~(is_unit | is_hundred | is_portion)]

print(colnames_other)
Index(['alcohol_by_volume', 'barcode', 'country', 'created_at', 'id', 'ingredients_en'], dtype='object') 
Comments:

  • Why are you advising against lists: is it because of readability, performance, or another reason? This SO Q&A goes through some detailed and interesting points about this. I like this method, though. I'm thinking a filtering function may be the best way to avoid code repetition. (Commented Jul 4, 2023 at 8:42)
  • It's not in the Pandas style, and (though it matters more for large input) vectorized operations will be faster. (Commented Jul 4, 2023 at 12:53)
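
The vectorised masks and the filtering-function idea from the comments can be combined. A sketch under the assumption that suffixes are passed in as a list (the split_columns helper and its name are hypothetical, not from either answer):

```python
import pandas as pd

def split_columns(columns, suffixes):
    """Hypothetical helper: map each suffix to its matching columns,
    plus an 'other' bucket for everything unmatched."""
    columns = pd.Index(columns)
    # One vectorised endswith() per suffix, no hand-written repetition.
    groups = {s: columns[columns.str.endswith(s)] for s in suffixes}
    matched = pd.Index([c for cols in groups.values() for c in cols])
    groups["other"] = columns.difference(matched)
    return groups

groups = split_columns(["fat_unit", "fat_per_hundred", "barcode"],
                       ["_unit", "_per_hundred", "_per_portion"])
list(groups["other"])  # ['barcode']
```

This scales to the 100-suffix scenario raised under the first answer: the number of suffixes only changes the input list, not the code.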
