Slow processing of a python dataframe when aggregating across rows and columns

Question

I would do this in SQL using string_agg but the server is SQL Server 2012 and beyond my control. So I'm trying a python approach.

I have a dataframe of shape [20225 rows x 7 columns], and there is a bit of transformation required. Some rows are duplicates of each other, but only in the Name column. So what I want to do is find the duplicate rows (where the name is the same) and then:

Concatenate all the email addresses from the three email columns, across all rows with a matching name, into one string (dropping nulls)
Concatenate all the company names from the three company columns, across all rows with a matching name, into one string (dropping nulls)
Create a new dataframe of shape [20106 rows x 3 columns] that then has one row per name, with a single string of email addresses in the second column, and a single string of companies in the third column.

Basically, the duplicate rows have been eliminated, and the different email addresses/companynames have been concatenated.

My code works, and takes about 6 minutes to run... I don't know enough about this, but I have a hunch it could be a lot faster. I'm just looking for some pointers as to maybe structuring it differently? Thanks for any guidance.

EXAMPLE DATA

Name	People1.Email	People1.CompanyName	People2.Email	People2.CompanyName	People3.Email	People3.CompanyName
Person A	[email protected]	CompanyName	[email protected]	CompanyName
Person A	[email protected]	CompanyName	[email protected]	CompanyName
Person B			[email protected]	CompanyName	[email protected]	CompanyName
Person C	[email protected]	CompanyName	[email protected]	CompanyName	[email protected] CompanyName
Person D	[email protected]	CompanyName
Person D			[email protected]	CompanyName
Person D	[email protected]	CompanyName	[email protected]	CompanyName
Person E	[email protected]	CompanyName	[email protected]	CompanyName	[email protected]	CompanyName
Person E	[email protected]	CompanyName	[email protected]	CompanyName	[email protected]	CompanyName

Name	Emails	Companies
Person A	[email protected];[email protected];[email protected];[email protected]	CompanyName; CompanyName;CompanyName; CompanyName
Person B	[email protected];[email protected]	CompanyName; CompanyName;CompanyName
etc

DATA TYPES

Name                   object
People1.Email          object
People1.CompanyName    object
People2.Email          object
People2.CompanyName    object
People3.Email          object
People3.CompanyName    object

CODE

import time
import pandas as pd

print (time.strftime("%H:%M:%S", time.localtime()) + " start")

pd_xl_file = pd.ExcelFile(r'C:\sample.xlsx')
df = pd_xl_file.parse(0)

listOfPeople = df['Name'].unique().tolist()

# Now create a new df to hold the final result
df_new = pd.DataFrame()

for person in listOfPeople:
    lstCompanies = (
        df.loc[df['Name'] == person, 'People1.CompanyName'].unique().tolist()
        + df.loc[df['Name'] == person, 'People2.CompanyName'].unique().tolist()
        + df.loc[df['Name'] == person, 'People3.CompanyName'].unique().tolist()
    )
    Companies = [x for x in lstCompanies if pd.isnull(x) == False]

    lstEmails = (
        df.loc[df['Name'] == person, 'People1.Email'].unique().tolist()
        + df.loc[df['Name'] == person, 'People2.Email'].unique().tolist()
        + df.loc[df['Name'] == person, 'People3.Email'].unique().tolist()
    )
    Emails = [x for x in lstEmails if pd.isnull(x) == False]

    # initialize list of lists
    c = ' '.join([item for item in Companies])
    e = ' '.join([item for item in Emails])

    # append to the final result
    new_row = pd.DataFrame({'Name':person, 'Companies':c, 'Emails':e}, index=[0])
    df_new = pd.concat([new_row,df_new.loc[:]]).reset_index(drop=True)
    print ('.', end='')

print (df_new)
print (time.strftime("%H:%M:%S", time.localtime()) + " end")

Reinderien · Accepted Answer · 2022-07-15 00:24:24Z

Don't loop with for person in listOfPeople, and don't call tolist().

Your data are misshapen. There should not be multiple Email and CompanyName columns; there should only be one each.

Group by the name, and then aggregate using a string join.
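
As a minimal sketch of that grouping step on its own, assume the data have already been reshaped so that each row holds one contact with a single Email and a single Company column. The small frame below is made-up illustration data, not the asker's, and the name long_df is hypothetical.

```python
import pandas as pd

# Hypothetical long-format data: one contact per row (illustration only).
long_df = pd.DataFrame({
    'Name': ['Person A', 'Person A', 'Person B'],
    'Email': ['a1@example.com', 'a2@example.com', 'b1@example.com'],
    'Company': ['Acme', 'Initech', 'Acme'],
})

# One row per name; the values within each group are joined with ';'.
combined = long_df.groupby('Name').agg({
    'Email': ';'.join,
    'Company': ';'.join,
})
print(combined)
```

Because the join runs once per group inside pandas rather than once per name in a Python loop with repeated boolean filtering, it avoids rescanning the whole frame for every person.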

Suggested

import pandas as pd

df = pd.read_csv('278083.csv', index_col='Name')

to_concat = []
for i in range(1, df.shape[1]//2 + 1):
    email = f'People{i}.Email'
    company = f'People{i}.CompanyName'
    sub = (
        df[[email, company]]
        .dropna()
        .rename({
            email: 'Email', company: 'Company'
        }, axis='columns')
    )
    sub['Contact'] = i
    to_concat.append(sub)

df = pd.concat(to_concat).set_index(keys='Contact', append=True)

join = ';'.join
combined = df.groupby('Name').agg({
    'Email': join,
    'Company': join,
})
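
The loop bound range(1, df.shape[1]//2 + 1) in the code above can be unpacked as follows. df.shape is a (rows, columns) tuple, so df.shape[1] is the number of columns; with Name used as the index, six columns remain. The // operator is floor division, giving the number of email/company pairs, and the + 1 is needed because range excludes its end value. A small sketch, assuming an empty frame that only carries the question's column names (the names n_columns, n_contacts and contact_numbers are illustrative):

```python
import pandas as pd

# Column layout from the question, with Name already moved to the index.
columns = [
    'People1.Email', 'People1.CompanyName',
    'People2.Email', 'People2.CompanyName',
    'People3.Email', 'People3.CompanyName',
]
df = pd.DataFrame(columns=columns)

n_columns = df.shape[1]        # number of columns in the frame
n_contacts = n_columns // 2    # each contact uses two columns

# range() stops before its end value, hence the + 1.
contact_numbers = list(range(1, n_contacts + 1))
print(contact_numbers)
```

The resulting numbers are the values of i substituted into f'People{i}.Email' and f'People{i}.CompanyName'.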

Thanks, but the misshapen data is my reality. This is people data, right? I have a single "person" described with more than one email and more than one company name, in fact three of each. So I am trying to merge them into a searchable string. - Maxcot, commented Jul 15, 2022 at 1:35
The suggested code handles this by reshaping to a single email column with multiple values. If you can't control the format of the data, you should process it into this form. - Reinderien, commented Jul 15, 2022 at 2:22
Thanks. After revisiting the creation of the data, I was able to sort it out. Beautiful solution ... does the job in under 10 sec. Much appreciated. If you have the time, what does this syntax mean: df.shape[1]//2 + 1? - Maxcot, commented Jul 15, 2022 at 3:26
