I would do this in SQL using `STRING_AGG`, but the server is SQL Server 2012 (which has no `STRING_AGG`) and is beyond my control, so I'm trying a Python approach.
I have a dataframe of shape [20225 rows x 7 columns], and there is a bit of transformation required. Some rows are duplicates of each other, but only in one column (the name). So what I want to do is find the duplicate rows (where the name is the same) and then
- Concatenate all the email addresses from the three email columns, across all rows with the same name, into one string (dropping nulls)
- Concatenate all the company names from the three company columns, across all rows with the same name, into one string (dropping nulls)
- Create a new dataframe of shape [20106 rows x 3 columns] that then has one row per name, with a single string of email addresses in the second column, and a single string of companies in the third column.
Basically, the duplicate rows have been eliminated, and the different email addresses/companynames have been concatenated.
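
The steps above can be sketched on a toy frame as follows. This is only a sketch under stated assumptions: the sample values are invented, `collapse` is a hypothetical helper name, every non-null cell is assumed to be a string, and `;` is assumed as the separator. It uses `melt` to stack the three columns into one long column and `groupby(...).agg` to join the values per name.

```python
import pandas as pd

# Toy frame mirroring the column layout of the example data (values are invented).
df = pd.DataFrame({
    "Name": ["Person A", "Person A", "Person B"],
    "People1.Email": ["a1@example.com", "a2@example.com", "b1@example.com"],
    "People1.CompanyName": ["Acme", "Acme", "Bolt"],
    "People2.Email": ["a3@example.com", None, "b2@example.com"],
    "People2.CompanyName": ["Apex", None, "Bolt"],
    "People3.Email": [None, None, None],
    "People3.CompanyName": [None, None, None],
})

email_cols = [c for c in df.columns if c.endswith(".Email")]
company_cols = [c for c in df.columns if c.endswith(".CompanyName")]


def collapse(frame, cols, sep=";"):
    """Hypothetical helper: stack `cols` into one column, drop nulls, join per name."""
    long = frame.melt(id_vars="Name", value_vars=cols, value_name="value")
    long = long.dropna(subset=["value"])
    # Assumes the remaining values are strings; str.join fails otherwise.
    return long.groupby("Name", sort=False)["value"].agg(sep.join)


result = (
    pd.concat(
        [collapse(df, email_cols).rename("Emails"),
         collapse(df, company_cols).rename("Companies")],
        axis=1,
    )
    .fillna("")            # a name with no emails or no companies gets an empty string
    .rename_axis("Name")
    .reset_index()
)
print(result)
```

This sketch keeps repeated values (the same company appearing twice is joined twice). If duplicates should be removed, calling `drop_duplicates(subset=["Name", "value"])` on the long frame before grouping is one option.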
My code works, and takes about 6 minutes to run... I don't know enough about this, but I have a hunch it could be a lot faster. I'm just looking for some pointers as to maybe structuring it differently? Thanks for any guidance.
EXAMPLE DATA
Name | People1.Email | People1.CompanyName | People2.Email | People2.CompanyName | People3.Email | People3.CompanyName |
---|---|---|---|---|---|---|
Person A | [email protected] | CompanyName | [email protected] | CompanyName | ||
Person A | [email protected] | CompanyName | [email protected] | CompanyName | ||
Person B | [email protected] | CompanyName | [email protected] | CompanyName | ||
Person C | [email protected] | CompanyName | [email protected] | CompanyName | [email protected] | CompanyName |
Person D | [email protected] | CompanyName | ||||
Person D | [email protected] | CompanyName | ||||
Person D | [email protected] | CompanyName | [email protected] | CompanyName | ||
Person E | [email protected] | CompanyName | [email protected] | CompanyName | [email protected] | CompanyName |
Person E | [email protected] | CompanyName | [email protected] | CompanyName | [email protected] | CompanyName |

DESIRED OUTPUT
Name | Emails | Companies |
---|---|---|
Person A | [email protected];[email protected];[email protected];[email protected] | CompanyName;CompanyName;CompanyName;CompanyName |
Person B | [email protected];[email protected] | CompanyName;CompanyName |
etc |
*DATA TYPES*

    Name                   object
    People1.Email          object
    People1.CompanyName    object
    People2.Email          object
    People2.CompanyName    object
    People3.Email          object
    People3.CompanyName    object

*CODE*

    import time

    import pandas as pd

    print(time.strftime("%H:%M:%S", time.localtime()) + " start")
    pd_xl_file = pd.ExcelFile(r'C:\sample.xlsx')
    df = pd_xl_file.parse(0)
    listOfPeople = df['Name'].unique().tolist()

    # Now create a new df to hold the final result
    df_new = pd.DataFrame()
    for person in listOfPeople:
        # Unique values per column for this person (nulls are filtered out below)
        lstCompanies = (df.loc[df['Name'] == person, 'People1.CompanyName'].unique().tolist()
                        + df.loc[df['Name'] == person, 'People2.CompanyName'].unique().tolist()
                        + df.loc[df['Name'] == person, 'People3.CompanyName'].unique().tolist())
        Companies = [x for x in lstCompanies if pd.isnull(x) == False]
        lstEmails = (df.loc[df['Name'] == person, 'People1.Email'].unique().tolist()
                     + df.loc[df['Name'] == person, 'People2.Email'].unique().tolist()
                     + df.loc[df['Name'] == person, 'People3.Email'].unique().tolist())
        Emails = [x for x in lstEmails if pd.isnull(x) == False]
        # join each list into a single string
        c = ' '.join([item for item in Companies])
        e = ' '.join([item for item in Emails])
        # append to the final result
        new_row = pd.DataFrame({'Name': person, 'Companies': c, 'Emails': e}, index=[0])
        df_new = pd.concat([new_row, df_new.loc[:]]).reset_index(drop=True)
        print('.', end='')
    print(df_new)
    print(time.strftime("%H:%M:%S", time.localtime()) + " end")
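
One possible restructuring of the loop above is sketched here; it has not been timed against the real data. Each `df.loc[df['Name'] == person, ...]` scans the whole frame, and that happens six times per person, while `pd.concat` inside the loop copies the growing result on every iteration. The sketch splits the frame once with `groupby` and builds the result frame in a single call. `combine_by_name` and `joined` are hypothetical names, and the column names are taken from the example data.

```python
import pandas as pd


def combine_by_name(df, sep=" "):
    """Hypothetical helper: one row per Name, values joined with `sep`."""
    email_cols = ["People1.Email", "People2.Email", "People3.Email"]
    company_cols = ["People1.CompanyName", "People2.CompanyName", "People3.CompanyName"]

    def joined(group, cols):
        values = []
        for col in cols:
            # Same per-column de-duplication and null-dropping as the original loop.
            values.extend(group[col].dropna().unique().tolist())
        return sep.join(values)

    rows = []
    # groupby splits the frame once instead of re-scanning it for every person.
    for person, group in df.groupby("Name", sort=False):
        rows.append({
            "Name": person,
            "Companies": joined(group, company_cols),
            "Emails": joined(group, email_cols),
        })
    # Build the result in one call rather than concatenating inside the loop.
    return pd.DataFrame(rows, columns=["Name", "Companies", "Emails"])
```

Assuming `df` is the frame parsed from the Excel file, `df_new = combine_by_name(df)` would take the place of the loop. Two differences from the original: rows come out in order of first appearance (the original prepends each new row, so its output is reversed), and `groupby` skips rows whose `Name` is null by default.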