\$\begingroup\$

I would do this in SQL using string_agg, but the server is SQL Server 2012 (which predates string_agg) and is beyond my control, so I'm trying a Python approach.

I have a dataframe of shape [20225 rows x 7 columns], and there is a bit of transformation required. Some rows are duplicates of each other, but only in one column (the name). So what I want to do is find the duplicate rows (where the name is the same) and then

  1. Concatenate all the email addresses, across the three email columns and across the rows with a matching name, into one string (dropping nulls)
  2. Concatenate all the company names, across the three company columns and across the rows with a matching name, into one string (dropping nulls)
  3. Create a new dataframe of shape [20106 rows x 3 columns] that then has one row per name, with a single string of email addresses in the second column, and a single string of companies in the third column.

Basically, the duplicate rows have been eliminated, and the different email addresses/companynames have been concatenated.

My code works, and takes about 6 minutes to run... I don't know enough about this, but I have a hunch it could be a lot faster. I'm just looking for some pointers as to maybe structuring it differently? Thanks for any guidance.

EXAMPLE DATA

| Name | People1.Email | People1.CompanyName | People2.Email | People2.CompanyName | People3.Email | People3.CompanyName |
|---|---|---|---|---|---|---|
| Person A | [email protected] | CompanyName | [email protected] | CompanyName | | |
| Person A | [email protected] | CompanyName | [email protected] | CompanyName | | |
| Person B | [email protected] | CompanyName | [email protected] | CompanyName | | |
| Person C | [email protected] | CompanyName | [email protected] | CompanyName | [email protected] | CompanyName |
| Person D | [email protected] | CompanyName | | | | |
| Person D | [email protected] | CompanyName | | | | |
| Person D | [email protected] | CompanyName | [email protected] | CompanyName | | |
| Person E | [email protected] | CompanyName | [email protected] | CompanyName | [email protected] | CompanyName |
| Person E | [email protected] | CompanyName | [email protected] | CompanyName | [email protected] | CompanyName |

| Name | Emails | Companies |
|---|---|---|
| Person A | [email protected];[email protected];[email protected];[email protected] | CompanyName; CompanyName;CompanyName; CompanyName |
| Person B | [email protected];[email protected] | CompanyName; CompanyName;CompanyName |

etc

DATA TYPES

    Name                   object
    People1.Email          object
    People1.CompanyName    object
    People2.Email          object
    People2.CompanyName    object
    People3.Email          object
    People3.CompanyName    object

CODE

    import time

    import pandas as pd

    print (time.strftime("%H:%M:%S", time.localtime()) + " start")

    pd_xl_file = pd.ExcelFile(r'C:\sample.xlsx')
    df = pd_xl_file.parse(0)

    listOfPeople = df['Name'].unique().tolist()

    # Now create a new df to hold the final result
    df_new = pd.DataFrame()

    for person in listOfPeople:
        lstCompanies = (
            df.loc[df['Name'] == person, 'People1.CompanyName'].unique().tolist()
            + df.loc[df['Name'] == person, 'People2.CompanyName'].unique().tolist()
            + df.loc[df['Name'] == person, 'People3.CompanyName'].unique().tolist()
        )
        Companies = [x for x in lstCompanies if pd.isnull(x) == False]

        lstEmails = (
            df.loc[df['Name'] == person, 'People1.Email'].unique().tolist()
            + df.loc[df['Name'] == person, 'People2.Email'].unique().tolist()
            + df.loc[df['Name'] == person, 'People3.Email'].unique().tolist()
        )
        Emails = [x for x in lstEmails if pd.isnull(x) == False]

        # initialize list of lists
        c = ' '.join([item for item in Companies])
        e = ' '.join([item for item in Emails])

        # append to the final result
        new_row = pd.DataFrame({'Name':person, 'Companies':c, 'Emails':e}, index=[0])
        df_new = pd.concat([new_row,df_new.loc[:]]).reset_index(drop=True)
        print ('.', end='')

    print (df_new)
    print (time.strftime("%H:%M:%S", time.localtime()) + " end")

\$\endgroup\$

    1 Answer

    \$\begingroup\$

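    Don't loop with for person in listOfPeople, and don't call tolist(). Both move the work out of pandas and into slow Python-level iteration, and every pass through the loop rescans the whole dataframe several times.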

    Your data are misshapen. There should not be multiple Email and CompanyName columns; there should only be one each.

    Group by the name, and then aggregate using a string join.
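
As a minimal sketch of the group-and-join step on its own, assume the data are already in long form (one row per name, email and company). The names, addresses and companies below are made up for illustration and are not from the question's spreadsheet.

```python
import pandas as pd

# Hypothetical long-form data: one row per (name, email, company) triple.
long_df = pd.DataFrame({
    'Name': ['Person A', 'Person A', 'Person B'],
    'Email': ['a1@example.com', 'a2@example.com', 'b1@example.com'],
    'Company': ['Acme', 'Globex', 'Initech'],
})

# One output row per name; each column is collapsed with a ';' join.
combined = long_df.groupby('Name').agg({
    'Email': ';'.join,
    'Company': ';'.join,
})
print(combined)
```

The result is indexed by Name, with a single joined string in each of the Email and Company columns.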

    Suggested

        import pandas as pd

        df = pd.read_csv('278083.csv', index_col='Name')

        # Reshape: stack each PeopleN.Email / PeopleN.CompanyName pair
        # into a single Email column and a single Company column.
        to_concat = []
        for i in range(1, df.shape[1]//2 + 1):
            email = f'People{i}.Email'
            company = f'People{i}.CompanyName'
            sub = (
                df[[email, company]]
                .dropna()
                .rename({
                    email: 'Email', company: 'Company'
                }, axis='columns')
            )
            sub['Contact'] = i  # remember which pair the row came from
            to_concat.append(sub)

        df = pd.concat(to_concat).set_index(keys='Contact', append=True)

        # Aggregate: one row per name, values joined with ';'
        join = ';'.join
        combined = df.groupby('Name').agg({
            'Email': join, 'Company': join,
        })

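
For trying the same approach without the CSV file, here is a small self-contained sketch on an in-memory frame. The sample values are invented, only two contact pairs are used to keep it short, and the join_unique helper is a hypothetical addition (not part of the suggested code) that drops repeated values, roughly mirroring the unique() calls in the question.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the spreadsheet, indexed by Name.
df = pd.DataFrame({
    'Name': ['Person A', 'Person A', 'Person D'],
    'People1.Email': ['a1@example.com', 'a1@example.com', 'd1@example.com'],
    'People1.CompanyName': ['Acme', 'Acme', 'Initech'],
    'People2.Email': ['a2@example.com', np.nan, np.nan],
    'People2.CompanyName': ['Globex', np.nan, np.nan],
}).set_index('Name')

# Reshape each PeopleN pair into shared Email / Company columns.
to_concat = []
for i in range(1, df.shape[1] // 2 + 1):
    email = f'People{i}.Email'
    company = f'People{i}.CompanyName'
    sub = (
        df[[email, company]]
        .dropna()
        .rename({email: 'Email', company: 'Company'}, axis='columns')
    )
    to_concat.append(sub)
long_df = pd.concat(to_concat)


def join_unique(values):
    """Hypothetical helper: join values with ';', keeping first occurrences only."""
    return ';'.join(values.drop_duplicates())


combined = long_df.groupby('Name').agg({
    'Email': join_unique,
    'Company': join_unique,
})
print(combined)
```

One thing to be aware of: dropna() with its default settings discards a pair when either the email or the company is missing, so a contact with an email but no company would not appear in the output.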
    \$\endgroup\$
    • \$\begingroup\$Thanks, but the misshapen data is my reality. This is people data, right? I have a single "person" described with more than one email and more than one company name, in fact three of each. So I am trying to merge them into a searchable string.\$\endgroup\$
      – Maxcot
      Commented Jul 15, 2022 at 1:35
    • \$\begingroup\$The suggested code handles this by reshaping to a single email column with multiple values. If you can't control the format of the data, you should process it into this form.\$\endgroup\$ Commented Jul 15, 2022 at 2:22
    • \$\begingroup\$Thanks. After revisiting the creation of the data, I was able to sort it out. Beautiful solution ... does the job in under 10 sec. Much appreciated. If you have the time, what does the syntax df.shape[1]//2 + 1 mean?\$\endgroup\$
      – Maxcot
      Commented Jul 15, 2022 at 3:26
    • \$\begingroup\$Take the number of columns, floor divide by two, and add one\$\endgroup\$ Commented Jul 15, 2022 at 11:15
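
To spell out that arithmetic, here is a small sketch. It assumes six data columns (three Email plus three CompanyName), which is what remains in the question's layout once Name becomes the index; the variable names are illustrative only.

```python
# Assumption: 6 data columns remain after Name is used as the index.
n_columns = 6

# Floor division discards any remainder: 6 // 2 == 3, and 7 // 2 == 3 too.
n_contacts = n_columns // 2

# range() excludes its end value, hence the "+ 1" to reach the last suffix.
suffixes = list(range(1, n_contacts + 1))
print(suffixes)  # [1, 2, 3]
```

Each value in suffixes then fills the {i} in the People{i}.Email and People{i}.CompanyName column names.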
