Pandas dataframe.drop_duplicates()

Last Updated : 25 Nov, 2024

Pandas drop_duplicates() method helps in removing duplicates from the Pandas Dataframe allows to remove duplicate rows from a DataFrame, either based on all columns or specific ones in python.

By default, drop_duplicates() scans the entire DataFrame for duplicate rows and removes all subsequent occurrences, retaining only the first instance being the simple and efficient method. Let’s see a quick example:

Python

importpandasaspddata={"Name":["Alice","Bob","Alice","David"],"Age":[25,30,25,40],"City":["NY","LA","NY","Chicago"]}df=pd.DataFrame(data)display(df)# Removing duplicatesunique_df=df.drop_duplicates()display(unique_df)

Output:

Pandas dataframe.drop_duplicates()

This example demonstrates how duplicate rows are removed while retaining the first occurrence using pandas.DataFrame.drop_duplicates() since it’s commonly used and recommended.

dataframe.drop_duplicates() Syntax in Python :

Syntax: DataFrame.drop_duplicates(subset=None, keep=’first’, inplace=False)
Parameters:
subset: Subset takes a column or list of column label. It’s default value is none. After passing columns, it will consider them only for duplicates. ( Optional)
keep: keep is to control how to consider duplicate value. It has only three distinct value and default is ‘first’.
If ‘first‘, it considers first value as unique and rest of the same values as duplicate.
If ‘last‘, it considers last value as unique and rest of the same values as duplicate.
If False, it consider all of the same values as duplicates
inplace: Boolean values, removes rows with duplicates if True.
Return type: DataFrame with removed duplicate rows depending on Arguments passed.

Python dataframe.drop_duplicates() : Examples

Duplicate rows can arise due to merging datasets, incorrect data entry, or other reasons. The drop_duplicates() works by identifying duplicates based on all columns (default) or specified columns and removing them as per your requirements. Below, we are discussing examples of dataframe.drop_duplicates() method:

1. Dropping Duplicates Based on Specific Columns

You can target duplicates in specific columns using the subset parameter. This helps when certain fields are more relevant for identifying duplicates.

Python

importpandasaspddf=pd.DataFrame({'Name':['Alice','Bob','Alice','David'],'Age':[25,30,25,40],'City':['NY','LA','SF','Chicago']})# Drop duplicates based on the 'Name' columnresult=df.drop_duplicates(subset=['Name'])print(result)

Output

 Name Age City 0 Alice 25 NY 1 Bob 30 LA 3 David 40 Chicago

Here, duplicates are removed based solely on the Name column, ignoring the other fields. This is helpful when specific columns uniquely identify rows.

2. Keeping the Last Occurrence

By default, drop_duplicates() retains the first occurrence of duplicates. However, you can retain the last duplicate instead using keep='last'.

Python

importpandasaspddf=pd.DataFrame({'Name':['Alice','Bob','Alice','David'],'Age':[25,30,25,40],'City':['NY','LA','NY','Chicago']})# Keep the last occurrence of duplicatesresult=df.drop_duplicates(keep='last')print(result)

Output

 Name Age City 1 Bob 30 LA 2 Alice 25 NY 3 David 40 Chicago

The keep='last' parameter ensures the last occurrence of each duplicate is retained instead of the first.

3. Dropping All Duplicates

To remove all rows with duplicates, use keep=False. This keeps only rows that are entirely unique.

Python

importpandasaspddf=pd.DataFrame({'Name':['Alice','Bob','Alice','David'],'Age':[25,30,25,40],'City':['NY','LA','NY','Chicago']})# Drop all duplicatesresult=df.drop_duplicates(keep=False)print(result)

Output

 Name Age City 1 Bob 30 LA 3 David 40 Chicago

With keep=False, all occurrences of duplicate rows are removed, leaving only rows that are entirely unique across all columns.

4. Modifying the Original DataFrame Directly

To modify the original DataFrame directly without creating a new one, use inplace=True.

Python

importpandasaspddf=pd.DataFrame({'Name':['Alice','Bob','Alice','David'],'Age':[25,30,25,40],'City':['NY','LA','NY','Chicago']})# Modify the DataFrame in placedf.drop_duplicates(inplace=True)print(df)

Output

 Name Age City 0 Alice 25 NY 1 Bob 30 LA 3 David 40 Chicago

Using inplace=True modifies the original DataFrame directly, saving memory and avoiding the need to assign the result to a new variable.

Pandas DataFrame.dropna() Method

Kartikaybhutani

Improve

Article Tags :

Pandas dataframe.drop_duplicates()

Python dataframe.drop_duplicates() : Examples

1. Dropping Duplicates Based on Specific Columns

2. Keeping the Last Occurrence

3. Dropping All Duplicates

4. Modifying the Original DataFrame Directly

Similar Reads

Thank You!

What kind of Experience do you want to share?