
Pandas dataframe.drop_duplicates()

Last Updated : 25 Nov, 2024

The Pandas drop_duplicates() method removes duplicate rows from a DataFrame, based either on all columns or on specific ones.

By default, drop_duplicates() compares entire rows and removes every repeated occurrence, retaining only the first instance. Let’s see a quick example:

Python
import pandas as pd

data = {
    "Name": ["Alice", "Bob", "Alice", "David"],
    "Age": [25, 30, 25, 40],
    "City": ["NY", "LA", "NY", "Chicago"],
}
df = pd.DataFrame(data)
print(df)

# Removing duplicates
unique_df = df.drop_duplicates()
print(unique_df)

Output:

[Figure: Pandas dataframe.drop_duplicates()]

This example shows how pandas.DataFrame.drop_duplicates() removes duplicate rows while retaining the first occurrence, which is the default and most commonly used behaviour.

dataframe.drop_duplicates() Syntax in Python

Syntax: DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)

Parameters:

  • subset: A column label or a list of column labels. The default is None, which means all columns are compared. When columns are passed, only those columns are considered when identifying duplicates. (Optional)
  • keep: Controls which duplicates, if any, are kept. It accepts three values, and the default is 'first'.
    • If 'first', the first occurrence is treated as unique and the remaining identical rows as duplicates.
    • If 'last', the last occurrence is treated as unique and the remaining identical rows as duplicates.
    • If False, all identical rows are treated as duplicates and dropped.
  • inplace: Boolean, default False. If True, the duplicates are removed from the original DataFrame and the method returns None.
  • ignore_index: Boolean, default False. If True, the remaining rows are relabeled 0, 1, …, n - 1. (Available in pandas 1.0 and later.)

Return type: A DataFrame with duplicate rows removed according to the arguments passed, or None if inplace=True.
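
To make the return value concrete, here is a minimal sketch on a small made-up DataFrame (the names and ages are illustrative only). It shows that a default call returns a new DataFrame, leaves the original untouched, and preserves the original index labels of the surviving rows.

```python
import pandas as pd

# Illustrative data: rows 0 and 2 are identical.
df = pd.DataFrame({"Name": ["Alice", "Bob", "Alice"], "Age": [25, 30, 25]})

# Default call: returns a new DataFrame; df itself is not modified.
result = df.drop_duplicates()

print(len(df))                # 3 (original still has the duplicate)
print(len(result))            # 2
print(result.index.tolist())  # [0, 1] (original labels are kept)
```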

Python dataframe.drop_duplicates(): Examples

Duplicate rows can arise from merging datasets, incorrect data entry, or other causes. drop_duplicates() identifies duplicates based on all columns (the default) or on specified columns and removes them according to the options you pass. The examples below illustrate the method:
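
As one possible illustration of how duplicates arise when datasets are combined, the sketch below stacks two hypothetical batches (batch_a and batch_b are invented for this example) with pd.concat and then removes the repeated row.

```python
import pandas as pd

# Two hypothetical batches of records that share one row ("Bob", "LA").
batch_a = pd.DataFrame({"Name": ["Alice", "Bob"], "City": ["NY", "LA"]})
batch_b = pd.DataFrame({"Name": ["Bob", "David"], "City": ["LA", "Chicago"]})

# Stacking the batches repeats the shared row.
combined = pd.concat([batch_a, batch_b], ignore_index=True)

# Remove the repeated row, keeping its first occurrence.
deduped = combined.drop_duplicates()

print(len(combined))             # 4
print(deduped["Name"].tolist())  # ['Alice', 'Bob', 'David']
```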

1. Dropping Duplicates Based on Specific Columns

You can target duplicates in specific columns using the subset parameter. This helps when certain fields are more relevant for identifying duplicates.

Python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'David'],
    'Age': [25, 30, 25, 40],
    'City': ['NY', 'LA', 'SF', 'Chicago']
})

# Drop duplicates based on the 'Name' column
result = df.drop_duplicates(subset=['Name'])
print(result)

Output
    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
3  David   40  Chicago

Here, duplicates are removed based solely on the Name column, ignoring the other fields. This is helpful when specific columns uniquely identify rows.
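
The subset argument also accepts several columns at once, in which case rows count as duplicates only if they match on every listed column. A short sketch using the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "David"],
    "Age": [25, 30, 25, 40],
    "City": ["NY", "LA", "SF", "Chicago"],
})

# Rows 0 and 2 match on both Name and Age, so row 2 is dropped.
by_name_age = df.drop_duplicates(subset=["Name", "Age"])

# No two rows match on both Name and City, so nothing is dropped.
by_name_city = df.drop_duplicates(subset=["Name", "City"])

print(by_name_age.index.tolist())  # [0, 1, 3]
print(len(by_name_city))           # 4
```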

2. Keeping the Last Occurrence

By default, drop_duplicates() retains the first occurrence of duplicates. However, you can retain the last duplicate instead using keep='last'.

Python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'David'],
    'Age': [25, 30, 25, 40],
    'City': ['NY', 'LA', 'NY', 'Chicago']
})

# Keep the last occurrence of duplicates
result = df.drop_duplicates(keep='last')
print(result)

Output
    Name  Age     City
1    Bob   30       LA
2  Alice   25       NY
3  David   40  Chicago

Passing keep='last' ensures the last occurrence of each duplicate is retained instead of the first.
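
A common use of keep='last' is to retain the most recent record per key. The sketch below assumes a hypothetical "Updated" column holding ISO-formatted date strings (not part of the example above). It sorts by that column first so that "last" means "newest".

```python
import pandas as pd

# Hypothetical update log: "Alice" appears twice with different cities.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice"],
    "City": ["NY", "LA", "SF"],
    "Updated": ["2024-01-10", "2024-02-01", "2024-03-05"],
})

# Sort oldest to newest, then keep the last (newest) row for each Name.
latest = (
    df.sort_values("Updated")
      .drop_duplicates(subset=["Name"], keep="last")
)

print(latest["Name"].tolist())  # ['Bob', 'Alice']
print(latest["City"].tolist())  # ['LA', 'SF']
```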

3. Dropping All Duplicates

To remove all rows with duplicates, use keep=False. This keeps only rows that are entirely unique.

Python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'David'],
    'Age': [25, 30, 25, 40],
    'City': ['NY', 'LA', 'NY', 'Chicago']
})

# Drop all duplicates
result = df.drop_duplicates(keep=False)
print(result)

Output
    Name  Age     City
1    Bob   30       LA
3  David   40  Chicago

With keep=False, all occurrences of duplicate rows are removed, leaving only rows that are entirely unique across all columns.
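
To inspect which rows keep=False would discard before dropping them, one option is the companion method DataFrame.duplicated(), which returns a boolean mask. A minimal sketch on the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "David"],
    "Age": [25, 30, 25, 40],
    "City": ["NY", "LA", "NY", "Chicago"],
})

# True for every row that has at least one identical counterpart.
mask = df.duplicated(keep=False)

removed = df[mask]   # the rows keep=False would discard
kept = df[~mask]     # the rows keep=False would retain

print(mask.tolist())           # [True, False, True, False]
print(removed.index.tolist())  # [0, 2]
print(kept.equals(df.drop_duplicates(keep=False)))  # True
```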

4. Modifying the Original DataFrame Directly

To modify the original DataFrame directly without creating a new one, use inplace=True.

Python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'David'],
    'Age': [25, 30, 25, 40],
    'City': ['NY', 'LA', 'NY', 'Chicago']
})

# Modify the DataFrame in place
df.drop_duplicates(inplace=True)
print(df)

Output
    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
3  David   40  Chicago

Using inplace=True modifies the original DataFrame directly, so there is no need to assign the result to a new variable. In this mode the method returns None, and it does not necessarily use less memory than reassigning the result.
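
One pitfall worth sketching: with inplace=True the call returns None, so assigning its result would lose the data. The example also shows a reassignment alternative that uses ignore_index=True (a keyword available in pandas 1.0 and later) to renumber the surviving rows. The data is illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "David"],
    "Age": [25, 30, 25, 40],
})

# With inplace=True the method returns None, so do not assign the result.
returned = df.drop_duplicates(inplace=True)
print(returned)            # None
print(df.index.tolist())   # [0, 1, 3]

# Alternative: reassign the result and renumber the index.
df2 = pd.DataFrame({"Name": ["Alice", "Bob", "Alice"], "Age": [25, 30, 25]})
df2 = df2.drop_duplicates(ignore_index=True)
print(df2.index.tolist())  # [0, 1]
```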


