Python Pandas - GroupBy



Pandas groupby() is an essential method for data aggregation and analysis in Python. It follows the "Split-Apply-Combine" pattern, which means it allows users to −

  • Split data into groups based on specific criteria.

  • Apply functions independently to each group.

  • Combine the results into a structured format.

In this tutorial, we will learn about the basics of groupby operations in pandas, such as splitting data, viewing groups, and selecting specific groups using an example dataset.

Introduction to GroupBy Operations

Every groupby() operation involves three key steps: splitting the data into groups based on some criteria, applying a function independently to each group, and combining the results into a meaningful structure.
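To make the three steps concrete, the sketch below performs them by hand and then compares the result with a single groupby() call. The small DataFrame and the variable names (pieces, means, manual) are assumptions made for illustration; they are not part of the tutorial's dataset.

```python
import pandas as pd

# Small illustrative dataset (hypothetical values, not the tutorial's data)
df = pd.DataFrame({
    "Team": ["Riders", "Devils", "Riders", "Devils"],
    "Points": [876, 863, 789, 673],
})

# Split: one sub-DataFrame per distinct Team value
pieces = {team: df[df["Team"] == team] for team in df["Team"].unique()}

# Apply: compute the mean of Points within each piece
means = {team: piece["Points"].mean() for team, piece in pieces.items()}

# Combine: gather the per-group results into a single Series
manual = pd.Series(means, name="Points").sort_index()

# The same three steps expressed as one groupby() call
grouped = df.groupby("Team")["Points"].mean()

print(manual)
print(grouped)
```

Both printed Series should hold the same per-team means; groupby() simply does the bookkeeping for you.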

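In many situations, we apply a function to each of the split groups. The apply step typically performs one of three kinds of operation: aggregation (computing a summary statistic per group), transformation (a group-specific computation that returns a result indexed like the input), or filtration (discarding groups that fail some condition).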
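A minimal sketch of what the apply step can look like in practice, showing an aggregation, a transformation, and a filtration on the same grouped object. The DataFrame and the chosen conditions are hypothetical and only serve to illustrate the calls.

```python
import pandas as pd

# Hypothetical data for illustration only
df = pd.DataFrame({
    "Team": ["Riders", "Riders", "Devils", "Devils", "Kings"],
    "Points": [876, 789, 863, 673, 741],
})
grouped = df.groupby("Team")

# Aggregation: reduce each group to a single summary value
total = grouped["Points"].sum()

# Transformation: return a result aligned with the original rows,
# here each row's difference from its own group's mean
centered = grouped["Points"].transform(lambda s: s - s.mean())

# Filtration: keep only the rows of groups that satisfy a condition,
# here teams that have at least two rows
multi = grouped.filter(lambda g: len(g) >= 2)

print(total)
print(centered)
print(multi)
```

Note that the aggregation has one entry per group, the transformation has one entry per original row, and the filtration returns a subset of the original rows.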

Split Data into Groups

Pandas objects can be split into groups based on any of their column values using the groupby() method.

Example

Let us now see how to create a grouping object from a Pandas DataFrame using the groupby() method.

    # import the pandas library
    import pandas as pd

    ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings',
                         'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
                'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
                'Year': [2014, 2015, 2014, 2015, 2014, 2015,
                         2016, 2017, 2016, 2014, 2015, 2017],
                'Points': [876, 789, 863, 673, 741, 812,
                           756, 788, 694, 701, 804, 690]}
    df = pd.DataFrame(ipl_data)

    # Display the Original DataFrame
    print("Original DataFrame:")
    print(df)

    # Display the Grouped Data
    print('\nGrouped Data:')
    print(df.groupby('Team'))

Following is the output of the above code −

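    Original DataFrame:
          Team  Rank  Year  Points
    0   Riders     1  2014     876
    1   Riders     2  2015     789
    2   Devils     2  2014     863
    3   Devils     3  2015     673
    4    Kings     3  2014     741
    5    kings     4  2015     812
    6    Kings     1  2016     756
    7    Kings     1  2017     788
    8   Riders     2  2016     694
    9   Royals     4  2014     701
    10  Royals     1  2015     804
    11  Riders     2  2017     690

    Grouped Data:
    <pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f1a11545060>

Note that groupby() returns a DataFrameGroupBy object rather than a DataFrame, so printing it shows only the object's type and memory address. The address will differ from run to run.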
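One way to inspect what groupby() actually produced is to iterate over the returned object, which yields one (key, sub-DataFrame) pair per group. The sketch below assumes a small hypothetical DataFrame rather than the full dataset above.

```python
import pandas as pd

# Hypothetical data for illustration only
df = pd.DataFrame({
    "Team": ["Riders", "Riders", "Devils", "Kings"],
    "Points": [876, 789, 863, 741],
})

grouped = df.groupby("Team")

# Iterating yields (group key, sub-DataFrame) pairs,
# sorted by key because groupby() sorts group keys by default
for team, rows in grouped:
    print(team)
    print(rows)

# Number of groups, and number of rows in each group
print(grouped.ngroups)
print(grouped.size())
```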

GroupBy with Multiple Columns

You can group data based on multiple columns by passing a list of column names to the groupby() method.

Example

Here is an example where the data is grouped by multiple columns.

    # import the pandas library
    import pandas as pd

    # Create a DataFrame
    ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings',
                         'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
                'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
                'Year': [2014, 2015, 2014, 2015, 2014, 2015,
                         2016, 2017, 2016, 2014, 2015, 2017],
                'Points': [876, 789, 863, 673, 741, 812,
                           756, 788, 694, 701, 804, 690]}
    df = pd.DataFrame(ipl_data)

    # Display the Grouped Data
    print('Grouped Data:')
    print(df.groupby(['Team', 'Year']).groups)

Its output is as follows −

    Grouped Data:
    {('Devils', 2014): [2], ('Devils', 2015): [3], ('Kings', 2014): [4], ('Kings', 2016): [6], ('Kings', 2017): [7], ('Riders', 2014): [0], ('Riders', 2015): [1], ('Riders', 2016): [8], ('Riders', 2017): [11], ('Royals', 2014): [9], ('Royals', 2015): [10], ('kings', 2015): [5]}
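As the output shows, grouping by two columns produces tuple keys. The following sketch, which assumes a reduced hypothetical DataFrame, illustrates how such tuple keys can be used to select a group and to look up an aggregated value.

```python
import pandas as pd

# Hypothetical data for illustration only
df = pd.DataFrame({
    "Team": ["Riders", "Riders", "Devils", "Devils"],
    "Year": [2014, 2015, 2014, 2015],
    "Points": [876, 789, 863, 673],
})

by_team_year = df.groupby(["Team", "Year"])

# Each group key is a (Team, Year) tuple
riders_2015 = by_team_year.get_group(("Riders", 2015))
print(riders_2015)

# Aggregating gives a result indexed by a (Team, Year) MultiIndex
points = by_team_year["Points"].sum()
print(points)

# A single value can be looked up with the same kind of tuple
print(points.loc[("Devils", 2014)])
```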

Viewing Grouped Data

Once you have your data split into groups, you can view them using different methods. One of the simplest ways is the .groups attribute, which returns a dictionary-like object mapping each group key to the row labels that belong to that group.

Example

The following example demonstrates how to view the grouped data using the .groups attribute.

    # import the pandas library
    import pandas as pd

    # Create DataFrame
    ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings',
                         'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
                'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
                'Year': [2014, 2015, 2014, 2015, 2014, 2015,
                         2016, 2017, 2016, 2014, 2015, 2017],
                'Points': [876, 789, 863, 673, 741, 812,
                           756, 788, 694, 701, 804, 690]}
    df = pd.DataFrame(ipl_data)

    print('Viewing Grouped Data:')
    print(df.groupby('Team').groups)

Its output is as follows −

    Viewing Grouped Data:
    {'Devils': [2, 3], 'Kings': [4, 6, 7], 'Riders': [0, 1, 8, 11], 'Royals': [9, 10], 'kings': [5]}
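The values stored in .groups are row labels, so they can be handed back to the DataFrame to retrieve the matching rows. This is a small sketch under the assumption of a hypothetical five-row DataFrame; the variable name riders_labels is illustrative.

```python
import pandas as pd

# Hypothetical data for illustration only
df = pd.DataFrame({
    "Team": ["Riders", "Riders", "Devils", "Kings", "Riders"],
    "Points": [876, 789, 863, 741, 694],
})

groups = df.groupby("Team").groups

# Keys are the group names; values are the row labels in each group
print(sorted(groups.keys()))
riders_labels = groups["Riders"]
print(list(riders_labels))

# The labels can be passed to .loc to pull out the matching rows
print(df.loc[riders_labels])
```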

Selecting a Specific Group

Using the get_group() method, we can select a specific group.

Example

The following example demonstrates selecting a group from grouped data using the get_group() method.

    # import the pandas library
    import pandas as pd

    ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings',
                         'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
                'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
                'Year': [2014, 2015, 2014, 2015, 2014, 2015,
                         2016, 2017, 2016, 2014, 2015, 2017],
                'Points': [876, 789, 863, 673, 741, 812,
                           756, 788, 694, 701, 804, 690]}
    df = pd.DataFrame(ipl_data)

    grouped = df.groupby('Year')

    # Display the Selected Data
    print('Selected Group Data:')
    print(grouped.get_group(2014))

Its output is as follows −

    Selected Group Data:
         Team  Rank  Year  Points
    0  Riders     1  2014     876
    2  Devils     2  2014     863
    4   Kings     3  2014     741
    9  Royals     4  2014     701
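If the requested key does not exist among the groups, get_group() raises a KeyError. The sketch below shows one possible way to handle that case; the helper get_group_or_empty() is hypothetical (it is not part of pandas), and the DataFrame is a reduced illustrative one.

```python
import pandas as pd

# Hypothetical data for illustration only
df = pd.DataFrame({
    "Team": ["Riders", "Devils", "Kings"],
    "Year": [2014, 2014, 2015],
    "Points": [876, 863, 812],
})
grouped = df.groupby("Year")


def get_group_or_empty(gb, key, template):
    """Hypothetical helper: return the group for `key`, or an empty frame."""
    try:
        return gb.get_group(key)
    except KeyError:
        # No rows carry this key; return an empty frame with the same columns
        return template.iloc[0:0]


print(get_group_or_empty(grouped, 2014, df))  # rows for an existing key
print(get_group_or_empty(grouped, 2020, df))  # empty frame for a missing key
```

Returning an empty frame is only one design choice; letting the KeyError propagate may be preferable when a missing group indicates a genuine error.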