How to group identical values and count their frequency in Python?

Question

Newbie to analytics with Python so please be gentle :-) I couldn't find the answer to this question - apologies if it is already answered elsewhere in a different format.

I have a dataset of transaction data for a retail outlet. Variables along with explanation are:

section: the section of the store, a str;
prod_name: name of the product, a str;
receipt: the number of the invoice, an int;
cashier, the number of the cashier, an int;
cost: the cost of the item, a float;
date, in format MM/DD/YY, a str;
time, in format HH:MM:SS, a str;

Receipt has the same value for all the products purchased in a single transaction, thus it can be used to determine the average number of purchases made in a single transaction.

What is the best way to go about this? I essentially want to use groupby() to group the receipt variable by its own identical occurrences so that I can create a histogram.

Working with the data in a pandas DataFrame.

EDIT:

Here is some sample data with header (prod_name is actually a hex number):

 section,prod_name,receipt,cashier,cost,date,time electronics,b46f23e7,102856,5,70.50,05/20/15,9:08:20 womenswear,74558d0d,102857,8,20.00,05/20/15,9:12:46 womenswear,031f36b7,102857,8,30.00,05/20/15,9:12:47 menswear,1d52cd9d,102858,3,65.00,05/20/15,9:08:20

From this sample set I would expect a histogram of receipt that shows two occurrences of receipt 102857 (since that person bought two items in one transaction) and one occurrence respectively of receipt 102856 and of receipt 102858. Note: my dataset is not huge, about 1 million rows.

$\begingroup$Done, added some sample data.$\endgroup$
– new_analyst
CommentedApr 22, 2016 at 4:05 — new_analyst, CommentedApr 22, 2016 at 4:05

Emre · Accepted Answer · 2016-04-22 17:39:35Z

From this sample set I would expect a histogram of receipt that shows two occurrences of receipt 102857 (since that person bought two items in one transaction) and one occurrence respectively of receipt 102856 and of receipt 102858.

Then you want:

df.groupby('receipt').receipt.count()

receipt 102856 1 102857 2 102858 1 Name: receipt, dtype: int64

since the result is no longer a dataframe, how do we filter this to show only the values that have a count of more than 1? — Nikhil VJ, CommentedJul 18, 2018 at 15:51
You can still do things like s[s>1], where s=df.groupby('receipt').receipt.count() — Emre, CommentedJul 18, 2018 at 18:24

Ryan · Accepted Answer · 2016-04-22 21:50:25Z

3

I'm putting together some tutorials around data wrangling. Maybe my jupyter notebook on github will help. I think that it is the key is modifying the line:

df.groupby('male')['age'].mean()

to be:

df.groupby('reciept')['prod_name'].count()

To group by multiple variables this should work:

df.groupby(['reciept','date'])['reciept'].count()

edited Apr 22, 2016 at 21:50

answered Apr 21, 2016 at 20:40

Ryan

7163 silver badges11 bronze badges

$\begingroup$Thanks for that. However, sometimes receipt is repeated (when the date is different too). Therefore, we are aggregating all the different occurrences of receipt even on different dates, when we really want to know the number of receipt by transaction - however there is not a unique transaction ID variable. I don't think receipt repeats on the same day - can we use date as a way to group? df.groupby('reciept')['date'].count() gives the same result as df.groupby('reciept')['prod_name'].count()$\endgroup$
– new_analyst
CommentedApr 22, 2016 at 4:27
$\begingroup$Add date as a parameter inside the groupby call. Edited my answer above to group by multiple vars.$\endgroup$
– Ryan
CommentedApr 22, 2016 at 21:51

Add a comment |

Zephyr · Accepted Answer · 2020-07-31 15:36:13Z

From what I can understand is that you would need a histogram of your receipt no. You can try something like this

import pandas as pd data = np.read_csv("your_file_path.csv") data.groupby(["receipt"])receipt.count().sort_values(ascending=False).head(20).plot.bar()

This will give you bar plots of most repeating billing numbers (20 most repeating) Change the number in the head function to get more or less.

Stack Exchange Network

How to group identical values and count their frequency in Python?

3 Answers 3

Hot Network Questions

How to group identical values and count their frequency in Python?

3 Answers 3

Related

Hot Network Questions