I have an eCommerce-like system which produces 5,000 user events (of different kinds, like product search / product view / profile view) per second.
Now, for reporting, business users would like to view different dimensions like:
1. Find user sessions (say, of 30 mins) in any given x days.
2. Find the number of searches / product views / profile views that happened in any given x days (a sketch of such a query follows this list).
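To make query 2 concrete, here is a minimal sketch of what it could look like against Elasticsearch (the store proposed below), using the Java high-level REST client. The index name `sessions` and the fields `session_start`/`search_count` are my assumptions, not an existing schema:

```java
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class SearchCountQuery {
    // "Number of searches in the last x days" = a date-range filter
    // plus a sum over a per-session counter. Index/field names are made up.
    public static SearchRequest lastXDaysSearches(int x) {
        return new SearchRequest("sessions").source(new SearchSourceBuilder()
            .query(QueryBuilders.rangeQuery("session_start").gte("now-" + x + "d/d"))
            .aggregation(AggregationBuilders.sum("total_searches").field("search_count")));
    }
}
```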
There are two parts involved in the above use case:
1. Computation/aggregation of the events data.
2. How to store the data efficiently.
First, thoughts and questions on the storage part, as this will decide how to compute/aggregate the data:
- I believe I should store the data per day, since a day is the most granular unit the data can be asked for. It can be stored in Elasticsearch so that it is searchable and can be aggregated over any x days.
- Should I store each dimension (sessions/searches/product views) separately, or should the session be the top-level object which internally (nested) contains the other related data (searches, product views, etc.) for that session? Then, if a product-views query comes in, it can be served from there directly (see the sketch after this list).
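To make the second option concrete, here is a minimal sketch of a session-as-top-level document, assuming it gets serialized to JSON (e.g. with Jackson) and indexed with the inner lists mapped as `nested` fields; all class and field names here are made up for illustration:

```java
import java.time.Instant;
import java.util.List;

// Sketch of one session document; searches/product views/profile views
// live inside the session, so a product-views query can be answered from
// this same index with a nested aggregation.
public class SessionDoc {
    public String userId;
    public Instant sessionStart;
    public Instant sessionEnd;
    public List<SearchEvent> searches;      // mapped as nested in ES
    public List<ProductView> productViews;  // mapped as nested in ES
    public List<ProfileView> profileViews;  // mapped as nested in ES

    public static class SearchEvent { public Instant ts; public String query; }
    public static class ProductView { public Instant ts; public String productId; }
    public static class ProfileView { public Instant ts; public String profileId; }
}
```

One trade-off to be aware of: queries against nested fields need nested queries/aggregations, which are more expensive than hits against a flat per-event index, so it may be worth also storing plain counters (e.g. `search_count`) on the session for the common count queries.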
Second, how to design the aggregation/computation part. Here are my thoughts:
- A collector (say, Java-based) will put the events into a scalable messaging system like a partitioned Kafka topic.
- Multiple Spark consumers will process the events from the topic so that the computation can be done in parallel and near real time.
- These Spark consumers aggregate the events per user per 30 mins and store the result in Elasticsearch, where it can be searched through a Kibana dashboard (a sketch follows this list).
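For the queue-to-Elasticsearch path, here is a minimal sketch using Spark Structured Streaming (the `session_window` function needs Spark 3.2+, and the Elasticsearch sink needs the elasticsearch-hadoop connector on the classpath). The broker address, topic name, event schema, and index name are all assumptions:

```java
import static org.apache.spark.sql.functions.*;

import java.util.Collections;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class SessionAggregationJob {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("session-agg").getOrCreate();

        // Read raw events from a partitioned Kafka topic and parse the JSON payload.
        Dataset<Row> events = spark.readStream()
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")
            .option("subscribe", "user-events")
            .load()
            .selectExpr("CAST(value AS STRING) AS json")
            .select(from_json(col("json"),
                "user_id STRING, event_type STRING, ts TIMESTAMP",
                Collections.emptyMap()).alias("e"))
            .select("e.*");

        // A session window closes after 30 minutes of inactivity per user;
        // the watermark bounds how long per-user state is kept in memory.
        Dataset<Row> sessions = events
            .withWatermark("ts", "30 minutes")
            .groupBy(col("user_id"), session_window(col("ts"), "30 minutes"))
            .agg(
                count(when(col("event_type").equalTo("search"), 1)).alias("search_count"),
                count(when(col("event_type").equalTo("product_view"), 1)).alias("product_view_count"),
                count(when(col("event_type").equalTo("profile_view"), 1)).alias("profile_view_count"));

        // Write each closed session to Elasticsearch for Kibana dashboards.
        StreamingQuery query = sessions.writeStream()
            .outputMode("append")
            .format("es")
            .option("checkpointLocation", "/tmp/checkpoints/session-agg")
            .start("user-sessions");
        query.awaitTermination();
    }
}
```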
Aggregation can be computed like this (see the sketch after these steps):
a. Get the event from the queue, extract the user_id, and maintain an in-memory map where the key is the user_id and the value is a session object that internally contains the searches/product views/profile views for that session.
b. Once the next event for a user arrives more than 30 mins after the previous one, push that user's session object from the map to Elasticsearch.
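And a minimal sketch of steps a/b themselves, with `Event`/`Session` as stand-in types (a real consumer would poll events from Kafka and batch the writes to Elasticsearch):

```java
import java.util.HashMap;
import java.util.Map;

// Map-based sessionization: key = user_id, value = the open session.
public class Sessionizer {
    private static final long GAP_MS = 30 * 60 * 1000L; // 30-min inactivity gap
    private final Map<String, Session> open = new HashMap<>();

    public void onEvent(Event e) {
        Session s = open.get(e.userId);
        if (s != null && e.ts - s.lastEventTs > GAP_MS) {
            flushToElastic(s); // step b: session closed by 30 mins of inactivity
            s = null;
        }
        if (s == null) {
            s = new Session(e.userId, e.ts);
            open.put(e.userId, s); // step a: start a new session for this user
        }
        s.add(e);
        s.lastEventTs = e.ts;
    }

    private void flushToElastic(Session s) { /* index the session document */ }

    static class Event { String userId; String type; long ts; }

    static class Session {
        final String userId; final long startTs; long lastEventTs;
        int searches, productViews, profileViews;
        Session(String userId, long ts) { this.userId = userId; startTs = ts; lastEventTs = ts; }
        void add(Event e) {
            if ("search".equals(e.type)) searches++;
            else if ("product_view".equals(e.type)) productViews++;
            else if ("profile_view".equals(e.type)) profileViews++;
        }
    }
}
```

One caveat with step b as written: a session is only flushed when that user's next event arrives, so a user who never returns leaves a session in memory forever. A periodic sweep that flushes sessions idle for more than 30 mins (which is effectively what Spark's session windows with a watermark give you) closes that gap.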
Is my design for storage and aggregation on the right path?