I have an eCommerce-like system which produces 5,000 user events (of different kinds, like product search / product view / profile view) per second.
Now, for reporting, business users would like to view different dimensions like:
1. Find user sessions (say, of 30 mins) in any given x days.
2. Find the number of searches / product views / profile views that happened in any given x days (a sketch of such a query follows this list).
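To make query 2 concrete, here is a minimal sketch of what it could look like against Elasticsearch (the store proposed below), using the Java high-level REST client. The index name `sessions` and the fields `session_start`/`search_count` are my assumptions, not an existing schema:

```java
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class SearchCountQuery {
    // "Number of searches in the last x days" = a date-range filter
    // plus a sum over a per-session counter. Index/field names are made up.
    public static SearchRequest lastXDaysSearches(int x) {
        return new SearchRequest("sessions").source(new SearchSourceBuilder()
            .query(QueryBuilders.rangeQuery("session_start").gte("now-" + x + "d/d"))
            .aggregation(AggregationBuilders.sum("total_searches").field("search_count")));
    }
}
```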
There are two parts involved in the above use case:
1. Computation/aggregation of the events data.
2. How to store the data efficiently.
First, thoughts and questions on the storage part, as this will decide how to compute/aggregate the data:
- I believe I should store the data per day, since a day is the most granular unit the data can be asked for. It can be stored in Elasticsearch so that it is searchable and can be aggregated over any x days.
- Should I store each dimension (sessions/searches/product views) separately, or should the session be the top-level object which internally (nested) contains the other related data (searches, product views, etc.) for that session? Then, if a product-views query comes in, it can be served from there directly (see the sketch after this list).
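To make the second option concrete, here is a minimal sketch of a session-as-top-level document, assuming it gets serialized to JSON (e.g. with Jackson) and indexed with the inner lists mapped as `nested` fields; all class and field names here are made up for illustration:

```java
import java.time.Instant;
import java.util.List;

// Sketch of one session document; searches/product views/profile views
// live inside the session, so a product-views query can be answered from
// this same index with a nested aggregation.
public class SessionDoc {
    public String userId;
    public Instant sessionStart;
    public Instant sessionEnd;
    public List<SearchEvent> searches;      // mapped as nested in ES
    public List<ProductView> productViews;  // mapped as nested in ES
    public List<ProfileView> profileViews;  // mapped as nested in ES

    public static class SearchEvent { public Instant ts; public String query; }
    public static class ProductView { public Instant ts; public String productId; }
    public static class ProfileView { public Instant ts; public String profileId; }
}
```

One trade-off to be aware of: queries against nested fields need nested queries/aggregations, which are more expensive than hits against a flat per-event index, so it may be worth also storing plain counters (e.g. `search_count`) on the session for the common count queries.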
Second, how to design the aggregation/computation part. Here are my thoughts:
- A collector (say, Java-based) will put the events into a scalable messaging system like a partitioned Kafka topic.
- Multiple Spark consumers will process the events from the topic so that the computation can be done in parallel and near real time.
- These Spark consumers aggregate the events per user per 30 mins and store the result in Elasticsearch, where it can be searched through a Kibana dashboard (a sketch follows this list).
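For the queue-to-Elasticsearch path, here is a minimal sketch using Spark Structured Streaming (the `session_window` function needs Spark 3.2+, and the Elasticsearch sink needs the elasticsearch-hadoop connector on the classpath). The broker address, topic name, event schema, and index name are all assumptions:

```java
import static org.apache.spark.sql.functions.*;

import java.util.Collections;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class SessionAggregationJob {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("session-agg").getOrCreate();

        // Read raw events from a partitioned Kafka topic and parse the JSON payload.
        Dataset<Row> events = spark.readStream()
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")
            .option("subscribe", "user-events")
            .load()
            .selectExpr("CAST(value AS STRING) AS json")
            .select(from_json(col("json"),
                "user_id STRING, event_type STRING, ts TIMESTAMP",
                Collections.emptyMap()).alias("e"))
            .select("e.*");

        // A session window closes after 30 minutes of inactivity per user;
        // the watermark bounds how long per-user state is kept in memory.
        Dataset<Row> sessions = events
            .withWatermark("ts", "30 minutes")
            .groupBy(col("user_id"), session_window(col("ts"), "30 minutes"))
            .agg(
                count(when(col("event_type").equalTo("search"), 1)).alias("search_count"),
                count(when(col("event_type").equalTo("product_view"), 1)).alias("product_view_count"),
                count(when(col("event_type").equalTo("profile_view"), 1)).alias("profile_view_count"));

        // Write each closed session to Elasticsearch for Kibana dashboards.
        StreamingQuery query = sessions.writeStream()
            .outputMode("append")
            .format("es")
            .option("checkpointLocation", "/tmp/checkpoints/session-agg")
            .start("user-sessions");
        query.awaitTermination();
    }
}
```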
Aggregation can be computed like this (see the sketch after these steps):
a. Get the event from the queue, extract the user_id, and maintain an in-memory map where the key is the user_id and the value is a session object that internally contains the searches/product views/profile views for that session.
b. Once the next event for a user arrives more than 30 mins after the previous one, push that user's session object from the map to Elasticsearch.
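And a minimal sketch of steps a/b themselves, with `Event`/`Session` as stand-in types (a real consumer would poll events from Kafka and batch the writes to Elasticsearch):

```java
import java.util.HashMap;
import java.util.Map;

// Map-based sessionization: key = user_id, value = the open session.
public class Sessionizer {
    private static final long GAP_MS = 30 * 60 * 1000L; // 30-min inactivity gap
    private final Map<String, Session> open = new HashMap<>();

    public void onEvent(Event e) {
        Session s = open.get(e.userId);
        if (s != null && e.ts - s.lastEventTs > GAP_MS) {
            flushToElastic(s); // step b: session closed by 30 mins of inactivity
            s = null;
        }
        if (s == null) {
            s = new Session(e.userId, e.ts);
            open.put(e.userId, s); // step a: start a new session for this user
        }
        s.add(e);
        s.lastEventTs = e.ts;
    }

    private void flushToElastic(Session s) { /* index the session document */ }

    static class Event { String userId; String type; long ts; }

    static class Session {
        final String userId; final long startTs; long lastEventTs;
        int searches, productViews, profileViews;
        Session(String userId, long ts) { this.userId = userId; startTs = ts; lastEventTs = ts; }
        void add(Event e) {
            if ("search".equals(e.type)) searches++;
            else if ("product_view".equals(e.type)) productViews++;
            else if ("profile_view".equals(e.type)) profileViews++;
        }
    }
}
```

One caveat with step b as written: a session is only flushed when that user's next event arrives, so a user who never returns leaves a session in memory forever. A periodic sweep that flushes sessions idle for more than 30 mins (which is effectively what Spark's session windows with a watermark give you) closes that gap.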
Is my design for storage and aggregation on the right path?