
Questions tagged [big-data]

2 votes
1 answer
239 views

What is an optimal system design for tracking product views per user that is scalable?

I have a web application that contains products and users. There are 10,000+ products and 100,000+ users, to give a sense of the scale that's required. For some application-specific reasons, I need to ...
asked by kitkat
0 votes
1 answer
94 views

Data file ingestion with MinIO and Kafka

I want to collect a lot of files (file data + metadata) from local servers to a central server. The files are important; I need to ensure that no files are lost. Local servers: implement a collector to ...
asked by kietheros
3 votes
1 answer
978 views

How to store a huge volume of time-series datapoints in an efficient way?

We have an application producing 5k-10k datapoints per second. Each datapoint has more than one metric, alongside its time of creation. We are looking for an efficient, scalable way to store this huge ...
asked by Paul Benn
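For write-heavy time-series workloads like the one above, a common pattern is to buffer datapoints in memory and flush them in time-partitioned batches rather than writing each point individually. A minimal sketch, assuming Python and plain newline-delimited JSON files as a stand-in for a columnar format such as Parquet; `PartitionedWriter`, the hour-partition layout, and the batch size are all illustrative choices, not the asker's actual stack:

```python
import json
import time
from collections import defaultdict
from pathlib import Path

class PartitionedWriter:
    """Buffer datapoints in memory and flush them in batches to
    hour-partitioned newline-delimited JSON files."""

    def __init__(self, root: Path, batch_size: int = 5000):
        self.root = root
        self.batch_size = batch_size
        self.buffers = defaultdict(list)  # partition key -> pending rows

    def write(self, ts: float, metrics: dict) -> None:
        # Partition by UTC hour, e.g. "2024/01/15/13".
        key = time.strftime("%Y/%m/%d/%H", time.gmtime(ts))
        buf = self.buffers[key]
        buf.append({"ts": ts, **metrics})
        if len(buf) >= self.batch_size:
            self.flush(key)

    def flush(self, key: str) -> None:
        part_dir = self.root / key
        part_dir.mkdir(parents=True, exist_ok=True)
        path = part_dir / f"batch-{time.time_ns()}.ndjson"
        with path.open("w") as f:
            for row in self.buffers.pop(key, []):
                f.write(json.dumps(row) + "\n")

    def flush_all(self) -> None:
        for key in list(self.buffers):
            self.flush(key)
```

Batching amortizes per-write overhead, and the hour partitions let later queries prune by time range without scanning everything.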
5 votes
1 answer
1k views

How do you perform accumulation on large data sets and pass the results as a REST API response?

I have around 125 million event records on s3. The s3 bucket structure is: year/month/day/hour/*. Inside each hour directory, we have files for every minute. A typical filename looks like this: ...
asked by Namah
1 vote
0 answers
478 views

How to (simply) architect a way to ingest multiple types of large files, process them, and send data in chunks to web services?

Note: all of this would be in AWS. What would you suggest for building something that: takes in several different input file types (e.g. CSV, JSON, JSONL, XML, .gz, ...) and that can be ...
0 votes
0 answers
63 views

Should aggregated data include metadata?

I want to create an aggregation job that executes a big DB query and flushes the results into BigQuery. My question is: should I include only the IDs of the entities (campaign ID, advertiser ID, user ID) or should ...
asked by Avi L
0 votes
1 answer
95 views

A program design question: is using HDFS from C a good idea for reading large data?

I mainly have three groups of CSV files (each file is divided into several small files): the first group of CSV files totals 600+ GB (maybe 200+ GB if stored as binary ints, since CSV stores digits as characters), ...
asked by heisthere
2 votes
2 answers
3k views

From Oracle to Apache Parquet: how to handle eventual consistency?

I have an existing production Oracle Database. However, there are performance issues for certain kind of operations, because of the volume of the data, or the complexity of queries. That's why I ...
asked by Klun
1 vote
1 answer
882 views

Loading the Date dimension table of a data warehouse

I have a general question about loading data into a data warehouse (DW). This is basically a follow-up to an older question of mine. I have a general understanding problem about filling the [Date] ...
asked by Steffen Mangold
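A Date dimension is usually pre-generated rather than loaded from source data: you enumerate every calendar day in the warehouse's horizon and derive the attributes from the date itself. A minimal sketch in Python; the column names and the `YYYYMMDD` surrogate key are common conventions, assumed here for illustration:

```python
from datetime import date, timedelta

def date_dimension_rows(start: date, end: date):
    """Yield one Date-dimension row per calendar day from start to end, inclusive."""
    d = start
    while d <= end:
        yield {
            "date_key": int(d.strftime("%Y%m%d")),  # surrogate key, e.g. 20240101
            "full_date": d.isoformat(),
            "year": d.year,
            "quarter": (d.month - 1) // 3 + 1,
            "month": d.month,
            "day": d.day,
            "weekday": d.isoweekday(),              # 1 = Monday ... 7 = Sunday
            "is_weekend": d.isoweekday() >= 6,
        }
        d += timedelta(days=1)

# Generate one full year in a single pass; bulk-insert the rows afterwards.
rows = list(date_dimension_rows(date(2024, 1, 1), date(2024, 12, 31)))
```

Because the table is small (a century is under 37,000 rows), it is typically loaded once up front and extended rarely, not refreshed with each ETL run.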
3 votes
2 answers
167 views

Enterprise application warehousing and relational database

I have a general question about design patterns for an enterprise application. I have read a lot about it, but it's actually hard to find an answer because most of what you find is rather about how to design a data ...
asked by Steffen Mangold
3 votes
2 answers
2k views

Aggregation and storage system design for user event processing?

I have an eCommerce-like system which produces 5,000 user events per second (of different kinds, such as product search/product view/profile view). Now, for reporting, business users would like to view the ...
asked by M Sach
1 vote
3 answers
292 views

Query 30 million HTML documents

I have 30-ish million HTML documents in a file system. There is no emergency: the files are in a reasonable directory tree, and it's not breaking the file system. But I'd like to be able to organize and ...
asked by Martin K
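A common first step for querying a large static document collection is to build a metadata index once, so queries never have to walk the file system again. A minimal sketch using SQLite from the Python standard library; the table layout and the idea of indexing only path/size/mtime are illustrative assumptions (a real index would likely also extract titles or full text):

```python
import sqlite3
from pathlib import Path

def build_index(root: Path, db_path: str = "docs.db") -> sqlite3.Connection:
    """Scan the directory tree once and record per-file metadata in SQLite,
    so the collection can be queried without re-walking the file system."""
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS docs (
            path  TEXT PRIMARY KEY,
            size  INTEGER,
            mtime REAL
        )""")
    rows = ((str(p), p.stat().st_size, p.stat().st_mtime)
            for p in root.rglob("*.html"))
    conn.executemany("INSERT OR REPLACE INTO docs VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn
```

Even at 30 million rows this stays a modest single-file database, and ad-hoc SQL replaces repeated multi-hour directory scans.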
0 votes
0 answers
85 views

Generating a fake number for a 25-digit PII number in a file containing millions of rows

I have to expose some sensitive data containing a PII column with a 25-digit number. The rest of the columns aren't PII. This is done so that the data can be safely shared with the larger ...
asked by stormfield
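One standard approach to this kind of problem is keyed pseudonymization: derive the fake number deterministically from the real one with an HMAC, so the same input always maps to the same token but nothing is recoverable without the key. A minimal sketch, assuming Python; note this is a one-way pseudonym, not format-preserving encryption, and `pseudonymize` and the key handling are illustrative:

```python
import hashlib
import hmac

def pseudonymize(pii: str, key: bytes) -> str:
    """Deterministically map a 25-digit PII number to a 25-digit fake number.
    The same input with the same key always yields the same token."""
    digest = hmac.new(key, pii.encode(), hashlib.sha256).digest()
    # Interpret the MAC as a big integer and keep the low 25 decimal digits.
    return str(int.from_bytes(digest, "big") % 10**25).zfill(25)
```

Determinism keeps joins across files intact, while the secret key (which must never ship with the shared data) prevents anyone from reversing or re-deriving the mapping. With 10^25 possible tokens, accidental collisions across millions of rows are astronomically unlikely.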
2 votes
0 answers
29 views

How to design a report processing model using Spark in the most efficient way

I have a reporting system which gets time-series data from numerous meters (here I refer to it as raw_data). I need to generate several reports based on different combinations of the incoming ...
asked by Remis Haroon - رامز
2 votes
2 answers
1k views

Designing a big data web app

How do you design a website that allows users to query a large amount of user data? More specifically: there are ~100 million users with ~100 TB of data; the data is stored in HDFS (not a database); the number ...
asked by Minh Thai
