
I'm building a project and I only know the MERN stack. Everywhere I look, the general consensus is that Mongo is not good and should only be used in very specific use cases. More importantly, I also want to expand my skills, so I thought this would be a great time to try out PostgreSQL, since there also seem to be a ton of jobs requiring it, so I can kill two birds with one stone.

Here's my problem.

I'm building an application with unstructured data. What I mean is that I'm building something like a counter where users can input something and the app counts how many times it has been entered that day. To give an example, let's say it's an app that tracks the snacks you've eaten throughout the day. In Mongo the structure would look like this:

{ user_id: "someId", date: 1586822400, snacks: { "orange": 5, "kit-kat": 100, "peanuts": 30 } } 

Then if the user were to add "peanuts" again, it'll increment the value of peanuts by 1, so it'll say 31 instead of 30. However, if the user were to add a random snack called whizits, it'll upsert a key-value pair, so the db will look like this:

{ user_id: "someId", date: 1586822400, snacks: { "orange": 5, "kit-kat": 100, "peanuts": 30, "whizits": 1 } } 

Then once the day ends, a new document will be created with an empty snacks object and it'll start all over. This is unstructured because I have no idea what snacks the user will end up adding. Sure, there will probably be a few common snacks that will be added like fruits and chips and whatever else, but there will be the occasional whizits.

How can I create a schema like this in Postgres? The only way I can think of is to have a table where the snacks column is just a really long string of snacks split by spaces or something, so I can count all the snacks. It would look like this:

+---------+------------+---------------------------------------------------+
| user_id | date       | snacks                                            |
+---------+------------+---------------------------------------------------+
| someId  | 1586822400 | "orange orange orange orange orange kit-kat ...   |
+---------+------------+---------------------------------------------------+

This just seems like a horrible way of doing this so any other ideas are appreciated!

What's important is that I want writes to be very fast because I want to update in real time; imagine a ton of users entering snacks at the same time. I'm using web sockets to relay the data as fast as possible. Reads won't happen as frequently, since the only time a read will occur is when a user goes to their dashboard to see statistics on how many snacks they ate and other data.

PS. Maybe this is one of those cases where Mongo or some other NoSQL db would do better? I posted my project elsewhere asking which db to use, and people said to use a relational db since this is relational data (users have snacks, etc.). Maybe it's because I'm new to this, but I don't see how I can make this work. If you have any other suggestions I'm all ears!

  • Just curious as to why the downvotes; I tried to give examples and explain what I'm trying to do. If any more info is required I'll be happy to provide! Commented Apr 14, 2020 at 20:14
  • You're still "thinking in Mongo." You have to think like a relational database. Commented Apr 15, 2020 at 20:34

1 Answer


The usual approach would be a table that links snacks to users, perhaps something like this:

CREATE TABLE snack (
    user_id INTEGER NOT NULL REFERENCES users(user_id) ON DELETE CASCADE,
    snack   TEXT NOT NULL,
    count   INTEGER NOT NULL,
    PRIMARY KEY (user_id, snack)
);
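
To record a snack with this design, a single statement can insert the row or bump the count if it already exists. A sketch, using placeholder values and the INTEGER user id assumed above:

-- Insert the snack for this user, or increment its count if it already exists.
INSERT INTO snack (user_id, snack, count)
VALUES (42, 'peanuts', 1)
ON CONFLICT (user_id, snack)
DO UPDATE SET count = snack.count + 1;

This keeps the increment atomic even with many concurrent writers.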

For time-series data, we might rather store individual entries and aggregate them later at query time:

CREATE TABLE snack (
    user_id INTEGER NOT NULL REFERENCES users(user_id) ON DELETE CASCADE,
    date    TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
    snack   TEXT NOT NULL
);

-- Requires the TimescaleDB extension:
SELECT create_hypertable('snack', 'date', chunk_time_interval => INTERVAL '1 day');

(however, my SQL may be rusty).
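
With this layout each write is a plain append. A sketch, reusing the columns above (the timestamp fills in by default), plus an index that per-user dashboard reads would benefit from:

-- One row per snack event; "date" defaults to the current time.
INSERT INTO snack (user_id, snack) VALUES (42, 'peanuts');

-- Speeds up reads that fetch one user's history over a time range.
CREATE INDEX IF NOT EXISTS snack_user_date_idx ON snack (user_id, date);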

The core insight is that we have an n:m relation between users and their snacks, and this relation has associated data – the count in the aggregated version, and the timestamp in the time-series version. In the aggregated version, the primary key constraint enforces that the user:snack relations are unique and not null.

The model as written gives you correctness (although you could use VARCHAR instead of TEXT, or better, a foreign key reference to a table of snack names), and it is probably fast to query. There is also no particular reason why the aggregated model should be slow to write, especially when updating an existing relation. However, Postgres' internals are oriented towards durability rather than maximum throughput at all costs. A simple key–value store could model the same data, but would e.g. lose the foreign key constraint.

In theory you could also use Postgres' JSON operators to implement a data model that feels more natural to you, but in practice the relational design will be easier to work with in SQL and will likely be easier for the database engine to handle efficiently.
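
For illustration only, a rough JSONB sketch that mirrors your Mongo document (the table and column names here are hypothetical, not a recommendation over the tables above):

-- One row per user per day, snacks stored as a JSONB document.
CREATE TABLE daily_snacks (
    user_id TEXT   NOT NULL,
    date    BIGINT NOT NULL,                -- Unix timestamp of the day, as in your example
    snacks  JSONB  NOT NULL DEFAULT '{}'::jsonb,
    PRIMARY KEY (user_id, date)
);

-- Increment (or create) the "peanuts" counter inside the document.
UPDATE daily_snacks
SET snacks = jsonb_set(
        snacks,
        '{peanuts}',
        to_jsonb(COALESCE((snacks ->> 'peanuts')::int, 0) + 1)
    )
WHERE user_id = 'someId' AND date = 1586822400;

Even a simple increment becomes awkward this way, and the database can no longer enforce anything about the individual snack entries.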

For the time-series version, this is likely to offer competitive performance when using the TimescaleDB extension (indicated above by creating a hypertable). This effectively creates a new table chunk for each time interval, e.g. one day, and ensures that the data can later be aggregated efficiently. You do not have to commit to an aggregation interval in advance, and can defer this decision until the query is performed (which should be rare, according to your assumptions).
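
As a sketch of such a deferred aggregation, assuming the time-series table above (time_bucket() comes from TimescaleDB; plain date_trunc('day', date) works without it):

-- Daily totals per snack for one user, computed at read time.
SELECT time_bucket('1 day', date) AS day,
       snack,
       COUNT(*) AS times_eaten
FROM snack
WHERE user_id = 42
GROUP BY day, snack
ORDER BY day, snack;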

  • Thanks! I have a few questions. I assume you meant to write snack instead of name, because you made the primary key (user, snack) and I don't see a snack column. I mentioned that I bucket the data into days so the user can see their snacking average per day, so would the snack table also have a day INTEGER NOT NULL column and make the primary key PRIMARY KEY (user, snack, day)? Would this be a problem if I wanted to bucket it hourly instead? And also, what would the query itself be to enter the data, since if that snack for that day for that user exists it should update, otherwise create? Commented Apr 14, 2020 at 20:43
  • @WildWombat I had missed the bit about time series, and have updated accordingly. Postgres can handle time series very efficiently. Bucketing the data in advance is probably not appropriate – stick to a nice append-only workload and only aggregate the data upon the occasional query. This should be fast if you add an index on (user, date), I think.
    – amon
    Commented Apr 14, 2020 at 21:10
  • Gotcha, makes a lot of sense. I thought about going the append-only route but I thought I might end up with too much data? To be transparent, the project I'm hoping to build is an analytics tool, but it functions exactly how I described in the post. Data such as the browser, screen size, and referrer will be sent through web sockets to the back end. The only difference is there'll be a lot of writes/data. If each user has 100 visitors per day, that's 36,500 rows per year. Not to mention users will have different pages, and I hope to have more than 1 user. Do you not think it'd be an issue? Commented Apr 14, 2020 at 21:29
  • @WildWombat That's still just a few MB per user per year. Keeping scalability in mind might be sensible, but should not prevent you from creating a working system first. A scalable analytics system that has to deal with many TB per year or thousands of writes per second would likely benefit from a sharded NoSQL approach for capturing raw events, and only loading aggregated data into Postgres (like summaries with different granularities).
    – amon
    Commented Apr 22, 2020 at 15:36
