2

I use MariaDB to store data using the EAV model. I need EAV because each attribute of each entity can have its own validity (validFrom, validTo). For example, I have an entity person and I track changes made to their homeCity and degree attributes:

attributes table:

attributeId   name
---------------------------
1             'homeCity'
2             'degree'

values table:

valueId   entityId   attributeId   value      validFrom    validTo
--------------------------------------------------------------------
1         1          1             'London'   2022-01-01   2022-12-31
2         1          1             'Paris'    2023-01-01   NULL
3         1          2             'BSc'      2022-01-01   2022-06-30
4         1          2             'MSc'      2022-07-01   NULL
5         2          1             'Madrid'   2022-01-01   NULL

The problem is that I want to query the data conveniently (using WHERE clauses, JOINs with other tables etc.), which is possible only if it is a standard SQL table with rows and columns looking like this:

entityId   homeCity    degree
------------------------------
1          'Paris'     'MSc'
2          'Madrid'    NULL

So I need to programmatically create/update such a table from the EAV data after each EAV data update (or perhaps create/update it only when a user actually requests the data).

Another problem is that such a table is valid only for a certain date (typically the current date). If I want to query data valid as of a different date (e.g. 2022-01-01), I need to dynamically create another table:

entityId   homeCity    degree
------------------------------
1          'London'    'BSc'
2          'Madrid'    NULL

Is it a good approach to dynamically create a table whenever a user asks to view data valid as of a specific date?

Another solution would be to store all validity data (i.e. all versions of a given entity) in a single dynamically created table:

id   entityId   homeCity    degree   validFrom    validTo
------------------------------------------------------------
1    1          'London'    'BSc'    2022-01-01   2022-06-30
2    1          'London'    'MSc'    2022-07-01   2022-12-31
3    1          'Paris'     'MSc'    2023-01-01   NULL
4    2          'Madrid'    NULL     2022-01-01   NULL

In the query I just need to specify a filter on validity: WHERE '2022-01-01' BETWEEN validFrom AND IFNULL(validTo, '9999-01-01'). This solution is ideal for querying, but creating the table is slower, because it is necessary to compute all possible date versions of the entity.
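For example, a point-in-time query against such a table could then look like this (personVersions is just a placeholder name for the dynamically created table):

-- all attribute values of every person as they were on 2022-01-01
SELECT entityId, homeCity, degree
FROM personVersions
WHERE '2022-01-01' BETWEEN validFrom AND IFNULL(validTo, '9999-01-01');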

Is this a good approach, or is it better not to use the EAV model at all?

Some details on my business domain: It is a school agenda app (students, classes, etc.). Each entity has about 50 attributes, almost all of them potentially changeable (surname, e-mail, nationality...). Some of them change only rarely (home address), some change regularly (e.g. every year - the school class a student attends). Attributes usually change individually, but it is not a rule - e.g. when the home address changes, all parts of it change at once (street, city, etc.). I don't expect many users editing the entity at the same time.

7
  • 1
    It is common and reasonable to regularly export denormalized tables for data warehousing style analytics tasks. What kind of schema to use depends on your needs. But for transaction processing tasks, such tables would be problematic. You could use triggers to keep them in sync with the source of truth, but this could still cause issues with some serialization / isolation levels. In your place, I'd think more about whether EAV is appropriate. What specific issue are you addressing with EAV? Is your main motivation user-defined attributes, or having an append-only data model to track changes?
    – amon
    Commented Dec 2, 2023 at 13:04
  • @amon It is both - user-defined attributes and also tracking changes. I store data with up to 10 years of history. They include people changing names, home address etc. I then need to query the data in normalized form very often - not only for analytic reports, but for implementing regular app features. But a small delay in sync is tolerable, I believe. – Commented Dec 2, 2023 at 14:12
  • 2
    This question is missing some numbers. How many of those attributes with a uniform validity interval do you expect? Just the two of your example? 20? 200? 2000? Does the validity really change dynamically at the attribute level, or do you have such changes for many attributes at the same point in time at once? Do you expect many queries to be applied for the same point in time, or is this also quite dynamic? Give us context! Whether EAV is better or worse suited for this system cannot be answered sensibly in thin air.
    – Doc Brown
    Commented Dec 2, 2023 at 16:58
  • What's up with the valueId?
    – Bergi
    Commented Dec 2, 2023 at 20:57
  • @DocBrown The business domain of the app is a school agenda (students, classes, etc.). Each entity has about 50 attributes, almost all of them potentially changeable (surname, e-mail, nationality...). Some of them change only rarely (home address), some change regularly (e.g. every year - the school class a student attends). Attributes usually change individually, but it is not a rule - e.g. when the home address changes, all parts of it change at once (street, city, etc.). I don't expect many users editing the entity at the same time. – Commented Dec 3, 2023 at 1:19

3 Answers

3

In a comment, you wrote

The business domain of the app is a school agenda (students, classes, etc.). Each entity has about 50 attributes, almost all of them potentially changeable (surname, e-mail, nationality...). Some of them change only rarely (home address), some change regularly (e.g. every year - the school class a student attends). Attributes usually change individually, but it is not a rule - e.g. when the home address changes, all parts of it change at once (street, city, etc.). I don't expect many users editing the entity at the same time

which tells me several things which are in no way obvious from your original question:

  • your entities are too large. I expect a relational model designed with 3NF in mind to have smaller entities. Since you are considering a redesign of your data structures, you can probably resolve each of your current entities into several parent-child tables.

  • once you have split your current entities into smaller ones, it will most probably be sufficient to model validity at the entity level, not at the attribute level.

    For example, to keep track of addresses over time, you don't have to store the information about when the city, the street or the zip code changed individually. The same holds for the "contact information" of a person (a combination of email addresses and phone numbers).

    Other kinds of information probably don't need to be kept in a historic fashion; the parent entity of a person (with child entities address and contact information) may contain a unique ID for this person which models its identity and must not change over time.

  • your data model does not require different attributes per entity record. The attributes are the same for all records of one entity.

So in short: you do not need EAV, you only need a uniform approach to modeling history in your relational model for certain entities. Looking through older Q&As on this site and on Stack Overflow, I found these, which might be a good start:

Database design - maintaining versions of an object

https://stackoverflow.com/questions/3874199/how-to-store-historical-data

The result will be a model which allows you to query data exactly the way you asked for it.
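To illustrate the idea (purely a sketch with made-up table and column names, not a finished design), the address history of a person could then be modeled and queried like this:

-- identity of a person, never versioned
CREATE TABLE person (
    personId INT PRIMARY KEY
);

-- address history kept at the sub-entity level
CREATE TABLE personAddress (
    addressId INT AUTO_INCREMENT PRIMARY KEY,
    personId  INT NOT NULL,
    street    VARCHAR(100),
    city      VARCHAR(100),
    validFrom DATE NOT NULL,
    validTo   DATE NULL,               -- NULL = currently valid
    FOREIGN KEY (personId) REFERENCES person (personId)
);

-- address of each person as of a given date
SELECT p.personId, a.city, a.street
FROM person p
LEFT JOIN personAddress a
       ON a.personId = p.personId
      AND '2022-01-01' BETWEEN a.validFrom AND IFNULL(a.validTo, '9999-01-01');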

Finally, since it seems you have an existing system, how do you get from here to there? One approach may indeed be to create a new "classic" relational model in a separate database (or DB schema) based on the ideas above, and then fill the classic model from your current EAV-based system - once initially, and then dynamically by triggers which fire whenever a change to the EAV records happens. From what you wrote in the question, I guess that is what you already had in mind.
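A rough sketch of such a trigger (assuming a denormalized table personsFlat with one column per attribute; the attribute ids are the ones from your example, and the corresponding UPDATE/DELETE triggers plus validity handling are left out):

DELIMITER //
CREATE TRIGGER values_after_insert
AFTER INSERT ON `values`
FOR EACH ROW
BEGIN
    -- keep the flattened table in sync with the EAV source of truth
    IF NEW.attributeId = 1 THEN        -- 'homeCity'
        UPDATE personsFlat SET homeCity = NEW.value WHERE entityId = NEW.entityId;
    ELSEIF NEW.attributeId = 2 THEN    -- 'degree'
        UPDATE personsFlat SET degree = NEW.value WHERE entityId = NEW.entityId;
    END IF;
END//
DELIMITER ;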

At a later point in time, you may consider throwing your EAV model overboard and rewriting your application to use the "classic" relational model exclusively. But beware, I don't know if that is economically feasible for you - the effort of such big rewrites is often underestimated.

3
  • 1
    This is a very good point. I was considering tracking history at the main entity level (versions of the person as a whole) and at the attribute level (versions of the person's attributes). The best solution is to split the entity into subentities and track history at the subentity level. With the attribute-level solution, the problem is I have to join every attribute to the main table separately. With the main-entity-level solution, when the entity changes, the whole row with all columns has to be duplicated, which leads to data redundancy. With the subentity-level solution, both of these problems are solved. – Commented Dec 5, 2023 at 12:05
  • Fortunately, I currently don't use the EAV model. I use a relational model, just with limited history-tracking possibilities. Thus splitting into subentities is much easier for me than introducing an EAV model. Maybe just the UI design of subentity history tracking is a challenge. I will consider the final solution and choose the accepted answer then. :-) – Commented Dec 5, 2023 at 12:15
  • The issue here is that the user must understand the concept of subentities. If he wants to enter an e-mail valid in some period of time, he doesn't only care about historic changes of the e-mail, but about historic changes of the whole contacts subentity. E.g. if he wants to enter an e-mail valid from '2023-03-01', he creates a new subentity instance valid from '2023-03-01' containing this e-mail. However, there already is another subentity instance valid from '2023-04-01' which represents a changed telephone number. The user must understand that he has to change the e-mail in both instances. – Commented Dec 5, 2023 at 12:32
6

Is this a good approach or is it better not to use EAV model at all?

I think I'd be on strong ground in saying that most practitioners disfavour EAV, and find that its challenges outweigh its sole advantage, which is the ability to add or remove attributes dynamically without changing the explicit schema of the database.

I worked with an EAV database several years ago, apparently written by a postgraduate student whom the company had commissioned some years earlier. Not only was it rather more difficult than the norm to navigate and devise queries for, but its performance was extremely poor and sluggish for workloads and data volumes that were quite ordinary.

And in practice, although the database could handle attributes being added dynamically, the screens and paginated reports could not adapt dynamically. They all still relied on a kind of latent database schema fixed and encoded in the EAV data, and whenever the available attributes changed these peripheral artefacts had to be considered and reworked just as much as if the explicit schema of the database had changed.

So the costs were very real and the advantages were a complete mirage.

My advice would be to model your attributes in standard ways. If you need a history per attribute, then use a table per attribute and join.

If you need to add or remove attributes, then accept the need to alter the explicit database definition by adding or removing the associated tables.
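As a rough sketch of what that table-per-attribute approach could look like (all names here are illustrative, not a prescription):

-- one small history table per versioned attribute
CREATE TABLE personHomeCity (
    personId  INT NOT NULL,
    homeCity  VARCHAR(100),
    validFrom DATE NOT NULL,
    validTo   DATE NULL,               -- NULL = currently valid
    PRIMARY KEY (personId, validFrom)
);

-- the value as of today: one join per attribute
SELECT p.personId, hc.homeCity
FROM person p
LEFT JOIN personHomeCity hc
       ON hc.personId = p.personId
      AND CURDATE() BETWEEN hc.validFrom AND IFNULL(hc.validTo, '9999-01-01');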

9
  • 1
    I'd agree, EAV databases are usually a case of "if everything you have is a hammer, …". But now we have a really rich ecosystem of mature NoSQL databases (including JSON support in nearly all RDBMS), so there tend to be better ways of dealing with more dynamic data. And in a lot of cases, there actually is a reasonable schema.
    – amon
    Commented Dec 2, 2023 at 17:32
  • 5
    @amon, a JSON column is alright if you just need a dumping ground for residual attributes, but honestly, the need to define structure for data storage is one of the essential and irreducible problems in automatic data processing. Like I say in my answer, if you don't define that structure explicitly (and keep it maintained), you'll instead end up relying on a latent structure - which has all the same rigidities of the explicit structure, but none of the visibility. It just doesn't work like its proponents seem to think.
    – Steve
    Commented Dec 2, 2023 at 19:56
  • I agree I should use standard columns for most attributes and only certain attributes should track historic changes. However, it is not convenient to use the "table per attribute" solution with a separate table for every history-tracking attribute. Rather I would store all attribute values in the same values table. It is no problem to join them to the main table, as I showed in my answer. Of course, it is necessary to consider that the values table can store different data types - there can be a special column for each data type. – Commented Dec 5, 2023 at 12:38
  • 1
    @user14967413, as I say, the general experience of those who use databases commercially, is that the "convenience" of EAV is a mirage. Because EAV seems to make database schema design easier, and because that schema design tends to be amongst the first things that are designed in a database application, the severe pain that EAV causes always occurs further down the road in a different place than where its modest advantages appeared. If it is inconvenient to model your problem in standard ways, then that is either because it is a hard problem, or because there is something absurd about it.
    – Steve
    Commented Dec 5, 2023 at 13:15
  • 3
    @user14967413, if they aren't database programmers then they won't be competent in the essential skills of database programmers, such as storage schema design and maintenance. From my point of view, you've revealed that we are now indeed in the realm of the "absurd" I referred to earlier - you have grandiose ideas that EAV will mean you don't have to hire programmers anymore, but instead replace them with "trained" staff who prefer "no code", who will then maintain the latent schema encoded in the EAV attributes. Utter pie in the sky.
    – Steve
    Commented Dec 5, 2023 at 18:01
0

I realized that the EAV model can be transformed into a standard SQL table "on the fly" by creating a derived table. In the derived table I use a "brute force" join solution where every attribute from the EAV is loaded into a separate table column by a separate LEFT JOIN.

Here is an example using WITH for the derived table definition. I assume there is a static table personsMain which defines the list of persons.

WITH
    -- derived table definition
    persons AS (
        SELECT
            p.id,
            v1.value AS homeCity,
            v2.value AS degree
        FROM personsMain p
        LEFT JOIN `values` v1
               ON v1.entityId = p.id
              AND v1.attributeId = 1
              AND CURDATE() BETWEEN v1.validFrom AND IFNULL(v1.validTo, '9999-01-01')
        LEFT JOIN `values` v2
               ON v2.entityId = p.id
              AND v2.attributeId = 2
              AND CURDATE() BETWEEN v2.validFrom AND IFNULL(v2.validTo, '9999-01-01')
    )
-- any custom query using the derived table, e.g. count persons per home city
SELECT homeCity, COUNT(*) AS personCount
FROM persons
GROUP BY homeCity;

This solution is not ideal.

Pros

  • no need to create a table from the EAV data and maintain it - the table gets created on the fly
  • data loaded from the EAV is always up to date
  • it is possible to work with multiple such derived tables in one query (just define them in the WITH part)

Cons

  • each query is very large and contains repetitive code
  • this approach is slow when many attributes are loaded from the EAV - it is necessary to keep the majority of attributes in the main table (personsMain) and load from the EAV only those attributes that really change often
  • this approach is very slow when working with multiple derived tables and joining them (unless you restrict the number of rows by a filter condition)
4
  • 1
    "each query is very large and contains repetitive code" - define a VIEW!
    – Bergi
    Commented Dec 5, 2023 at 13:33
  • @Bergi Good idea, however I need to query EAV data valid as of an arbitrary date, i.e. I need to replace CURDATE() with something like 2023-01-01. A VIEW unfortunately doesn't accept arguments, so I really need to specify the literal query. – Commented Dec 5, 2023 at 13:43
  • 1
    Hm, you could create a view representing all time ranges, with a valid column, that you'd then filter on in each query, however the JOIN for that range intersection would be rather complicated. Or alternatively, you could create an SQL FUNCTION that takes the date as an argument. In either case, I'm not certain how well the optimiser can inline them in your query, you'd need to try it out.
    – Bergi
    Commented Dec 5, 2023 at 13:49
  • @Bergi I was also considering using a FUNCTION, however my RDBMS (MariaDB) doesn't support table-valued functions (functions returning a table). – Commented Dec 5, 2023 at 14:00
