I need a design approach for a feature store that accepts input data in the form of CSV, Excel, JSON, etc.

The backend stack we support is Java, Spring Boot, and Oracle DB.

Notes:

  1. Each file corresponds to a single type of object, and the objects have no relationship with the objects that get uploaded as part of other files. Basically, every object carries complete information about itself.
  2. We want to support subsequent uploads (CSV/Excel/JSON) for incremental data ingestion. For example: object_file.csv is first uploaded with 1000 records, and after some time the user comes with the next set of data (the next 1000 records) under the same file name; that incremental data should be ingested as well.
  3. We have at least 100 such object types that we want to store.
  4. Most of my operations on the data will be reads and inserts, but sometimes data has to be updated as well (about 5% of the time).

My approach: I thought of creating a new table for each type of object and, for subsequent uploads, just adding new entries to the existing table. When the number of attributes in a file increases, I would alter the table to add the new attributes/column headers.
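Roughly what I had in mind, as a simplified sketch (the table and column names come straight from the uploaded file's header row, everything is treated as VARCHAR2 for now, and nothing is validated yet, which is part of my concern):

    import org.springframework.jdbc.core.JdbcTemplate;

    import java.util.List;

    public class DynamicTableLoader {

        private final JdbcTemplate jdbc;

        public DynamicTableLoader(JdbcTemplate jdbc) {
            this.jdbc = jdbc;
        }

        // First upload of a new object type: create a table with one column per header
        // (header names are used as column names without any sanitising here).
        public void createTableFor(String objectType, List<String> headers) {
            StringBuilder ddl = new StringBuilder("CREATE TABLE " + objectType + " (");
            for (int i = 0; i < headers.size(); i++) {
                if (i > 0) ddl.append(", ");
                ddl.append(headers.get(i)).append(" VARCHAR2(4000)");
            }
            jdbc.execute(ddl.append(")").toString());
        }

        // Subsequent upload brings a new header: add it as a new column.
        public void addColumn(String objectType, String newHeader) {
            jdbc.execute("ALTER TABLE " + objectType + " ADD (" + newHeader + " VARCHAR2(4000))");
        }
    }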

My concerns:

  1. Is it a good approach to create new tables on the fly from Java?
  2. How will I manage when the attributes increase in subsequent uploads of the data? For example: the first time, the CSV for an object had only 5 columns, but after some days the attributes/headers increased in number to 10. (Ideally I would have to alter the existing table to add the new attributes.)
  3. I don't have any idea about the data that comes in the CSV/JSON file, so how do I decide on PKs / surrogate keys?
  4. If I don't create any PK, is it going to hurt performance, given that each table can have at least 100,000 - 150,000 records in it?

    1 Answer

    Is it a good approach to create new tables on the fly from Java?

    It is an approach. To evaluate whether it is "good" or "bad", one needs to understand its consequences:

    • your program requires full "CREATE / ALTER TABLE" access rights on the schema, which means it loses some of the safety mechanisms a database could provide

    • there is a certain chance this design will make it harder to write programs for further processing. Lots of frameworks and ORMs assume a fixed DB schema for each version of the program; by allowing the schema to change dynamically, you rule those frameworks out.

    Alternatively, you may consider using Oracle's feature of storing JSON data directly, without a predefined schema. That way, you could

    • use a parent table File and a child table Record

    • make File records contain some meta information about the original file name and original file format

    • have the Record table contain only three columns: FileID (a foreign key), a RecordID key string, and a VARCHAR2 column Data for the JSON record. The primary key is the combined key (FileID, RecordID).

    • split the original JSON file into individual records and place their content into Data during import

    • convert CSV and Excel files into JSON records as well when the data is imported (see the import sketch after this list)
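    A minimal sketch of that import flow with Spring's JdbcTemplate. The table names UPLOADED_FILE and FILE_RECORD stand in for the File and Record tables described above (FILE itself is a reserved word in Oracle); the file ID and the CSV/Excel-to-JSON conversion are assumed to happen upstream, and all identifiers are illustrative:

        import org.springframework.jdbc.core.JdbcTemplate;

        import java.util.List;

        public class FileImporter {

            // One record of the uploaded file, already converted to a JSON string.
            public record JsonRecord(String recordId, String json) {}

            private final JdbcTemplate jdbc;

            public FileImporter(JdbcTemplate jdbc) {
                this.jdbc = jdbc;
            }

            public void importFile(long fileId, String fileName, String fileFormat, List<JsonRecord> records) {
                // Parent row: metadata about the original file name and format.
                jdbc.update("INSERT INTO uploaded_file (file_id, file_name, file_format) VALUES (?, ?, ?)",
                        fileId, fileName, fileFormat);

                // Child rows: one row per record, JSON payload in the Data column.
                for (JsonRecord rec : records) {
                    jdbc.update("INSERT INTO file_record (file_id, record_id, data) VALUES (?, ?, ?)",
                            fileId, rec.recordId(), rec.json());
                }
            }
        }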

    How will I manage when the attributes increase in subsequent uploads of the data?

    By using a JSON column, this becomes trivial. The different records don't need to be structured identically; each record bears the full metadata about its attributes. You just have to make sure any process which queries the data afterwards does not expect every record of the same file to be structured identically.
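    For illustration, reading one attribute back out could look like the sketch below (assuming an Oracle version with JSON_VALUE support, 12c or later, and the FILE_RECORD table from the earlier sketch; the attribute name customer_name is purely hypothetical). Records whose JSON does not contain the attribute simply yield NULL instead of breaking the query:

        import org.springframework.jdbc.core.JdbcTemplate;

        import java.util.List;

        public class RecordQueries {

            private final JdbcTemplate jdbc;

            public RecordQueries(JdbcTemplate jdbc) {
                this.jdbc = jdbc;
            }

            // Extract one attribute (here the hypothetical "customer_name") from the
            // JSON payload of every record of a file. Missing attributes yield NULL.
            public List<String> customerNames(long fileId) {
                return jdbc.queryForList(
                        "SELECT JSON_VALUE(r.data, '$.customer_name') " +
                        "FROM file_record r WHERE r.file_id = ?",
                        String.class, fileId);
            }
        }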

    I don't have any idea about the data that comes in the CSV/JSON file, so how do I decide on PKs / surrogate keys?

    Well, you wrote you want to allow updates of those records. This is only possible if your process has constraints in place which allow you to identify new as well as existing records and distinguish the two by some kind of key value. Use these constraints to decide what to put into RecordID. If you can't, forget about the possibility of updates and put into RecordID whatever you like, maybe simply an ordering number or a GUID.
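    A rough sketch of both options against the FILE_RECORD table from the earlier sketch (everything here is illustrative): when a stable RecordID can be derived, an incremental upload becomes an insert-or-update on (FileID, RecordID); otherwise a generated surrogate value takes its place.

        import org.springframework.jdbc.core.JdbcTemplate;

        import java.util.UUID;

        public class RecordUpserter {

            private final JdbcTemplate jdbc;

            public RecordUpserter(JdbcTemplate jdbc) {
                this.jdbc = jdbc;
            }

            // Works only if the import can derive a stable recordId from the incoming data:
            // existing records are updated, new ones inserted.
            public void upsert(long fileId, String recordId, String json) {
                jdbc.update(
                        "MERGE INTO file_record r " +
                        "USING (SELECT ? AS file_id, ? AS record_id, ? AS data FROM dual) src " +
                        "ON (r.file_id = src.file_id AND r.record_id = src.record_id) " +
                        "WHEN MATCHED THEN UPDATE SET r.data = src.data " +
                        "WHEN NOT MATCHED THEN INSERT (file_id, record_id, data) " +
                        "VALUES (src.file_id, src.record_id, src.data)",
                        fileId, recordId, json);
            }

            // Fallback when no natural key exists: a GUID, at the price of not being
            // able to recognise the same record again in a later upload.
            public void insertWithSurrogateKey(long fileId, String json) {
                jdbc.update("INSERT INTO file_record (file_id, record_id, data) VALUES (?, ?, ?)",
                        fileId, UUID.randomUUID().toString(), json);
            }
        }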

    Let me finally add that you can make more informed design decisions once you start analysing what you are actually going to do with that data. Just storing the data in a database is rarely an end in itself. There are usually use cases which process the data afterwards, and the design you pick should focus on supporting those use cases.
