I need a design approach for a feature store that accepts input data in the form of CSV, Excel, JSON, etc.
The backend stack we support is Java, Spring Boot, and Oracle DB.
Notes:
- Each file corresponds to a single type of object, and the objects don't have any relationships with the objects uploaded in other files. Basically, every object carries complete information about itself.
- We want to support subsequent uploads (CSV/Excel/JSON) for incremental data ingestion. For example: object_file.csv is first uploaded with 1,000 records, and after some time the user comes back with the next set of data (the next 1,000 records) under the same file name; that incremental data should be ingested as well (a rough sketch of the append step I have in mind follows this list).
- We have at least 100 such object types that we want to store.
- Most of my operations on the data will be reads and inserts, but sometimes data has to be updated as well (about 5% of the time).
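To make the incremental-ingestion requirement concrete, here is a minimal sketch of the append step I have in mind, assuming the rows have already been parsed out of the uploaded file. The class and method names are mine, and the column names would have to be validated/whitelisted before being concatenated into SQL, since they come from user files.

```java
import java.util.List;
import java.util.stream.Collectors;
import org.springframework.jdbc.core.JdbcTemplate;

// Sketch only: appends an already-parsed chunk of records to the object's existing table.
public class IncrementalIngestor {

    private final JdbcTemplate jdbcTemplate;

    public IncrementalIngestor(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    /** Appends the next set of records from a re-uploaded file to the object's table. */
    public void appendRows(String tableName, List<String> columns, List<Object[]> rows) {
        String columnList = String.join(", ", columns);
        String placeholders = columns.stream().map(c -> "?").collect(Collectors.joining(", "));
        String sql = "INSERT INTO " + tableName
                + " (" + columnList + ") VALUES (" + placeholders + ")";
        jdbcTemplate.batchUpdate(sql, rows);   // one JDBC batch per uploaded chunk
    }
}
```

The same appendRows call would serve both the first upload and every subsequent one, since later files are pure additions.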
My approach: I thought of creating a new table for each type of object and, for subsequent uploads, simply adding new rows to the existing table. When the number of attributes in a file increases, I would alter the table to add the new attributes/column headers.
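As a rough illustration of what I mean by creating the table on the fly: the sketch below builds the DDL from the file headers. Everything in it is an assumption on my part: the all-VARCHAR2 typing, the generated identity column (Oracle 12c+), and the sanitize() rule are placeholders, and the identity key is exactly the part I'm unsure about (see my concerns below).

```java
import java.util.List;
import java.util.stream.Collectors;
import org.springframework.jdbc.core.JdbcTemplate;

// Sketch only: builds the per-object table from the file headers on the first upload.
// All columns are VARCHAR2 because the incoming types are unknown, and the identity
// column is just one option for a surrogate key.
public class ObjectTableCreator {

    private final JdbcTemplate jdbcTemplate;

    public ObjectTableCreator(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    public void createTable(String objectName, List<String> headers) {
        String columns = headers.stream()
                .map(h -> sanitize(h) + " VARCHAR2(4000)")
                .collect(Collectors.joining(", "));
        String ddl = "CREATE TABLE " + sanitize(objectName)
                + " (ID NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY, " + columns + ")";
        jdbcTemplate.execute(ddl);
    }

    // Placeholder rule: headers come from user files, so restrict them to a safe
    // identifier charset before putting them into DDL.
    private String sanitize(String name) {
        return name.trim().toUpperCase().replaceAll("[^A-Z0-9_]", "_");
    }
}
```

I would call createTable only when the object's table doesn't already exist; otherwise the upload just appends rows.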
My concerns:
- Is it a good approach to create new tables on the fly from Java?
- How will I manage things when the attributes increase in subsequent uploads of the data? For example: the first CSV for an object had only 5 columns, but after some days the attributes/headers grew to 10. (Ideally I would have to alter the existing table to add the new columns; I've sketched how I pictured that below this list.)
- I don't know anything in advance about the data that comes in the CSV/JSON files, so how do I decide on primary keys / surrogate keys?
- If I don't create any primary key, is that going to hurt performance, given that each table can hold at least 100,000 - 150,000 records?
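For the second concern (attributes growing in later uploads), this is roughly how I pictured detecting and adding the missing columns. Again it's just a sketch with assumed names: the USER_TAB_COLUMNS lookup and the same placeholder sanitize() rule as above are my assumptions, not a settled design.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.springframework.jdbc.core.JdbcTemplate;

// Sketch only: compares the new file's headers with the table's current columns
// and issues ALTER TABLE for anything that is missing.
public class SchemaEvolver {

    private final JdbcTemplate jdbcTemplate;

    public SchemaEvolver(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    public void addNewColumns(String tableName, List<String> headers) {
        Set<String> existing = new HashSet<>(jdbcTemplate.queryForList(
                "SELECT column_name FROM user_tab_columns WHERE table_name = ?",
                String.class, tableName.toUpperCase()));

        for (String header : headers) {
            String column = sanitize(header);
            if (!existing.contains(column)) {
                // A new attribute appeared in a later upload: widen the table.
                jdbcTemplate.execute("ALTER TABLE " + tableName
                        + " ADD (" + column + " VARCHAR2(4000))");
            }
        }
    }

    private String sanitize(String name) {
        return name.trim().toUpperCase().replaceAll("[^A-Z0-9_]", "_");
    }
}
```

The ALTER only ever adds nullable columns, so rows from earlier uploads would simply have NULLs for the new attributes.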