I have an existing production Oracle Database. However, some kinds of operations suffer from performance issues because of the volume of data or the complexity of the queries.
That's why I regularly export/dump each Oracle table to a CSV file. These CSVs are then converted to Parquet files in order to allow very high performance queries with Spark. However, my concern is about losing the strong consistency benefits.
Suppose two tables in Oracle like:

data (id, value, fk_metadata_types_id)
metadata_types (id, label)
As of now, I regularly export these two tables and convert them to Parquet files (each Oracle table has its own set of Parquet files) so that they are ready for Spark queries.
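The conversion step is essentially a small Spark job, roughly like the sketch below (paths are illustrative and the schema is simply inferred from the CSV headers here, the real job is more detailed):

```python
# Minimal sketch of the CSV -> Parquet conversion step.
# Paths are illustrative; each Oracle table ends up with its own set of Parquet files.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv_to_parquet").getOrCreate()

for table in ["data", "metadata_types"]:
    df = (
        spark.read
             .option("header", "true")
             .option("inferSchema", "true")
             .csv(f"/exports/csv/{table}")
    )
    df.write.mode("overwrite").parquet(f"/exports/parquet/{table}")
```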
The problem is consistency. There are two batch jobs: one that dumps the data tables to CSV (then Parquet), and another that does the same for the metadata tables.
So basically, it can happen that at a given time, Spark reads the data table and finds fk_metadata_types_id values that don't exist yet in the corresponding metadata Parquet tables.
How can I handle such consistency issues? The goal is to have performant queries with Spark, but also to guarantee that whenever the data is queried by Spark, it is always possible (strong consistency) to retrieve the corresponding metadata_types row (via a join, just like an Oracle join in the end).
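For reference, this is the kind of Spark join that should always resolve (paths are illustrative):

```python
# The Spark-side join that should always resolve, like the Oracle join does.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data_with_metadata").getOrCreate()

data = spark.read.parquet("/exports/parquet/data")
metadata_types = spark.read.parquet("/exports/parquet/metadata_types")

joined = (
    data.join(metadata_types, data["fk_metadata_types_id"] == metadata_types["id"])
        .select(data["id"], data["value"], metadata_types["label"])
)

# When the two exports are out of sync, this inner join silently drops the rows
# whose fk_metadata_types_id has no match yet; a left join would return nulls instead.
```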
Thanks