I have an existing production Oracle Database. However, there are performance issues for certain kinds of operations, because of the volume of data or the complexity of the queries.

That's why I regularly export/dump each Oracle table to a CSV. These CSVs are then converted to Parquet files to allow very high-performance queries with Spark. However, my concern is about losing the strong consistency benefits.

Suppose two tables in Oracle like:

    data (id, value, fk_metadata_types_id)
    metadata_types (id, label)

As of now, I regularly export these two tables and convert them to Parquet files (each Oracle table has its own set of Parquet files) so they are ready for Spark queries.

The problem is consistency. There are two batches: one that dumps the data table to CSV (then Parquet), and another that dumps the metadata table to CSV.

So basically, it can happen that, at a given time, Spark reads the data table with a fk_metadata_types_id that doesn't yet exist in the corresponding metadata Parquet files.

How can I handle such consistency issues? The idea is to have performant queries with Spark, but also to guarantee that when the data is queried by Spark, it is always possible (strong consistency) to get the corresponding metadata_types (by a join, just like an Oracle join).

Thanks

  • Have you considered materialized views? – Commented Feb 2, 2020 at 18:17
  • I am also a little confused. If data has a foreign key, wouldn't the id always need to be in the metadata_types table? Not sure why there is a consistency problem? – Commented Feb 2, 2020 at 18:19
  • Thanks. If I look at Oracle, there is always a fk_metadata_types_id defined in the data table. The problem is that the regularly dumped CSV files corresponding to each table are produced asynchronously. At 10 o'clock the data table is dumped to CSV, and at 12 o'clock the metadata table is dumped to CSV. Even if both are dumped at the same time, it can happen that the data CSV contains metadata_types IDs that don't exist yet in the metadata CSV. So when I convert the CSVs to Parquet, the same eventual consistency problem remains. So my question is how to get strong consistency? – Klun, Commented Feb 2, 2020 at 20:21
  • By directly exporting the join between the data table and the metadata_types table in the CSV each time, instead of two different batches (one for data and another for metadata)? (But if I do that, the export is likely to take longer and use more resources, as the metadata labels and properties will be repeated each time; see the sketch below these comments.) Is that clearer? – Klun, Commented Feb 2, 2020 at 20:22
  • No, that (missing data in the foreign key table) is not possible unless you are deleting rows from the foreign key table after 10 o'clock. I think you might be missing something. – Commented Feb 2, 2020 at 22:12
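
A minimal sketch of the denormalized export suggested in the comments above, assuming the table and column names from the question (an illustration, not the asker's actual export job): every exported row carries its label, so the data and its metadata cannot drift apart, at the cost of repeating the label per row:

    -- Hypothetical single export that carries the metadata along with the data,
    -- so every exported row already contains its label.
    SELECT d.id,
           d.value,
           d.fk_metadata_types_id,
           m.label
    FROM   data d
    JOIN   metadata_types m
           ON m.id = d.fk_metadata_types_id;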

2 Answers

On the front end, most databases have a snapshot isolation level where you can run multiple commands against the same database state. This means all transactions that completed before yours remain visible, and all transactions begun after yours remain invisible. When running multiple exports under such a transaction, referential integrity should be preserved.
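
In Oracle specifically, one way to get this behaviour is a read-only transaction: every query inside it sees the database as of the moment the transaction started. A minimal sketch, assuming both exports can run from the same session:

    -- Both SELECTs see the same snapshot because they run inside one
    -- read-only transaction (transaction-level read consistency).
    SET TRANSACTION READ ONLY;

    SELECT * FROM data;            -- spooled/exported to CSV
    SELECT * FROM metadata_types;  -- sees exactly the same point in time

    COMMIT;                        -- ends the read-only transaction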

On the back end, in ETL speak, this problem is known as a late arriving dimension. There are multiple strategies, such as holding back the incomplete records or using temporary values. For the latter, the label would then read, for example, future_label_labelid and would be updated in the next run.
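
For illustration, here is the "temporary value" strategy written as a join (valid in both Oracle and Spark SQL), using the table and column names from the question; rows whose metadata has not arrived yet get a placeholder label instead of being dropped:

    -- Rows with a missing dimension row get a synthetic label such as
    -- future_label_<id>, to be replaced once the metadata export catches up.
    SELECT d.id,
           d.value,
           d.fk_metadata_types_id,
           COALESCE(m.label, 'future_label_' || d.fk_metadata_types_id) AS label
    FROM   data d
    LEFT JOIN metadata_types m
           ON m.id = d.fk_metadata_types_id;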

This can be solved by using the AS OF SCN syntax for each SELECT: read the current SCN once, then query every table as of that same SCN:

    SELECT current_scn FROM v$database;
    -- returns e.g. 12345

    SELECT * FROM data AS OF SCN 12345;
    SELECT * FROM metadata_types AS OF SCN 12345;
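
To tie this into the existing CSV exports, one option (sketched here with SQL*Plus, assuming a 12.2+ client for SET MARKUP CSV and undo/flashback retention that covers the export window) is to capture the SCN once and reuse it for every table, so all dumps reflect the same point in time:

    -- Capture one SCN, then export both tables as of that SCN.
    SET MARKUP CSV ON
    COLUMN scn NEW_VALUE export_scn
    SELECT current_scn AS scn FROM v$database;

    SPOOL data.csv
    SELECT * FROM data AS OF SCN &export_scn;
    SPOOL OFF

    SPOOL metadata_types.csv
    SELECT * FROM metadata_types AS OF SCN &export_scn;
    SPOOL OFF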
