
I'm trying to understand what general mechanisms and/or concepts are available in SQL databases to synchronize a local database with a remote one. Here are my inputs and requirements.

Multiple clients have a full copy of the database on their machine (SQLite). Typically, at the end of every day, each user would synchronize with the remote database (MySQL). Both local and remote have the same schema. All primary keys are GUIDs.

Synchronizing would involve first pushing up modified/newly created records, then downloading any records that the client does not have or that have been modified. (Re-downloading the whole database is not an option, as the database is several gigabytes.)

Identifying and pushing records is straightforward. I'm a little confused about how to correctly pull modified or new records from the server that the client doesn't have yet.

I am thinking about tracking a UTC epoch-millisecond timestamp in each record and comparing it when syncing, but I envision scenarios where records could be missed: different clients can have different latencies and clock offsets, which may affect the timestamps.

What are some approaches to tracking and pulling down only the changed records?

Also, during a push, what type of locking or transactions should be used? I'm inclined to wrap all of the inserts from the whole push in one transaction.

    1 Answer

    In terms of something to track changed records, in SQL Server there is the 'rowversion' type that increments a global value for each row that is inserted or updated, so it is very easy to query which rows in a table contain new or updated data, simply by storing the last seen rowversion value and querying any rows of a higher value.

    I understand this functionality isn't present in MySQL. Perhaps the most straightforward way of tracking new changes then, is to apply a trigger to each base table on the server that you are attempting to synchronise, which inserts the primary key of each inserted or updated record in the base table into a secondary audit table, and have an auto-increment primary key on the audit table itself.
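A minimal sketch of the trigger-plus-audit-table pattern described above. The table and column names here are illustrative, and SQLite syntax is used so the snippet is self-contained (a MySQL trigger would be written with `CREATE TRIGGER ... FOR EACH ROW` in its own dialect, but the structure is the same):

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (
    id TEXT PRIMARY KEY,                    -- GUID, per the question
    payload TEXT
);
CREATE TABLE orders_audit (
    seq INTEGER PRIMARY KEY AUTOINCREMENT,  -- monotonically increasing key
    order_id TEXT NOT NULL                  -- link back to the base row
);
-- Record the primary key of every inserted or updated base row.
CREATE TRIGGER orders_ai AFTER INSERT ON orders
BEGIN
    INSERT INTO orders_audit (order_id) VALUES (NEW.id);
END;
CREATE TRIGGER orders_au AFTER UPDATE ON orders
BEGIN
    INSERT INTO orders_audit (order_id) VALUES (NEW.id);
END;
""")

row_id = str(uuid.uuid4())
conn.execute("INSERT INTO orders (id, payload) VALUES (?, ?)", (row_id, "first"))
conn.execute("UPDATE orders SET payload = 'second' WHERE id = ?", (row_id,))
conn.commit()
```

After the insert and the update, the audit table contains two rows (seq 1 and 2) both pointing at the same GUID, which is exactly what a client needs to find the changed base rows.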

    Clients then query the audit table for the rows inserted since the last auto-increment key they saw, and then follow the links in the audit table to retrieve the relevant rows from the base table. And the audit table can be periodically pruned once you are satisfied that all clients are up to date (book-keeping which may entail yet another table on the server, consisting of one row per client, and maintained by each client, to track where they are last up to).
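The client-side pull described above can be sketched as follows, including the per-client book-keeping row that records where each client is up to. Again the names (`orders`, `orders_audit`, `client_cursor`) are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id TEXT PRIMARY KEY, payload TEXT);
CREATE TABLE orders_audit (seq INTEGER PRIMARY KEY AUTOINCREMENT, order_id TEXT NOT NULL);
CREATE TABLE client_cursor (client TEXT PRIMARY KEY, last_seq INTEGER NOT NULL);
""")

# Seed some server-side state: three rows, all audited.
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("a", "one"), ("b", "two"), ("c", "three")])
conn.executemany("INSERT INTO orders_audit (order_id) VALUES (?)",
                 [("a",), ("b",), ("c",)])
conn.execute("INSERT INTO client_cursor VALUES ('client-1', 1)")  # already saw seq 1

def pull_changes(conn, client):
    """Fetch base rows audited after this client's last-seen key, then advance the cursor."""
    (last,) = conn.execute(
        "SELECT last_seq FROM client_cursor WHERE client = ?", (client,)).fetchone()
    rows = conn.execute("""
        SELECT a.seq, o.id, o.payload
        FROM orders_audit a JOIN orders o ON o.id = a.order_id
        WHERE a.seq > ?
        ORDER BY a.seq
    """, (last,)).fetchall()
    if rows:
        conn.execute("UPDATE client_cursor SET last_seq = ? WHERE client = ?",
                     (rows[-1][0], client))
    return rows

changed = pull_changes(conn, "client-1")   # rows for 'b' and 'c' only
```

Pruning the audit table then reduces to deleting rows with `seq <= MIN(last_seq)` across all clients.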

    A similar approach can be used in the opposite direction from client to server. An audit table is prepared locally by each client, and these rows are periodically pushed to the server, then the local audit table cleared.

    There is a caveat with this approach, however. Access to the audit table on the server (whether for reading or writing) must be serialised, and the lock on the table must be taken before the stage at which any auto-increment key is reserved. The danger otherwise is of two concurrent transactions completing out-of-order (in terms of the sequence determined by the auto-increment keys they have reserved) on the audit table, with a synchronisation process occurring between the two completions. The sync will then retrieve the row with the higher key and store that value as the last seen value; the earlier row will subsequently become visible but will never be synchronised (because its key falls earlier than the highest last-seen value already stored by the synchronisation which occurred before the row became visible).

    This is probably not a common occurrence in practice, because of the need for close timing, but the risk is present unless the synchronisation process is prevented from running whenever a transaction is in flight which has reserved a key value on the audit table but has not yet taken the locks necessary to insert that value. You therefore have to take the lock manually first, before executing any step which will reserve a new auto-increment key.
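A sketch of the safe ordering for a push, with the lock taken before any auto-increment key is reserved. SQLite's `BEGIN IMMEDIATE` (which seizes the database write lock up front) stands in here for what would be `LOCK TABLES orders WRITE, orders_audit WRITE` or a `SELECT ... FOR UPDATE` on a dedicated mutex row in MySQL; the table names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # manual transaction control
conn.executescript("""
CREATE TABLE orders (id TEXT PRIMARY KEY, payload TEXT);
CREATE TABLE orders_audit (seq INTEGER PRIMARY KEY AUTOINCREMENT, order_id TEXT NOT NULL);
""")

def push(conn, records):
    # Take the write lock BEFORE any statement that reserves an auto-increment
    # key, so audit rows become visible strictly in seq order. Only then are
    # the base-table upserts and audit inserts executed.
    conn.execute("BEGIN IMMEDIATE")
    try:
        for rec_id, payload in records:
            conn.execute(
                "INSERT INTO orders (id, payload) VALUES (?, ?) "
                "ON CONFLICT(id) DO UPDATE SET payload = excluded.payload",
                (rec_id, payload))
            conn.execute("INSERT INTO orders_audit (order_id) VALUES (?)", (rec_id,))
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise

push(conn, [("a", "one"), ("b", "two")])
```

This also answers the question of push transactions in the original post: the whole push is one transaction, and it fails or succeeds atomically.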

    The performance implications of this serialisation may be completely acceptable on a server that is only ever modestly loaded (at least in terms of activity on the particular tables being audited for synchronisation), but this sort of locking could cause a concern on a heavily loaded table with a lot of concurrent access.

    Using timestamps to differentiate between new and existing records poses much the same problems, except that there is no longer any guarantee that the timestamps are unique. In principle (even once the locking concerns are dealt with) you must always re-query any rows that match the last-seen timestamp exactly (potentially retrieving rows that were already applied to the client during the last synchronisation), rather than only those that are later.
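The consequence of non-unique timestamps is that the pull query must use `>=` rather than `>`, and the client must de-duplicate by primary key. A small illustration (table and column names hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id TEXT PRIMARY KEY, payload TEXT, modified_ms INTEGER NOT NULL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    ("a", "one",   1000),
    ("b", "two",   2000),   # two independent rows sharing the same timestamp...
    ("c", "three", 2000),
])

def pull_since(conn, last_seen_ms, already_applied):
    """Timestamps are not unique, so re-query rows AT the last-seen value
    (>=, not >) and drop the ones this client already holds, by primary key."""
    rows = conn.execute(
        "SELECT id, payload, modified_ms FROM orders "
        "WHERE modified_ms >= ? ORDER BY modified_ms",
        (last_seen_ms,)).fetchall()
    return [r for r in rows if r[0] not in already_applied]

# The client last synced at t=2000 and had only seen row 'b' at that instant;
# a '>' comparison would silently skip row 'c' forever.
fresh = pull_since(conn, 2000, already_applied={"b"})
```

This is exactly the row-loss scenario the question anticipated, made concrete.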

    And depending on how many rows are in the base table, you may want to index the timestamps for faster identification of the relevant rows, but then suffer a performance penalty on the maintenance of the index. Whereas the separate audit table stores links only to the relevant rows in the base table which are new, and also stores them in the desired ascending order of time.

    What I'd be more concerned about in your scenario is how you resolve the (presumed) possibility of conflicts where multiple clients update a local copy of the same record, then attempt to push these changes to the server. Which then will win? And if a decision is automatically made, will either client be notified that a conflict occurred, so as to verify that the decision was the correct one?
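One common way to at least detect such conflicts (rather than silently letting the last writer win) is optimistic concurrency with a version column: the push only succeeds if the server row is still at the version the client last pulled. A hedged sketch, with hypothetical names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id TEXT PRIMARY KEY, payload TEXT, version INTEGER NOT NULL)")
conn.execute("INSERT INTO orders VALUES ('a', 'server copy', 2)")

def push_with_conflict_check(conn, rec_id, payload, base_version):
    """Update only if the server row is still at the version the client
    last pulled; a rowcount of 0 means another client got there first,
    and the caller can surface the conflict instead of overwriting."""
    cur = conn.execute(
        "UPDATE orders SET payload = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (payload, rec_id, base_version))
    return cur.rowcount == 1

# This client pulled version 1, but the server has since moved to version 2,
# so the push is rejected and the conflict can be reported to the user.
ok = push_with_conflict_check(conn, "a", "client copy", base_version=1)
```

How the conflict is then resolved (server wins, client wins, merge, ask the user) is a policy decision, but at least neither side loses an update without anyone knowing.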

    • Thanks for the detailed response. I've been mulling this over. I'd rather not do housekeeping on another table if I can avoid it. I've thought about tracking changes/record versions locally with a GUID, but I think time-stamping records would essentially accomplish this. If I query the server for a timestamp and then insert that into all pushed records, that could rule out any slight time differences at the local machines. Also, the local machine initiating a sync would be in a serialized transaction for the whole sync. If another user tries to sync during that transaction, are they refused or do they wait?
      – GisMofx
      Commented Feb 11, 2021 at 23:35
    • @GisMofx, as I say you can use timestamps, but then additional attention needs to be paid to how they are generated and used - in particular, the potential for multiple rows inserted independently of each other, to have the same timestamp. Moreover, without the intervention of a trigger somewhere in the solution, which is where the required locking logic can be placed, you'd have to manually serialise access to the base table, and to the logic which generates the timestamps, upon every mutation of the base table (rather than having it handled automatically by the trigger). (1/2)
      – Steve
      Commented Feb 12, 2021 at 11:00
    • On serialisation, to be clear, the kind of serialisation required here is not the kind offered by any mode of the transaction control system. For auto-increment values to become visible only in ascending order, is not something guaranteed by default (because it permits far less concurrency - in fact, zero concurrency), which is why the table locks under the terms of your requirements must be seized ahead of time to enforce strictly sequential access to the auto-increment generator (the operation of which is not covered by the transaction control system). (2/2)
      – Steve
      Commented Feb 12, 2021 at 11:10
    • I'm picking this project back up... Thinking about the audit table. I am trying to wrap my head around the scenario when two users synchronize around the same time. User 1 starts sync/pushing records up. Shortly after, user 2 starts sync/pushing records. For example, if user 2 completes pushing before user 1 and then starts to pull down records while 1 is still pushing, how do we ensure user 2 can get the records from 1. Would a second sync of user 2 pull down user 1's records?
      – GisMofx
      Commented May 27, 2024 at 14:37
    • @GisMofx, exactly what your algorithm does and doesn't do, would depend on exactly how you wrote it. It sounds like you're embarking on something that would be very complicated and operates under no guarantees of systemic consistency at any regular interval.
      – Steve
      Commented May 27, 2024 at 18:44
