-1

I have a time-series database where I put all my data in it in timely-ordered fashion. Unfortunately, the database doesn't have any realtime listener capability built into it, and I need to make an app that responds to any changes happened to my dataset.

Using that as the context, I came up with idea to regularly query all data within certain time range, and use the query result as entry point for the app that supposed to responds to the data changes.

Do you know any algorithm or technique that can help me doing this efficiently? I tried implement it before, but it seems to be ineffective, since data missing on a query return won't be processed, and I also tried to overlap the time frame on each interval, but that also results in inefficiency, since multiple data get processed multiple time, and the result of the process duplicates.

6
  • 1
    Have you tested the overhead of adding one column to the database to indicate the state of processing with three possible discrete values : (0) new data waiting to be selected (1) selected for processing and not yet completed (2) completed processing ? The benefit of that is to let the database handle the at-most-once aspect of it. (Generally, in order to answer your question, we'll need some performance numbers from various ideas and experiments, but we can't do these without your help.)
    – rwong
    CommentedMar 10, 2024 at 7:29
  • 2
    Performance should not be your first concern. What we need from you is a better description of your system's architecture: first and foremost, will there be only one listener, or will there be multiple ones? Second: what kind of database? A relational one with transactions? Third: Is there only one writer process which guarantees records to insert in a strictly increasing order? If it is the latter, I fail to see why your initial approach misses some records - you should make a scetch in pseudo code to explain the problem in a more detailed fashion.
    – Doc Brown
    CommentedMar 10, 2024 at 7:59
  • Thanks for the replies, @DocBrown yes there will be only one listener, but I'm not sure about the database being a transactional one, it definitely not relational, it's much like Elasticsearch, and I don't really understand why we're using such database. Unfortunately, much like Elasticsearch, there aren't anything that guarantees records to insert in a strictly order, everything is just push and forget. I'm trying to push for idea adding another Database to the system, something that are proper for the requirement, but I got hit back by limited resources and data duplication issue.CommentedMar 10, 2024 at 11:03
  • 1
    @RikiSyahputra: it would definitely help to know precisely about the capabilities or restrictions of your database system in regards to unique ID generation, unique constraints and transactions., and if your question is about how to work around those restrictions. So if my proposed answer does not work for you, please explain why precisely. I would be also interested why you accepted an answer which looks overcomplicated to me.
    – Doc Brown
    CommentedMar 10, 2024 at 11:09
  • 1
    @rwong: the OP wrote "like Elasticsearch" - which does still not tell us clearly tell us what they are using, and which properties the db has Moreover, it seems they are are refusing to add any more details to the question. If I hadn't prematurely retracted my close vote ("needs details or clarity"), I would cast it now.
    – Doc Brown
    CommentedMar 10, 2024 at 20:10

2 Answers 2

1

Poll for data that changed since the latest result of previous query.

Unique monotonous timestamps

Assume that database system you use guarantees that new entries have timestamps later than existing ones. Then the polling just needs to remember what the last processed entry was:

  • each time you receive data, remember the last timestamp processed
  • use that as an exclusive lower bound for the next query

This way, no overlaps or double processing can happen.

Non-unique monotonous timestamps

Time series usually do not guarantee unique timestamps for entries. Assume that the database system guarantees that new entries have timestamps later or equal to existing ones.

This leads to events that happen at the same time as last processed event to be left unprocessed. If lower bound is inclusive, double processing will occur for the last event processed in a previous interval. To avoid this, keep track of events with the latest processed timestamp and filter them out on the next iteration:

  • when data is received, remember all events with the latest timestamp processed
  • use the latest timestamp processed as an inclusive lower bound for the next query
  • filter out remembered events

The key takeaways:

  • do not use wall clock to compute query interval, most DBMS do not provide strong guarantees about it
  • not every DBMS provides monotonous timestamps
  • external event source can't be the source of timestamps, as it does not have DB's monotonous properties
1
  • Thanks, I was convinced to use fixed interval for the app, using the latest timestamp inclusively seems to be the better idea. I might also need to have local cache to filter out processed entries.CommentedMar 10, 2024 at 11:12
1

In case you are free to extend the data schema, forget about the timestamps.

You only want to keep track whether data was processed or not. When the records were created is not important for this use case.

Introduce an additional integer key column with a strictly ordered, unique ID generated by the DB in a transactional manner, and use that as the indicator which records were processed and which not. (Any system which deserves the name "database" offers a mechanic for this.)

Each "listener" (polling process) will need only to keep track of a single ID (the latest processed ID), which is increased after each sucessful processing of a batch of records, so it should be trivial to find out if there are records which were added afterwards (and which ones).

    Start asking to get answers

    Find the answer to your question by asking.

    Ask question

    Explore related questions

    See similar questions with these tags.