
I am developing an application in Java to parse and upload records from a CSV to an online database, via a REST API.

While I know for sure that there are no duplicate records in each CSV file, I cannot be sure that each CSV file has only been processed once (*see clarification below), so I need to check for duplicates before inserting.

[CLARIFICATION] I cannot implement a solution by checking that each CSV file has only been processed once. The CSV files contain bank transaction records downloaded from a bank. Therefore I know that each individual CSV file does not contain duplicates. However, multiple CSV files could be downloaded for the same date range, or for overlapping date ranges, etc. - so I need to check for duplicates at the transaction level rather than the file level.

Unfortunately I have no control over the back-end database, and I can only use the methods available via the API. This means the usual solutions using SQL (e.g. this question) are not suitable.

Methods available from API (http://help.moneytrackin.com/index.php/REST_API):

  • listTransactions

  • editTransaction

  • insertTransaction

Methods available but probably not relevant:

  • listProjects

  • listWriteProjects

  • getBalance

  • getTags

  • newProject

  • deleteProject

  • listTagTransactions

  • deleteTransaction

  • listCurrencies

  • userData

It is not a huge database: only four columns, and a few thousand records.

It seems like my only option is to iterate over each record to be inserted and compare it to each record from the database:

    get ListOfRecordsInDb from database using listRecords()
    store in HashMap, local database or similar data structure??
    for each record to be inserted:
        iterate over ListOfRecordsInDb, checking none of them match the record to be inserted
        if no match found, insert record

This seems very inefficient. Are there any other options? If not, what is the most efficient way to compare thousands of records, using Java?


Answers to comments/questions:

What happens if you call insertTransaction with a transaction that already exists? Does it duplicate it or does it fail?

The transaction is successfully inserted as a duplicate

Does the CSV file have an "id" column?

No. The available columns are Date, Description, Amount and Balance. The combination of these makes each record unique, so I could potentially create an ID based on these.
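
For example, something along these lines is what I have in mind (just a sketch - the variable names are illustrative, not from my actual code):

    // Sketch: derive a unique key from the four CSV columns.
    // Assumes the values have already been read from the CSV row as strings.
    String recordId = String.join("|", date, description, amount, balance);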

Does listRecords() allow pagination, or can it only return all of the records?

It can only return all of the records, in XML format.

Comments:

  • What are the rest of the methods available ("and few others")? (Jun 17, 2015 at 12:38)
  • What happens if you call insertTransaction with a transaction that already exists? Does it duplicate it or does it fail? (Jun 17, 2015 at 13:05)
  • Does the CSV file have an "id" column? – paj28 (Jun 17, 2015 at 15:46)
  • Does listRecords() allow pagination, or can it only return all of the records? (Jun 17, 2015 at 19:18)
  • See updated question (Jun 19, 2015 at 10:04)

3 Answers


I cannot be sure that each CSV file has only been processed once...

You might want to start by tackling this part of the question. If I am reading this right, the crux of your problem isn't individual duplicate transactions (since you mentioned "I know for sure that there are no duplicate records in each CSV file"), but rather preventing duplicate processing per file.

Hence, you can consider adding some kind of state logic to your Java application that knows whether a file has been processed, by computing and storing its checksum, e.g. its MD5 hash. Once you have a matching checksum, you know there's a good chance the file has been processed before. You can then perform further verification, such as inspecting the number of lines or other unique identifiers of each file.
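
As a rough sketch (the class and method names here are mine, purely for illustration), computing such a checksum in Java could look like this:

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.security.MessageDigest;

    public class FileChecksum {

        // Returns the MD5 digest of the file contents as a lowercase hex string,
        // which can be stored and compared against previously processed files.
        public static String md5Of(String csvPath) throws Exception {
            byte[] bytes = Files.readAllBytes(Paths.get(csvPath));
            byte[] digest = MessageDigest.getInstance("MD5").digest(bytes);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        }
    }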

Further extending this idea, if the same transaction can appear across different CSV files, then your only other option, besides updating the database schema to handle duplicate records properly, is to store all the processed transactions locally within your Java application. If there can be multiple instances of your application (either on the same computer, or across a network), then you'll either need yet another centralized database to handle this, or some distributed data grid... by then, the better option is still to go back to the drawing board and improve your existing database schema.

edit

To flip things around, the other consideration to look into, if changing the database schema to handle duplicates nicely is entirely don't-even-think-about-it impossible, is to evaluate how much data your Java application will need to process at any given time, and how fast the connection is between the database and your application.

On the lower end, say your application is processing only 10 records per file, averaging one file an hour. The network connection is very good, say almost as good as accessing a locally-hosted database. In this case, I don't think there's much of a performance impact from having to query all the records.

On the extreme end, your application is expected to read thousand-line-long transaction files every 10 seconds, and the network connection is extremely bad, say taking a minute to query all the records. In this case, you are more concerned with processing the files in a speedy manner, and this is where you can probably argue for modifying the database schema. :)

So, assuming all is fine in the lower-end case, what would be an efficient way of comparing a relatively large data set with a smaller input set for duplicates? I'll suggest marshalling the XML payload you get into a HashSet. Also, I hope you have a Transaction domain class that has properly-implemented hashCode() and equals() methods. A Java 8 potential solution would then be:

    // assuming your database records are marshalled into currentSet
    inputSet.stream()
            .filter(v -> !currentSet.contains(v))
            .forEach(v -> { /* these are the new records to send to the database */ });
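
For reference, a minimal sketch of such a Transaction domain class, assuming the four columns mentioned in the question (the field types are guesses):

    import java.math.BigDecimal;
    import java.time.LocalDate;
    import java.util.Objects;

    // Minimal value class: equals() and hashCode() use all four fields, so that
    // HashSet.contains() treats two identical rows as the same transaction.
    public final class Transaction {
        private final LocalDate date;
        private final String description;
        private final BigDecimal amount;  // parse with a consistent scale, since BigDecimal.equals() is scale-sensitive
        private final BigDecimal balance;

        public Transaction(LocalDate date, String description,
                           BigDecimal amount, BigDecimal balance) {
            this.date = date;
            this.description = description;
            this.amount = amount;
            this.balance = balance;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof Transaction)) return false;
            Transaction t = (Transaction) o;
            return date.equals(t.date)
                    && description.equals(t.description)
                    && amount.equals(t.amount)
                    && balance.equals(t.balance);
        }

        @Override
        public int hashCode() {
            return Objects.hash(date, description, amount, balance);
        }
    }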

Also, the elephant in the room: concurrent insertions. Will there be any? If so, how do you intend to handle it then?

Comments:

  • Thanks for taking the time to respond. Unfortunately this solution would not work for me - see clarification above. Apologies that my question was not clearer. +1 for good solution which might work for someone else. (Jun 19, 2015 at 9:55)
  • @LydiaRalph see my edited answer too. – h.j.k. (Jun 19, 2015 at 10:40)

Considering that you are working with a REST API, keep in mind that each call to the API results in some network traffic, where latency and transfer time apply. So I would say the fewer calls to the API you make (store results in memory to search through), the better.

However, when you work with an API, you don't always have a choice: if the API itself is not well engineered, or not engineered for purposes like yours, you might have no choice but to do as you stated.

Also, consider that a remote service (REST API) can be used concurrently by many users, which means that loading all the results into memory or a local DB, to process them later, may cause issues if the remote data changes in the meantime.

I'm afraid mine is not a proper answer to your question, as I believe a proper answer can only be given once the available API calls, their inputs and outputs, are well documented - but I hope you can benefit from these few tips.


You are being squeezed in both directions, with tricky input and a tricky back-end. In an ideal world, you'd challenge some of these constraints, but I get the impression you won't get anywhere.

The algorithm you describe sounds like your only option, although I think it can be improved on a little.

1. Call listTransactions and store all transactions in a HashSet
2. Iterate through the CSV file:
  1. If the line is in the HashSet, ignore
  2. Otherwise, insertTransaction
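
A minimal sketch of these steps, assuming a Transaction value class with properly implemented equals()/hashCode() (see the first answer) and hypothetical listTransactions()/insertTransaction() wrappers around the REST calls:

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class CsvUploader {

        // Hypothetical wrapper around the REST API: a real implementation would
        // parse the XML returned by listTransactions and POST to insertTransaction.
        interface Api {
            List<Transaction> listTransactions();
            void insertTransaction(Transaction t);
        }

        // Uploads only the CSV rows that are not already present in the database.
        public static void uploadNewRecords(Api api, List<Transaction> csvRecords) {
            // Step 1: fetch everything once and index it for O(1) lookups.
            Set<Transaction> existing = new HashSet<>(api.listTransactions());

            // Step 2: insert only the records that are not already there.
            for (Transaction record : csvRecords) {
                if (!existing.contains(record)) {
                    api.insertTransaction(record);
                    existing.add(record); // also guards against repeats within this run
                }
            }
        }
    }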

With a few thousand rows only, I think it's acceptable to call listTransactions every time someone uploads a CSV. If the database were larger, you would have to consider creating your own database that mirrors the REST back-end, solely for the purpose of avoiding duplicates.

One thing I don't understand, although it's not that relevant, is how the REST API can have an editTransaction when the transactions don't have an ID.

Edit - ok, although the CSV file has no ID, the back-end REST service does have an ID.

Comments:

  • Strangely enough, the editTransaction does have an ID, so the db/API must have a back-end internal ID for each transaction, which is not exposed via the listTransactions method. See help.moneytrackin.com/index.php/REST_API for the complete list. (Jun 19, 2015 at 10:43)
  • @LydiaRalph - ok, I can see then that listTransactions returns the id as an attribute on the <transaction> tag – paj28 (Jun 19, 2015 at 10:46)
  • You've got sharper eyes than me. Are you suggesting I could use the ID? I suspect this is generated every time a record is inserted, so would not prevent duplicates...? (Jun 19, 2015 at 10:49)
  • @LydiaRalph - as your incoming CSV does not have an ID, I doubt it will help you. – paj28 (Jun 19, 2015 at 11:03)
