I am developing an application in Java to parse and upload records from a CSV to an online database, via a REST API.
While I know for sure that there are no duplicate records in each CSV file, I cannot be sure that each CSV file has only been processed once (*see clarification below), so I need to check for duplicates before inserting.
[CLARIFICATION] I cannot implement a solution by checking that each CSV file has only been processed once. The CSV files contain bank transaction records downloaded from a bank. Therefore I know that each individual CSV file does not contain duplicates. However, multiple CSV files could be downloaded for the same date range, or for overlapping date ranges, so I need to check for duplicates at the transaction level rather than the file level.
Unfortunately I have no control over the back-end database, and I can only use the methods available via the API. This means the usual solutions using SQL (e.g. this question) are not suitable.
Methods available from API (http://help.moneytrackin.com/index.php/REST_API):
listTransactions
editTransaction
insertTransaction
Methods available but probably not relevant:
listProjects
listWriteProjects
getBalance
getTags
newProject
deleteProject
listTagTransactions
deleteTransaction
listCurrencies
userData
It is not a huge database: only four columns, and a few thousand records.
It seems like my only option is to iterate over each record to be inserted and compare it to each record from the database:
1. Get ListOfRecordsInDb from the database using listRecords(). Store it in a HashMap, local database or similar data structure??
2. For each record to be inserted, iterate over ListOfRecordsInDb, checking that none of them match the record to be inserted.
3. If no match is found, insert the record.
This seems very inefficient. Are there any other options? If not, what is the most efficient way to compare thousands of records, using Java?
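To make the comparison concrete, here is a minimal sketch of a hash-based variant of the steps above. It assumes the records already in the database have been fetched once and reduced to string keys. The names `DuplicateFilter`, `keyOf` and `findNew` are hypothetical and not part of the API; fetching and parsing the XML is left out because its schema is not shown here.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class DuplicateFilter {

    /** Joins the four CSV columns into one lookup key (hypothetical helper). */
    static String keyOf(String date, String description, String amount, String balance) {
        // The unit-separator character is unlikely to appear inside a field,
        // so ("ab", "c") and ("a", "bc") do not collapse into the same key.
        return String.join("\u001F", date, description, amount, balance);
    }

    /**
     * Returns the keys from csvKeys that are not yet in existingKeys.
     * Each membership check is a hash lookup, so the whole pass is roughly
     * proportional to the number of records rather than to their product.
     */
    static List<String> findNew(Set<String> existingKeys, List<String> csvKeys) {
        // Copy so that the caller's set is not modified.
        Set<String> seen = new HashSet<>(existingKeys);
        List<String> toInsert = new ArrayList<>();
        for (String key : csvKeys) {
            // add() returns false when the key is already present. This also
            // catches repeats across overlapping CSV files read in the same run.
            if (seen.add(key)) {
                toInsert.add(key);
            }
        }
        return toInsert;
    }
}
```

With a few thousand records, building the set once and probing it per CSV row should be cheap compared with the single network call that fetches the existing transactions.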
Answers to comments/questions:
What happens if you call insertTransaction with a transaction that already exists? Does it duplicate it or does it fail?
The transaction is successfully inserted as a duplicate.
Does the CSV file have an "id" column?
No. The available columns are Date, Description, Amount and Balance. The combination of these makes each record unique, so I could potentially create an ID based on these.
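As a rough illustration of deriving an ID from those four columns, the sketch below normalizes each field and hashes the result with SHA-256. The class name `TransactionId` and the normalization rules (trimming, collapsing whitespace, stripping trailing zeros from amounts) are assumptions for illustration. The real CSV and the XML returned by the API may format dates and amounts differently, in which case the normalization would need adjusting.

```java
import java.math.BigDecimal;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class TransactionId {

    /** Makes "3.50" and "3.5" compare equal by stripping trailing zeros. */
    static String normalizeAmount(String raw) {
        return new BigDecimal(raw.trim()).stripTrailingZeros().toPlainString();
    }

    /** Builds a fixed-length hex ID from Date, Description, Amount and Balance. */
    static String of(String date, String description, String amount, String balance) {
        String canonical = String.join("\u001F",
                date.trim(),
                description.trim().replaceAll("\\s+", " "), // collapse runs of whitespace
                normalizeAmount(amount),
                normalizeAmount(balance));
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            byte[] hash = sha.digest(canonical.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }
}
```

The hashing step is optional: the normalized string on its own would already work as a set or map key. The digest is only useful if a short, fixed-length ID is wanted.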
Does listRecords() allow pagination, or can it only return all of the records?
It can only return all of the records, in XML format.