
I have a function below, store_in_matchid_file, that adds certain strings to a .txt file as long as the string doesn't already exist in the file. However, that file is now millions of lines long and the membership check has become far too slow.

I was hoping someone could point out a way to speed this up by changing how I've coded the process.

def store_in_matchid_file(distinct_matchids_func):
    # Ids stored in the text file
    with open('MatchIds.txt') as f:
        current_id_list = f.read().splitlines()
    # Ids recently collected (previously a set, converted to a list to iterate over)
    distinct_matchids_list = list(distinct_matchids_func)
    # Adding Ids that don't exist in the file
    with open('MatchIds.txt', 'a') as file:
        for match in distinct_matchids_list:
            if match not in current_id_list:
                file.write(match + '\n')
  • Does it need to be a text file? If so, why? Can you not use a database, at least SQLite?
    Commented Jun 10, 2022 at 21:06
  • I was intending to use the file as a way to collect the ids, to then re-use them later. You're saying that would be better in a database? By feeding them into a different script.
    – Jack
    Commented Jun 10, 2022 at 21:08
  • A database is designed for this kind of thing - even a simple one.
    Commented Jun 10, 2022 at 21:37
  • Although I strongly agree with the database suggestion, isn't Code Review a place to get opinions on code optimization and such, rather than a complete method flip? Please don't take this personally, as it's not meant to insult or anything. I'm new around these parts; maybe I should read the site description.
    – OldFart
    Commented Jun 11, 2022 at 4:30

1 Answer


You could make some improvements:

  • Open the file just once in read+update mode.
  • Instead of current_id_list, use a set, which can test whether an item is one of the existing items quickly, regardless of how many existing items there are (see the timing sketch after this list).
  • Iterating over the file handle f will give you entries a line at a time; you don't need to read the entire file at once and call .splitlines() on it.
  • You don't need to explicitly convert distinct_matchids_func into a list; you can just iterate over it.
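To get a feel for the list-versus-set difference, here is a rough timing sketch (my own illustration, not from the original post; the exact numbers will vary by machine):

import timeit

ids = [f'MATCH_{n}' for n in range(1_000_000)]
id_set = set(ids)

# A membership test against a list scans every element in the worst case...
print(timeit.timeit("'not-there' in ids", globals=globals(), number=100))
# ...while a membership test against a set is a single hash lookup.
print(timeit.timeit("'not-there' in id_set", globals=globals(), number=100))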

Some other remarks:

  • distinct_matchids_func is a weird name — is it a function or an iterable?
  • We're assuming that the ids in distinct_matchids_func are indeed distinct, i.e. that the input itself contains no duplicates.
  • You should write a docstring for the function rather than a comment.
def store_in_matchid_file(distinct_matchids_func):
    """
    Append items from distinct_matchids_func as lines to MatchIds.txt
    that are not already present in the file.
    """
    with open('MatchIds.txt', 'r+') as f:
        current_ids = set(f)
        for match in distinct_matchids_func:
            match += '\n'
            if match not in current_ids:
                f.write(match)
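For example, calling it with a handful of ids (the ids below are made-up placeholders) appends only the ones not already in the file; note that 'r+' assumes MatchIds.txt already exists:

new_ids = {'EUW1_6000000001', 'EUW1_6000000002'}  # hypothetical match ids
store_in_matchid_file(new_ids)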

But as @Reinderien suggests in a comment, storing these entries in a database such as SQLite could be a better option. It's going to perform better (since it uses data structures that facilitate quick searches), and it will be less prone to corruption (for example, if multiple processes try to manipulate the file at the same time).
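A minimal sketch of what that could look like with the standard-library sqlite3 module (the table name and database file name here are my own placeholders, not something from the question):

import sqlite3

def store_in_matchid_db(distinct_matchids, db_path='MatchIds.sqlite3'):
    """Insert match ids, skipping any that are already stored."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            # The PRIMARY KEY gives SQLite an index, so duplicate checks are cheap.
            conn.execute('CREATE TABLE IF NOT EXISTS match_ids (id TEXT PRIMARY KEY)')
            conn.executemany(
                'INSERT OR IGNORE INTO match_ids (id) VALUES (?)',
                ((match,) for match in distinct_matchids),
            )
    finally:
        conn.close()

Here the database maintains the index itself, so nothing has to be read back into memory, and SQLite takes care of locking if more than one process writes at the same time.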

