4
\$\begingroup\$

I implemented the following code to retrieve medication names from a given text (prescription_text in the code). It works by matching words in the text with a list of existing medications (US_DRUGS in the code).

The code is working but it's very slow, since the medication list that I use is pretty big (150,000 items). I need it to be faster.

The key part of this code is obviously this one:

for DRUG in US_DRUGS:
    for match in regex.finditer(re.escape(DRUG.name), re.escape(normalized_prescription_text), re.IGNORECASE):
        matched_drugs.append(Drug(DRUG.name, DRUG.atc, match.span()))

Can someone take a look at my code and maybe show me some improvements I could make?

The code:

import unidecode


class Drug:
    def __init__(self, name, atc, position):
        # normalize name make it capitalize
        self.name = name.upper()
        self.atc = atc
        self.start = position[0] if position is not None else None
        self.stop = position[1] if position is not None else None

    def __eq__(self, other):
        return self.name == other.name and self.atc == other.atc

    def __hash__(self):
        return hash(( 'name', self.name, 'atc', self.atc ))


US_DRUGS = [
    Drug("MED1", 1, None),
    Drug("MED2", 2, None),
    Drug("MED3", 3, None),
    Drug("MED4", 4, None),
    Drug("MED5", 5, None),
    Drug("MED6", 6, None)
    # imagine this list way bigger (around 150,000 items)
]


def _extract_drugs_from_prescription_text(prescription_text):
    # normalize prescription text (remove accents)
    normalized_prescription_text = unidecode.unidecode(prescription_text)
    # remove non word character
    normalized_prescription_text = re.sub(r'\W+', ' ', normalized_prescription_text)

    # For every occurrence of a drug's name in the prescription text
    # it will append a Drug() object with match's details in a list
    matched_drugs = []
    for DRUG in US_DRUGS:
        for match in regex.finditer(re.escape(DRUG.name), re.escape(normalized_prescription_text), re.IGNORECASE):
            matched_drugs.append(Drug(DRUG.name, DRUG.atc, match.span()))

    # Will clean up the matches list from duplicates substring
    # ex: 'DOLIPRANE' and 'DOLIPRANE CODEINE'
    # if they start at the same point, first one is removed
    matched_drugs_without_substring = []
    for match in matched_drugs:
        if [m for m in matched_drugs if m.start <= match.start <= m.stop and len(match.name) < len(m.name)]:
            pass
        else:
            matched_drugs_without_substring.append(match)

    # remove duplicates
    return list(set(matched_drugs_without_substring))


if __main__ == "__main__":
    prescription_text = "- TEST - some example text here with some medication names like MED1, MED2, MED3. End of the test #$%^"
    _extract_drugs_from_prescription_text(prescription_text)
\$\endgroup\$
4
  • 2
    \$\begingroup\$Please include a textual rendition of the not substring constraint. For greetings and thanks, please see Expected behaviour.\$\endgroup\$
    – greybeard
    Commented Apr 13, 2023 at 16:23
  • \$\begingroup\$First thought: maybe make a dict with the drugs names as keys, and the drug objects as values and compare the words in the text against the keys? That would be O(1).\$\endgroup\$ Commented Apr 13, 2023 at 17:15
  • 1
    \$\begingroup\$Your code is not working or not complete: undefined name '__main__', undefined name 're', undefined name 'regex'\$\endgroup\$ Commented Apr 13, 2023 at 17:24
  • \$\begingroup\$@JanvanWijk no way to implement this in O(1) time... I suppose you mean O(1) for each access to the drugs dict? Anyway it won't be that simple if drugs can have names with multiple words. J_H goes into detail about this problem in his answer....\$\endgroup\$
    – slepic
    Commented Apr 14, 2023 at 4:42

4 Answers

6
\$\begingroup\$

You have some bugs as noted in my comment, though you say the code is working, so I'm going to assume this is a custom snippet you've made for here.

"Normalising" text

You provide few details about the structure of prescription_text other than the possibilities of odd characters. Likewise about drug names, what atc is, etc.

Ask yourself the following questions:

  • You're removing accents from your text, are your drugs likely to have accents in their names?
  • If not, do you care that the accents are there as they won't match your drug names anyway?
  • Do you care if there are non-"word" characters in there?
  • Are there likely to be non-"word" characters in your drug names (e.g. a hyphen)? In that case we really don't want to strip them.

Searching RE

Instead of iterating through each drug and through all the text multiple times, why not construct a single RE to find (rule) them all? This may be a pain if the list is so big that this takes forever.

# I'm adding the excessive `\b`s here in case you have a drug called
# `hand` and the text contains `chandler`, for example
drug_re = re.compile(r"|".join(rf"\b{re.escape(drug)}\b" for drug in US_DRUGS),
                     re.IGNORECASE)

If we're smart about US_DRUGS, we can make it into a dict of drugname: atc, since that's all you care about from the Drug class there (I don't know what atc means, but I assume it's some number).
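A minimal way to build that mapping (a sketch; it assumes the original list of Drug objects is called drugs):

US_DRUGS = {drug.name: drug.atc for drug in drugs}  # name -> atc lookup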

We can then build this directly into a set with a set comprehension:

matched_drugs = {Drug(match.group(0), US_DRUGS[match.group(0)], match.span())
                 for match in drug_re.finditer(normalized_prescription_text)}

Or you could use the method proposed here, using a dict of key: val where the key is the element on which to eliminate duplicates (in your case set({match.start: Drug(...) for ...}.values())). At the same time, the documentation of re.finditer says:

re.finditer(pattern, string, flags=0)

Return an iterator yielding match objects over all non-overlapping matches for the RE pattern in string. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result.

Note "non-overlapping". So we shouldn't need to do this anyway by using one mega RE. REs tend to try to match the longest possible.

Alternative

Alternatively, we may not need to use REs at all. If we're just checking whether the drug exists in the string:

matched_drugs = {drug for drug in US_DRUGS if drug.name in prescription_text} 

We may then need to filter these to check whether overlapping names are in the list, so we might want to capture (drug, prescription_text.index(drug.name)) instead and find the longest drug name with the same (or overlapping) indices.
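A rough sketch of that capture-and-filter idea (only the first occurrence of each name is considered, which is all str.index reports):

# keep, per starting index, only the longest drug name found there
hits = [(drug, prescription_text.index(drug.name))
        for drug in US_DRUGS if drug.name in prescription_text]
longest_at = {}
for drug, start in hits:
    if start not in longest_at or len(drug.name) > len(longest_at[start].name):
        longest_at[start] = drug
matched_drugs = set(longest_at.values())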

Dataclass

Drug is just a store for the relevant data, essentially a fancy tuple. That means we might want to make Drug a dataclass, which makes its intent clear and generates the relevant functions (__init__ and __eq__; with frozen=True it also generates __hash__) in a standard way.

from dataclasses import dataclass

@dataclass(frozen=True)
class Drug:
    name: str
    atc: int

It's as simple as that.

\$\endgroup\$
    6
    \$\begingroup\$

    This code does not appear to be quite ready for review, yet. I will address two topics: algorithm and code style.


    Algorithm

    We have D drug names; each is a phrase of one or more words, up to a max length of L words.

    In a corpus of W words we wish to identify each drug phrase.

    There are prefix restrictions on drug names. If "foo" and "bar" are each drugs, then "foo bar" may not be a drug name. If "bar" is not a drug, then we can have both "foo" and the more specific "foo bar". So we prefer longer matches.

    We normalize text, both drug names and words in the corpus. It makes sense to do this during initialization, hoisting it out of the matching loop.

    1 - current algorithm

    In nested loops of roughly D × L × W, we look for exact matches. Let's declare that L is a constant, so it has O(D × W) running time.

    2 - memory resident hash

    We could run an L window over W words, doing O(1) probes of a hash table of drug names. That is L × W probes, or O(D + W) given that L is a constant.
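    A minimal sketch of this approach, assuming both the drug names and the corpus are already normalized and split into words:

    def find_drugs_hashed(drug_names, words, L=4):
        # O(D) build of the hash table
        drug_set = set(drug_names)
        matches = []
        for i in range(len(words)):
            # try the longest phrase first, so "FOO BAR" wins over "FOO"
            for n in range(min(L, len(words) - i), 0, -1):
                phrase = " ".join(words[i:i + n])
                if phrase in drug_set:
                    matches.append((i, phrase))
                    break
        return matches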

    3 - merge sort

    Reverse sort the drug names in O(D log D) time, so a longer "foo bar" appears before "foo". Similarly sort each window of L words in the corpus, with O(W log W) cost. Then merge the two sorted lists in O(D + W) time.

    W >> D would give us a cost of O(W log W).

    The constant factor here may be more attractive than the hash table approach, as we are doing sequential scans rather than random reads.

    This approach is amenable to drug lists and corpora that are larger than RAM, and is well suited to implementing with sqlite or another RDBMS. It also accommodates fuzzy matching, at least when the cost function insists on the first few characters being an exact match. That will be true for fuzzy matches that insist on exact match in the initial word.
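    A sketch of the sorted-merge idea, assuming normalized input; both lists are sorted descending, so longer phrases come before their prefixes:

    def find_drugs_merged(drug_names, words, L=4):
        drugs = sorted(set(drug_names), reverse=True)    # "FOO BAR" before "FOO"
        windows = sorted(
            ((" ".join(words[i:i + n]), i)
             for i in range(len(words))
             for n in range(1, L + 1)),
            reverse=True,
        )
        matches, d = [], 0
        for phrase, start in windows:                    # one linear merge pass
            while d < len(drugs) and drugs[d] > phrase:
                d += 1
            if d < len(drugs) and drugs[d] == phrase:
                matches.append((start, phrase))
        return matches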


    Code style

    constructor

    class Drug:
        def __init__(self, name, atc, position):

    ATC is jargon, and apparently describes an Anatomical Therapeutic Chemical. It merits a # comment.

    This class appears to be misnamed. Consider calling it DrugMatch.

    Consider accepting a normalized Drug rather than name plus atc.

    A """docstring""" should explain the value-add that this class brings to the table. That would help us with design decisions like what args the ctor should accept.

     # normalize name make it capitalize 

    The "capitalize" remark is merely repeating what the .upper() code already eloquently stated.

    We use # comments for the "why", and code for the "how".

     self.start = position[0] if position is not None else None
     self.stop = position[1] if position is not None else None

    Rather than None, model "unknown" with (None, None). Write it as self.start, self.stop = position and be done with it. Better, choose to use a range object rather than a pair of integers. It has the advantage of making it clear we're talking about a half-open interval.
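    A sketch of how that might look, with the rename applied (DrugMatch and its arguments are hypothetical):

    class DrugMatch:
        """A drug found in a text, plus a half-open match interval."""
        def __init__(self, drug, position):
            self.drug = drug                  # an already-normalized Drug
            self.span = range(*position)      # gives .start and .stop for free

    m = DrugMatch("MED1", (4, 8))
    assert m.span.start == 4 and m.span.stop == 8
    assert 7 in m.span and 8 not in m.span    # clearly half-open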


    __hash__

     def __hash__(self):
         return hash(("name", self.name, "atc", self.atc))

    I really like that you rely on tuple to offer correct results.

    Note that the pair of constant strings isn't helping us out here, and merely slows things down.

    BTW, I found it necessary to define __repr__() in order to properly debug this code's behavior.


    normalize

    def _extract_drugs_from_prescription_text( ...
        normalized_prescription_text = unidecode.unidecode(prescription_text)

    Kudos on the nice naming.

    We are normalizing because we want corpus words and drug names to exactly match up. It troubles me that we haven't broken out a common helper which both types of text are sent through.

     # remove non word character
     normalized_prescription_text = re.sub(r'\W+', ' ', normalized_prescription_text)

    Characters like . dot and - dash tend to be important in drug names. Mapping them to SPACE is fine. But did we do that in the Drug class?
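    One way to guarantee both sides get identical treatment is a single helper, e.g. this sketch (the exact rules are up to you):

    import re
    import unidecode

    def normalize(text):
        # one code path for both drug names and prescription text
        text = unidecode.unidecode(text).upper()    # strip accents, fold case
        return re.sub(r"\W+", " ", text).strip()    # collapse non-word runs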


    longest match

     # Will clean up the matches list from duplicates substring 

    This is a good comment, thank you, it's helpful.

    But the fact that you felt you needed to write it suggests that the code is not adhering to the single responsibility principle. Consider doing Extract Helper.

    The third proposed algorithm makes such filtering a bit easier, as we encounter longest match first and can use that to filter shorter matches in immediately following records.


    de-dup

     # remove duplicates
     return list(set(matched_drugs_without_substring))

    Instead of list, prefer sorted. What it prints is easier for humans to read, and the deterministic output is much easier for unit tests to evaluate.
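    For example (the sort key is an assumption; any deterministic key will do):

    def unique_sorted(matches):
        # dedupe via the set, then impose a stable, human-friendly order
        return sorted(set(matches), key=lambda m: (m.start, m.name))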


    This code achieves its design goals, slowly. There are straightforward ways to speed it up.

    I would be willing to delegate or accept maintenance tasks on this codebase.

    \$\endgroup\$
      2
      \$\begingroup\$

      The number of steps the computer has to do is the length of prescription_text PLUS the total length of US_DRUGS PLUS the number of medications found.

      This is achieved by the Aho-Corasick algorithm. This algorithm formed the basis of the original Unix command fgrep, so you can surely find implementations that are this fast. (GNU grep uses a different algorithm whose efficiency for your case is unknown to me, but you may give it a try.)

      Moreover, you can break the algorithm into two phases: 1. create a state machine (or a tree) from US_DRUGS; 2. apply the machine to prescription_text, or to as many files as you wish, efficiently! I am optimistic you can do the two phases separately, e.g. in Python. See e.g. this (Python with C) and this (Python).
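      For instance, a sketch with the pyahocorasick package (the package choice and the name -> atc dict are assumptions):

      import ahocorasick

      # phase 1: build the automaton once from the drug list
      automaton = ahocorasick.Automaton()
      for name, atc in US_DRUGS.items():
          automaton.add_word(name, (name, atc))
      automaton.make_automaton()

      # phase 2: scan any number of prescription texts in linear time
      for end, (name, atc) in automaton.iter(prescription_text):
          print(name, atc, (end - len(name) + 1, end + 1))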

      Practically, I recommend trying grep -F -o -f US_DRUGS prescription_text prescription_text2 other_prescription_texts* with several grep (fgrep) implementations to see if you get the speed you need, and as a benchmark to compare your Python programs against (if you insist on Python being the final solution).

      -F means fixed strings (not regular expressions) (equivalent to using fgrep)

      -f file means take the patterns from file, one per line.

      -o means to print only the matching part, not the whole line.

      My apologies for finding barMed if foobarMed is in the prescriptions (substring matching). Maybe remove -F and put lines like \<drug\> (or \bdrug\b ??) in the file AND benchmark the speed AGAIN to check it did not reduce the speed. (\<, \> can be implemented efficiently, but may not be.) Or better, check the neighbouring two characters a posteriori by hand.

      \$\endgroup\$
        0
        \$\begingroup\$

        Flip the problem

        The current code loops over 150,000+ drug names and checks to see which ones are contained in the text. That's 150,000+ tests.

        The main idea is to split the text into words then check which words match a drug name. That's roughly a dozen dict lookups--much faster.

        To account for some drugs having multiple words, build a dict keyed by the first word in a drug name. The value is a list of drugs whose names start with that word.

        from collections import defaultdict

        lookup = defaultdict(list)
        for drug in US_DRUGS:
            head, *_ = drug.name.split()
            lookup[head].append(drug)

        Then split the normalized prescription text into words. Check each word to see if it is the first part of a drug name. If it is then check the list of drugs to see which one is actually contained in the text.

        matched_drugs = []
        for word in normalized_text.split():
            drugs = lookup.get(word)
            if drugs:
                for drug in drugs:
                    if drug.name in normalized_text:
                        matched_drugs.append(drug)

        Obviously, this is just an outline of the code. It needs to be adapted to your actual code. Also, if the text contains a combination drug like "sacubitril and valsartan" the code may add "sacubitril", "valsartan", and "sacubitril and valsartan" to matched_drugs.
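        If that matters, one possible post-filter is a sketch like this (it assumes you also recorded each match's start and stop offsets):

        def drop_contained(matches):
            # matches: list of (start, stop, drug); keep only matches that
            # are not strictly inside a longer match
            return [(s, e, d) for (s, e, d) in matches
                    if not any(s2 <= s and e <= e2 and (e2 - s2) > (e - s)
                               for (s2, e2, _) in matches)]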

        Timing

        I downloaded a list of the 1100 most common drugs, then added name-mangled variations of each one to get a list of 185,172 drug names. The expanded list should have approximately the same characteristics as the initial list (e.g. the same percentage of 2- or 3-word names, the same percentage of names with the same first word but different second words, etc.)

        Using a sample text with 4 random drug names and 7 other words, timeit reports:

        original code: about 16 ms
        revised code:  about 3 µs

        An improvement of more than three orders of magnitude.

        \$\endgroup\$
