\$\begingroup\$

This is my first post here and I hope to get some recommendations to improve my code. I have a parser that processes a file with the following structure:

```
SE|43171|ti|1|text|Distribution of metastases...
SE|43171|ti|1|entity|C0033522
SE|43171|ti|1|relation|C0686619|COEXISTS_WITH|C0279628

SE|43171|ab|2|text|The aim of this study...
SE|43171|ab|2|entity|C2744535
SE|43171|ab|2|relation|C0686619|PROCESS_OF|C0030705

SE|43171|ab|3|text|METHODS Between April 2014...
SE|43171|ab|3|entity|C1964257
SE|43171|ab|3|entity|C0033522
SE|43171|ab|3|relation|C0085198|INFER|C0279628
SE|43171|ab|3|relation|C0279628|PROCESS_OF|C0030705

SE|43171|ab|4|text|Lymph node stations...
SE|43171|ab|4|entity|C1518053
SE|43171|ab|4|entity|C1515946
```

Records (i.e., blocks) are separated by an empty line. Each line in a block starts with an SE tag; the text tag always occurs in the first line of each block (in the 4th field). The program extracts:

  1. All relation tags in a block, and
  2. The corresponding text (i.e., the sentence ID (sent_id) and sentence text (sent_text)) from the first line of the block, if a relation tag is present. Note that a relation tag is not necessarily present in every block.

Below is a mapping dictionary between tags and the related fields in the file, followed by the main program.

```python
# Specify mappings to parse lines from the input file
mappings = {
    "id": 1,
    "text": {
        "sent_id": 3,
        "sent_text": 5,
    },
    "relation": {
        "subject": 5,
        "predicate": 6,
        "object": 7,
    },
}
```

Finally, the code:

```python
def extraction(file_in):
    """This function extracts lines with 'text' and 'relation' tag in the 4th field."""
    extraction = {}
    file = open(file_in, encoding='utf-8')
    bla = {'text': []}  # Create dict to store sent_id and sent_text
    for line in file:
        results = {'relations': []}
        if line.startswith('SE'):
            elements = line.strip().split('|')
            pmid = elements[1]  # Extract number from the 2nd field
            if elements[4] == 'text':
                tmp = {}
                for key, idx in mappings['text'].items():
                    tmp[key] = elements[idx]
                bla['text'].append(tmp)  # Append sent_id and sent_text
            if elements[4] == 'relation':
                tmp = {}
                for key, ind in mappings['relation'].items():
                    tmp[key] = elements[ind]
                tmp.update(sent_id=bla['text'][0]['sent_id'])
                tmp.update(sent_text=bla['text'][0]['sent_text'])
                results['relations'].append(tmp)
                extraction[pmid] = extraction.get(pmid, []) + results['relations']
        else:
            bla = {'text': []}  # Empty dict to store new sent_id and text_id
    file.close()
    return extraction
```

The output looks like:

```python
import json
print(json.dumps(extraction('test.txt'), indent=4))
```

```json
{
    "43171": [
        {
            "subject": "C0686619",
            "predicate": "COEXISTS_WITH",
            "object": "C0279628",
            "sent_id": "1",
            "sent_text": "Distribution of lymph node metastases..."
        },
        {
            "subject": "C0686619",
            "predicate": "PROCESS_OF",
            "object": "C0030705",
            "sent_id": "2",
            "sent_text": "The aim of this study..."
        },
        {
            "subject": "C0085198",
            "predicate": "INFER",
            "object": "C0279628",
            "sent_id": "3",
            "sent_text": "METHODS Between April 2014..."
        },
        {
            "subject": "C0279628",
            "predicate": "PROCESS_OF",
            "object": "C0030705",
            "sent_id": "3",
            "sent_text": "METHODS Between April 2014..."
        }
    ]
}
```

Thanks for any recommendation.

\$\endgroup\$
  •
    \$\begingroup\$Welcome to CodeReview@SE. Shall parsing be lenient (allow variations as long as they are recognisable) or strict (e.g., insist on the 3rd part to abbreviate title or abstract, resp.)?\$\endgroup\$
    – greybeard
    Commented Sep 14, 2020 at 7:25
  • \$\begingroup\$I'm afraid that I don't understand your question completely.\$\endgroup\$
    – Andrej
    Commented Sep 14, 2020 at 7:29
  •
    \$\begingroup\$extraction()'s doc string is enough to refrain from commenting on not handling "entity lines". Documenting, or at least commenting on strictness would help the same way.\$\endgroup\$
    – greybeard
    Commented Sep 14, 2020 at 7:33

1 Answer

\$\begingroup\$

Nothing in your code is unreasonable and you've done a careful job. Nonetheless, I would encourage you to decompose the parsing into smaller, simpler steps.

As a first step in that direction, a parsing function should take some text (either a blob or an iterable of lines) and return or yield meaningful data. It should not be concerned with the source of that text or the details of how to obtain it (e.g., reading a file). So the top-level extraction() function should just open the file and give the file handle to a separate parsing function. For example:

```python
def extraction(file_in):
    with open(file_in, encoding='utf-8') as fh:
        return parse(fh)
```
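This separation pays off immediately: a file handle, an io.StringIO buffer, and a plain list of strings are all iterables of lines, so the same parsing function serves every caller. A minimal sketch of that idea (the trivial stand-in parser here is for illustration only, not the real parse() from this answer):

```python
import io

def parse(lines):
    # Stand-in parser for illustration: collect the stripped, non-empty lines.
    # The real parse() has the same signature -- any iterable of lines works.
    return [line.strip() for line in lines if line.strip()]

text = "SE|43171|ti|1|text|Distribution...\n\nSE|43171|ab|2|text|The aim...\n"

# The same function accepts an in-memory buffer or a plain list of strings.
from_buffer = parse(io.StringIO(text))
from_list = parse(text.splitlines())
```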

And the parse() function should also focus more tightly on orchestrating the parsing steps and assembling the final data structure. In the illustration below, notice that this design simplifies what one has to do to test the parsing logic. Just feed it a list of lines -- no need to create a file for the text. That's better for automated testing, better for quick experimentation, better for sharing code with people on this website.

```python
def parse(lines):
    extr = {}
    for rec in parse_records(lines):
        pmid, relations = parse_record(rec)
        extr.setdefault(pmid, []).extend(relations)
    return extr
```

Your data format implies two parsing phases. The first is to break the full text into records (the blocks or paragraphs), as shown below. Notice how easy this code is to understand: it's just breaking text apart into blank-line delimited sections. Also notice that this function is general-purpose and can be re-used in other parsing situations.

```python
def parse_records(lines):
    rec = []
    for line in lines:
        if line.strip():
            rec.append(line)
        elif rec:
            yield rec
            rec = []
    if rec:
        yield rec
```
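Because parse_records() knows nothing about the SE format, it can be checked in isolation with any list of strings. A quick sketch (the function is repeated so the snippet runs on its own):

```python
def parse_records(lines):
    # Group an iterable of lines into blank-line-delimited records.
    rec = []
    for line in lines:
        if line.strip():
            rec.append(line)
        elif rec:
            yield rec
            rec = []
    if rec:
        yield rec

# Works on any iterable of lines, not just the SE file format;
# consecutive blank lines do not produce empty records.
lines = ["a", "b", "", "", "c", ""]
grouped = list(parse_records(lines))
```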

The second phase is to parse a single record. Just as we relieved the parsing function of the burden of knowing about file opening and reading, the record parser should be relieved of the burden of knowing about the larger file structure. The function below illustrates one way to parse the record. A few things are worth noting. First, I don't think your mapping dict is really helping you. The impulse to create it was understandable, but it never truly generalized or parameterized the parsing: the code still contained hard-coded column locations, and it still had to know about the main signposts in the input text (the tokens SE, |, text, and relation) and how to interpret them for parsing purposes. Since it didn't get you fully to re-usable code, I would just embrace a little hard-coding. Second, notice how this code is also easy to scan, read, and understand -- because it's doing so much less. Finally, because of that tighter focus, the function is fairly easy to test now. And that's important for this function because it's the one with the most algorithmic complexity and the most potential for problems due to irregular data.

```python
def parse_record(rec):
    pmid = None
    sent_id = None
    sent_text = None
    relations = []
    for line in rec:
        if not line.startswith('SE'):
            continue
        xs = line.strip().split('|')
        row_type = xs[4]
        if row_type == 'text':
            pmid = xs[1]
            sent_id = xs[3]
            sent_text = xs[5]
        elif row_type == 'relation':
            relations.append({
                'subject': xs[5],
                'predicate': xs[6],
                'object': xs[7],
                'sent_id': sent_id,
                'sent_text': sent_text,
            })
    return (pmid, relations)
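Putting the pieces together, the testability benefit mentioned earlier can be demonstrated directly: feed the pipeline a plain list of strings (an abbreviated version of the question's sample input) and inspect the result. The snippet below repeats a lightly condensed version of the three functions so it runs on its own:

```python
def parse_records(lines):
    # Split an iterable of lines into blank-line-delimited records.
    rec = []
    for line in lines:
        if line.strip():
            rec.append(line)
        elif rec:
            yield rec
            rec = []
    if rec:
        yield rec

def parse_record(rec):
    # Pull the sentence info from the 'text' line; attach it to each relation.
    pmid = sent_id = sent_text = None
    relations = []
    for line in rec:
        if not line.startswith('SE'):
            continue
        xs = line.strip().split('|')
        if xs[4] == 'text':
            pmid, sent_id, sent_text = xs[1], xs[3], xs[5]
        elif xs[4] == 'relation':
            relations.append({
                'subject': xs[5], 'predicate': xs[6], 'object': xs[7],
                'sent_id': sent_id, 'sent_text': sent_text,
            })
    return pmid, relations

def parse(lines):
    extr = {}
    for rec in parse_records(lines):
        pmid, relations = parse_record(rec)
        extr.setdefault(pmid, []).extend(relations)
    return extr

# Abbreviated sample from the question, as a plain list -- no file needed.
sample = [
    "SE|43171|ti|1|text|Distribution of metastases...",
    "SE|43171|ti|1|entity|C0033522",
    "SE|43171|ti|1|relation|C0686619|COEXISTS_WITH|C0279628",
    "",
    "SE|43171|ab|2|text|The aim of this study...",
    "SE|43171|ab|2|relation|C0686619|PROCESS_OF|C0030705",
]
result = parse(sample)
```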
\$\endgroup\$
  •
    \$\begingroup\$Great review, one suggestion. Given the requirements, in parse_record I would throw an exception when a line doesn't start with "SE" or when "text" is not in the first line.\$\endgroup\$
    – Marc
    Commented Sep 14, 2020 at 9:07
  •
    \$\begingroup\$@Marc Thanks for reminding me (at one point I had the same thought, but was too lazy or forgetful). Fixed. Other errors can occur in this function if the data is not fully regular, but the OP can address those specifics.\$\endgroup\$
    – FMc
    Commented Sep 14, 2020 at 9:55
