This is my first post here, and I hope to get some recommendations for improving my code. I have a parser which processes a file with the following structure:
    SE|43171|ti|1|text|Distribution of metastases...
    SE|43171|ti|1|entity|C0033522
    SE|43171|ti|1|relation|C0686619|COEXISTS_WITH|C0279628

    SE|43171|ab|2|text|The aim of this study...
    SE|43171|ab|2|entity|C2744535
    SE|43171|ab|2|relation|C0686619|PROCESS_OF|C0030705

    SE|43171|ab|3|text|METHODS Between April 2014...
    SE|43171|ab|3|entity|C1964257
    SE|43171|ab|3|entity|C0033522
    SE|43171|ab|3|relation|C0085198|INFER|C0279628
    SE|43171|ab|3|relation|C0279628|PROCESS_OF|C0030705

    SE|43171|ab|4|text|Lymph node stations...
    SE|43171|ab|4|entity|C1518053
    SE|43171|ab|4|entity|C1515946
Records (i.e., blocks) are separated by an empty line. Each line in a block starts with an `SE` tag; the `text` tag always occurs in the first line of each block (in the 4th field, counting from zero). The program extracts:
- All `relation` tags in a block, and
- Corresponding text (i.e., sentence ID (`sent_id`) and sentence text (`sent_text`)) from the first line of the block, if a `relation` tag is present in the block.

Please note that the `relation` tag is not necessarily present in each block.
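To make the block structure concrete, here is a minimal sketch of how such input could be split into blocks on empty lines. The helper name `iter_blocks` is hypothetical and is not part of my program; it assumes the input is any iterable of lines, and the sample lines are taken from the example above.

```python
from itertools import groupby


def iter_blocks(lines):
    """Yield each block as a list of '|'-split rows.

    Hypothetical helper: a block is a run of non-empty lines,
    and blocks are separated by one or more empty lines.
    """
    stripped = (line.strip() for line in lines)
    # groupby collects consecutive lines that are all empty or all non-empty
    for has_content, group in groupby(stripped, key=bool):
        if has_content:
            yield [row.split('|') for row in group]


sample = [
    "SE|43171|ti|1|text|Distribution of metastases...\n",
    "SE|43171|ti|1|relation|C0686619|COEXISTS_WITH|C0279628\n",
    "\n",
    "SE|43171|ab|4|text|Lymph node stations...\n",
    "SE|43171|ab|4|entity|C1518053\n",
]
blocks = list(iter_blocks(sample))
```

With this sample, `blocks` holds two blocks: the first contains a `text` row and a `relation` row, the second a `text` row and an `entity` row.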
Below is a dictionary that maps tags to the corresponding field indices in the file, followed by the main program.
    # Specify mappings to parse lines from input file
    mappings = {
        "id": 1,
        "text": {
            "sent_id": 3,
            "sent_text": 5
        },
        "relation": {
            'subject': 5,
            'predicate': 6,
            'object': 7,
        }
    }
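As a quick illustration of how these indices are meant to be used, the sketch below picks the mapped fields out of a single split line. The helper `pick_fields` is hypothetical and only for demonstration; the mapping is repeated so that the snippet runs on its own.

```python
mappings = {
    "id": 1,
    "text": {"sent_id": 3, "sent_text": 5},
    "relation": {"subject": 5, "predicate": 6, "object": 7},
}


def pick_fields(elements, field_map):
    """Return {name: elements[index]} for each entry in field_map (hypothetical helper)."""
    return {name: elements[index] for name, index in field_map.items()}


line = "SE|43171|ti|1|relation|C0686619|COEXISTS_WITH|C0279628"
elements = line.split('|')
tag = elements[4]  # the tag sits at index 4 of the split line
record = pick_fields(elements, mappings[tag])
```

Here `record` ends up with the `subject`, `predicate` and `object` keys filled from indices 5, 6 and 7 of the split line.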
Finally, the code:
    def extraction(file_in):
        """This function extracts lines with 'text' and 'relation' tag in the 4th field."""
        extraction = {}
        file = open(file_in, encoding='utf-8')
        bla = {'text': []}  # Create dict to store sent_id and sent_text
        for line in file:
            results = {'relations': []}
            if line.startswith('SE'):
                elements = line.strip().split('|')
                pmid = elements[1]  # Extract number from the 2nd field
                if elements[4] == 'text':
                    tmp = {}
                    for key, idx in mappings['text'].items():
                        tmp[key] = elements[idx]
                    bla['text'].append(tmp)  # Append sent_id and sent_text
                if elements[4] == 'relation':
                    tmp = {}
                    for key, ind in mappings['relation'].items():
                        tmp[key] = elements[ind]
                    tmp.update(sent_id = bla['text'][0]['sent_id'])
                    tmp.update(sent_text = bla['text'][0]['sent_text'])
                    results['relations'].append(tmp)
                extraction[pmid] = extraction.get(pmid, []) + results['relations']
            else:
                bla = {'text': []}  # Empty dict to store new sent_id and text_id
        file.close()
        return extraction
The output looks like:
    import json
    print(json.dumps(extraction('test.txt'), indent=4))

    {
        "43171": [
            {
                "subject": "C0686619",
                "predicate": "COEXISTS_WITH",
                "object": "C0279628",
                "sent_id": "1",
                "sent_text": "Distribution of lymph node metastases..."
            },
            {
                "subject": "C0686619",
                "predicate": "PROCESS_OF",
                "object": "C0030705",
                "sent_id": "2",
                "sent_text": "The aim of this study..."
            },
            {
                "subject": "C0085198",
                "predicate": "INFER",
                "object": "C0279628",
                "sent_id": "3",
                "sent_text": "METHODS Between April 2014..."
            },
            {
                "subject": "C0279628",
                "predicate": "PROCESS_OF",
                "object": "C0030705",
                "sent_id": "3",
                "sent_text": "METHODS Between April 2014..."
            }
        ]
    }
Thanks for any recommendation.
`extraction()`'s doc string is enough to refrain from commenting on not handling "`entity` lines". Documenting, or at least commenting on strictness would help the same way.