4
\$\begingroup\$

I have a python script that downloads, unzip and parses an XML file published by a Canadian institution. Only some very specific tags are extracted and then all put into a pandas dataframe for later processing.

Everything works well. I just wonder if there is room for improvement here, specially in the parsing part. I am not sure if the nested for I use are a good idea or there is a better and cleaner way to parse.

import requests import zipfile import os import glob from lxml import etree from io import StringIO, BytesIO import pandas as pd import xml.etree.ElementTree as ET def download_file(url,filename): r = requests.get(url, allow_redirects=True) open(filename, 'wb').write(r.content) def unzip_and_delete(filename): zip_file = zipfile.ZipFile(filename,'r') zip_file.extractall() zip_file.close() os.remove(filename) def parse_xml_fields(file, base_tag, tag_list,final_list): root = etree.parse(file) nodes = root.findall("//{}".format(base_tag)) for node in nodes: item = {} for tag in tag_list: if node.find(".//{}".format(tag)) is not None: item[tag] = node.find(".//{}".format(tag)).text.strip() final_list.append(item) # My variables field_list = ["MsbRegistrationNumber","StatusDescriptionEnglish","Surname","GivenName","MiddleName","Name","StreetAddress"] entities_list = [] download_file('http://www10.fintrac-canafe.gc.ca/msb-esm/public/msb-search/zipdownload-eng/', 'fintrac.zip') unzip_and_delete('fintrac.zip') parse_xml_fields("MsbRegistryPublicDataFile.xml", "MsbInformation", field_list, entities_list) df = pd.DataFrame(entities_list, columns=field_list) df.to_excel("Canada_MSB_List.xlsx") 
\$\endgroup\$

    2 Answers 2

    2
    \$\begingroup\$

    You are traversing the tree searching for the same element twice - once when checking if it exists and then to get the text. You can do it once and find a text with .findtext() directly:

    for node, tag in itertools.product(nodes, tag_list): node_text = node.findtext(".//{}".format(tag)) if node_text is not None: final_list.append({tag: node_text.strip()}) 

    Other notes:

    • remove unused imports and structure them according to PEP8 (reference)
    • use with context managers when opening a file and zip file
    • it would be cleaner if parse_xml_fields() would return a list of results instead of modifying an input list
    \$\endgroup\$
    1
    • \$\begingroup\$Thanks, aside from the main explanation about traversing the tree, I found useful your other notes to get my code cleaner.\$\endgroup\$
      – telex-wap
      CommentedAug 7, 2017 at 15:10
    3
    \$\begingroup\$

    You may import itertools and then replace the part

    for node in nodes: item = {} for tag in tag_list: if node.find(".//{}".format(tag)) is not None: item[tag] = node.find(".//{}".format(tag)).text.strip() final_list.append(item) 

    with

    for node, tag in itertools.product(nodes, tag_list): if node.find(".//{}".format(tag)) is not None: final_list.append({tag: node.find(".//{}".format(tag)).text.strip()}) 

    as itertools.product() creates a cartesian product of your two lists (and so the temporary dictionary item is no longer needed).

    \$\endgroup\$
    2
    • \$\begingroup\$Thanks but I see something strange: this itertools.product approach is creating a new dictionary per every tag match found, when my previous method would create a dictionary containing all tag matches for one node. So for instance, if an entity has 7 tags, rather than one dictionary with 7 keys, I am getting 7 dictionaries with one key.\$\endgroup\$
      – telex-wap
      CommentedAug 7, 2017 at 15:15
    • \$\begingroup\$For such small number of tags and nodes, when moreover the most consuming time are network transfers, it has no importance - the readability has, however (IMHO).\$\endgroup\$
      – MarianD
      CommentedAug 7, 2017 at 16:15

    Start asking to get answers

    Find the answer to your question by asking.

    Ask question

    Explore related questions

    See similar questions with these tags.