Download, unzip and parse xml into pandas dataframe

Question

I have a python script that downloads, unzip and parses an XML file published by a Canadian institution. Only some very specific tags are extracted and then all put into a pandas dataframe for later processing.

Everything works well. I just wonder if there is room for improvement here, specially in the parsing part. I am not sure if the nested for I use are a good idea or there is a better and cleaner way to parse.

import requests import zipfile import os import glob from lxml import etree from io import StringIO, BytesIO import pandas as pd import xml.etree.ElementTree as ET def download_file(url,filename): r = requests.get(url, allow_redirects=True) open(filename, 'wb').write(r.content) def unzip_and_delete(filename): zip_file = zipfile.ZipFile(filename,'r') zip_file.extractall() zip_file.close() os.remove(filename) def parse_xml_fields(file, base_tag, tag_list,final_list): root = etree.parse(file) nodes = root.findall("//{}".format(base_tag)) for node in nodes: item = {} for tag in tag_list: if node.find(".//{}".format(tag)) is not None: item[tag] = node.find(".//{}".format(tag)).text.strip() final_list.append(item) # My variables field_list = ["MsbRegistrationNumber","StatusDescriptionEnglish","Surname","GivenName","MiddleName","Name","StreetAddress"] entities_list = [] download_file('http://www10.fintrac-canafe.gc.ca/msb-esm/public/msb-search/zipdownload-eng/', 'fintrac.zip') unzip_and_delete('fintrac.zip') parse_xml_fields("MsbRegistryPublicDataFile.xml", "MsbInformation", field_list, entities_list) df = pd.DataFrame(entities_list, columns=field_list) df.to_excel("Canada_MSB_List.xlsx")

alecxe · Accepted Answer · 2017-08-07 13:17:20Z

You are traversing the tree searching for the same element twice - once when checking if it exists and then to get the text. You can do it once and find a text with .findtext() directly:

for node, tag in itertools.product(nodes, tag_list): node_text = node.findtext(".//{}".format(tag)) if node_text is not None: final_list.append({tag: node_text.strip()})

Other notes:

remove unused imports and structure them according to PEP8 (reference)
use with context managers when opening a file and zip file
it would be cleaner if parse_xml_fields() would return a list of results instead of modifying an input list

Thanks, aside from the main explanation about traversing the tree, I found useful your other notes to get my code cleaner. — telex-wap, CommentedAug 7, 2017 at 15:10

MarianD · Accepted Answer · 2017-08-07 13:05:29Z

You may import itertools and then replace the part

for node in nodes: item = {} for tag in tag_list: if node.find(".//{}".format(tag)) is not None: item[tag] = node.find(".//{}".format(tag)).text.strip() final_list.append(item)

with

for node, tag in itertools.product(nodes, tag_list): if node.find(".//{}".format(tag)) is not None: final_list.append({tag: node.find(".//{}".format(tag)).text.strip()})

as itertools.product() creates a cartesian product of your two lists (and so the temporary dictionary item is no longer needed).

Thanks but I see something strange: this itertools.product approach is creating a new dictionary per every tag match found, when my previous method would create a dictionary containing all tag matches for one node. So for instance, if an entity has 7 tags, rather than one dictionary with 7 keys, I am getting 7 dictionaries with one key. — telex-wap, CommentedAug 7, 2017 at 15:15
For such small number of tags and nodes, when moreover the most consuming time are network transfers, it has no importance - the readability has, however (IMHO). — MarianD, CommentedAug 7, 2017 at 16:15

Stack Exchange Network

Download, unzip and parse xml into pandas dataframe

2 Answers 2

Hot Network Questions

Download, unzip and parse xml into pandas dataframe

2 Answers 2

Related

Hot Network Questions