8
\$\begingroup\$

I'm creating an application that reads in data from a CSV file and creates an XML file using LXML. The below code works as expected. However, before I develop it further I would like to refactor it and make it more DRY. I will need to start creating the additional XML snippets for the other code types in the CSV file and I believe, in its current state, there is too much duplication.

I am attempting to learn Python while using a real user case, please can someone provide any guidance on how I could make the code more eloquent and more DRY. Do I need to be using classes and functions or is my approach OK for the type of file (XML) and creating?

import pandas as pd from lxml import etree as et import uuid df = pd.read_csv('assets.csv', sep=',') root = et.Element('SignallingSchemeData', xmlns='boo') timeframe = et.SubElement(root,'TimeFrame') timeframe.attrib["fileUID"] = str(uuid.uuid4()) timeframe.attrib["name"] = str("timeframe 1") for index, row in df.iterrows(): if row['CODE'] == 'S1': equipment = et.SubElement(root, 'Equipment') signalEquipment = et.SubElement(equipment, 'SignalEquipment') signalEquipment.attrib["fileUID"] = str(row["ID"]) signalEquipment.attrib["name"] = str(row["DESC"]) equipment = et.SubElement(root, 'Equipment') equipmentSupportEquipment = et.SubElement(equipment, 'EquipmentSupportEquipment') equipmentSupportEquipment.attrib["fileUID"] = str(uuid.uuid4()) equipmentSupportReference = et.SubElement(signalEquipment, 'EquipmentSupportReference').text = equipmentSupportEquipment.attrib["fileUID"] else: equipment = et.SubElement(root, 'Equipment') source = et.SubElement(root, 'Source') view = et.SubElement(root, 'View') view.attrib["fileUID"] = str(uuid.uuid4()) view.attrib["name"] = str('Coordinates') for index, row in df.iterrows(): viewCoordinatesList = et.SubElement(view, 'ViewCoordinatesList') viewCoordinates = et.SubElement(viewCoordinatesList, 'ViewCoordinates') itemFileUID = et.SubElement(viewCoordinates, 'ItemFileUID') itemFileUID.attrib['fileUID'] = str(row['ID']) viewCoordinatePairLon = et.SubElement(viewCoordinates, 'ViewCoordinatePair', name = 'longittude') viewCoordinatePairLon.attrib['Value'] = str(row['Y']) viewCoordinatePairLat = et.SubElement(viewCoordinates, 'ViewCoordinatePair', name = 'latitude') viewCoordinatePairLat.attrib['Value'] = str(row['X']) viewCoordinatePairH = et.SubElement(viewCoordinates, 'ViewCoordinatePair', name = 'height') viewCoordinatePairH.attrib['Value'] = str(row['Z']) et.ElementTree(root).write('test.xml', pretty_print=True, xml_declaration = True, encoding='UTF-8', standalone = None) 

assets.csv is as follows:

ID,CODE,ELR,TRID,DIR,MILEAGE,X,Y,Z,DESC,IMAGE 30734,S1,LEC1,1100,,008+0249 (9-1497),518169.12,185128.27,37.52,,Asset30734.jpg 31597,S10,LEC1,1100,,008+0286 (9-1460),518151.38,185157.1,36.7,IRJ at 8 miles and 0289 yards,Asset31597.jpg 31598,S10,LEC1,1100,,008+0286 (9-1460),518150.4,185156.11,36.7,IRJ at 8 miles and 0289 yards,Asset31598.jpg 31596,S10,LEC1,1100,,008+0287 (9-1458),518149.76,185157.14,36.7,IRJ at 8 miles and 0289 yards,Asset31596.jpg 
\$\endgroup\$

    1 Answer 1

    5
    \$\begingroup\$

    The real issue with repeating yourself here is that you are iterating over your dataframe twice when you don't have to. Fortunately, the devs for lxml make the elements behave like lists, so you can create everything that you need to in one go, then just do root.append(sub_element) at the end like you would with a list.

    Whenever you find yourself directly iterating over a pandas dataframe, that's usually a sign that it's not the correct data structure for the task. There are certainly exceptions, but I think using the csv module here might be a bit better of a fit. The reason being that you are technically iterating over the whole file with pd.read_csv, only to iterate again. csv.reader will allow you to iterate and process at the same time.

    First, declare all of the trees you might need:

    import csv from lxml import etree as et # configure your elements here. lxml acts as a list, # so make them their own trees until you are ready to merge root = et.Element('SignallingSchemeData', xmlns='boo') source = et.Element('Source') view = et.Element('View') 

    Note that I haven't used the SubElement factory class yet. Since Element can be treated as a list, we can leave the linking of elements to root to the end of the script.

    Moving forward:

    # you want this inserted first, so SubElement is # perfectly fine here timeframe = et.SubElement(root,'TimeFrame') timeframe.attrib["fileUID"] = str(uuid.uuid4()) # you don't need the str function call here, it's already a string timeframe.attrib["name"] = "timeframe 1" # open your file here with open('assets.csv') as fh: # the separator by default is ',' # so no need to specify it reader = csv.DictReader(fh) 

    Now, csv.DictReader allows you to keep the row[key] paradigm because it will create an OrderedDict for each row:

    with open('assets.csv') as fh: reader = csv.DictReader(fh) row = next(reader) row OrderedDict([('ID', '30734'), ('CODE', 'S1'), ('ELR', 'LEC1'), ('TRID', '1100'), ('DIR', ''), ('MILEAGE', '008+0249 (9-1497)'), ('X', '518169.12'), ('Y', '185128.27'), ('Z', '37.52'), ('DESC', ''), ('IMAGE', 'Asset30734.jpg')]) row['ID'] '30734' row['CODE'] 'S1' 

    This requires minimal changes from your original script. You can iterate over the reader just like before except the index doesn't have to be there anymore:

    with open('assets.csv') as fh: reader = csv.DictReader(fh) for row in reader: # rest of loop 

    Refactoring the loop

    Equipment

    First, you can create a function to handle everything that's happening in the if block. This makes the core of the loop more readable, as well as logically separates functions of the code. Overall, the core logic is good. Your variable names are descriptive, if a little lengthy, with the only change really being formatting:

    def add_equipment(root, row): '''Add equipment elements and supporting child elements to root''' equipment = et.SubElement(root, 'Equipment') # I'm grouping blocks of code/variables that group together signal_equipment = et.SubElement(equipment, 'SignalEquipment') signal_equipment.attrib["fileUID"] = row["ID"] signal_equipment.attrib["name"] = row["DESC"] # The names have been changed to snake_case with lowercased # letters, as is the naming conventions of variables and functions # in python (but not classes) equipment = et.SubElement(root, 'Equipment') eq_support = et.SubElement(equipment, 'EquipmentSupportEquipment') eq_support.attrib["fileUID"] = str(uuid.uuid4()) reference = et.SubElement(signal_equipment, 'EquipmentSupportReference').text = eq_support.attrib["fileUID"] 

    I don't need to return anything because root is being modified in-place by these operations.

    Now, the first portion of your loop looks like this:

    with open('assets.csv') as fh: reader = csv.DictReader(fh) for row in reader: if row['CODE'] == 'S1': # now all of this code sits # in a function, and it's much easier to tell what's # going on add_equipment(root, row) else: equipment = et.SubElement(root, 'Equipment') 

    View

    The view can be cleaned up in pretty much the same manner. Again, I don't really see anything egregious other than just some logical grouping and formatting. Your code and logical flow looks good, it's clear and easy to read

    def add_view(view, row): '''Add supporting child elements to view. Pass in row for value extraction''' view_list = et.SubElement(view, 'ViewCoordinatesList') view_coords = et.SubElement(view_list, 'ViewCoordinates') item_file_uid = et.SubElement(view_coords, 'ItemFileUID') item_file_uid.attrib['fileUID'] = row['ID'] longitude = et.SubElement(view_coords, 'ViewCoordinatePair', name='longittude') longitude.attrib['Value'] = row['Y'] latitude = et.SubElement(view_coords, 'ViewCoordinatePair', name='latitude') latitude.attrib['Value'] = row['X'] height = et.SubElement(view_coords, 'ViewCoordinatePair', name='height') height.attrib['Value'] = row['Z'] 

    You don't need the str to wrap row['X'] anymore because csv.DictReader doesn't infer datatypes, everything is a string.

    The loop as a whole now looks like this:

    with open('assets.csv') as fh: reader = csv.DictReader(fh) for row in reader: if row['CODE'] == 'S1': # refactor to a function here that # adds to root and takes row as an argument add_equipment(root, row) else: equipment = et.SubElement(root, 'Equipment') # add the rest of your loop in a function here to # deal with the view tree, again, take row as an argument add_view(view, row) 

    Now, your loop can just be read as a loop, and the logic inside the if blocks is abstracted into functions.

    Constructing your tree

    Now, to get everything in order, it's just a call to root.append:

    root.append(source) root.append(view) 

    And everything is in order with one pass over the csv file. Your last line can remain perfectly intact:

    et.ElementTree(root).write('test.xml', pretty_print=True, xml_declaration=True, encoding='UTF-8', standalone=None) 

    Timing Results

    Original Script

    python -m timeit 'import file' 10 loops, best of 3: 0.118 usec per loop 

    Refactored Script

    python -m timeit 'import file2' 10000000 loops, best of 3: 0.0809 usec per loop 
    \$\endgroup\$
    0

      Start asking to get answers

      Find the answer to your question by asking.

      Ask question

      Explore related questions

      See similar questions with these tags.