\$\begingroup\$

I am very new to Python, and also this is my first time trying to parse XML.
I am interested in information within str elements. I can identify that information using the str@name attribute value.

    def get_cg_resources(pref_label, count=10):
        r = request_that_has_the_xml
        ns = {'ns': "http://www.loc.gov/zing/srw/"}
        tree = ET.ElementTree(ET.fromstring(r.text))
        records = []
        for elem in tree.iter(tag='{http://www.loc.gov/zing/srw/}record'):
            record = {
                'title': '',
                'source': '',
                'snippet': '',
                'link': '',
                'image': '',
                'adapter': 'CG'
            }
            for value in elem.iter(tag='str'):
                attr = value.attrib['name']
                if attr == 'dc.title':
                    record['title'] = value.text
                elif attr == 'authority_name':
                    record['source'] = value.text
                elif attr == 'dc.description':
                    record['snippet'] = value.text
                elif attr == 'dc.related.link':
                    record['link'] = value.text
                elif attr == 'cached_thumbnail':
                    img_part = value.text
                    record['image'] = "http://urlbase%s" % img_part
            records.append(record)
        return records

Is this approach correct/efficient for extracting the information I need? Should I be searching for the str elements differently?

Any suggestions for improvements are welcome.

\$\endgroup\$
  • Is request_that_has_the_xml a global variable? Why isn't it a parameter?
    – Attilio
    Commented Mar 31, 2015 at 20:08
  • You can ignore that line, just know that it gives the XML string.
    – latusaki
    Commented Apr 1, 2015 at 8:04

1 Answer

\$\begingroup\$
    def get_cg_resources(pref_label, count=10):
        r = request_that_has_the_xml
        ns = {'ns': "http://www.loc.gov/zing/srw/"}
        tree = ET.ElementTree(ET.fromstring(r.text))

You don't need an ElementTree to extract data from XML; the Element returned by ET.fromstring is enough.

 root = ET.fromstring(r.text) 
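A minimal sketch of that point, using a made-up XML string (`SAMPLE_XML` is a hypothetical stand-in for `r.text`, not the real response): `ET.fromstring` already returns the root `Element`, and that object can be searched directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for r.text; the real response will look different.
SAMPLE_XML = "<root><str name='dc.title'>A title</str></root>"

# fromstring returns the root Element, so no ElementTree wrapper is needed.
root = ET.fromstring(SAMPLE_XML)

# iter() walks the element and all of its descendants.
titles = [e.text for e in root.iter("str")]
print(titles)
```

An ElementTree wrapper is mostly useful when you need to read from or write to a file.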

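If 'str' elements occur only inside 'record' elements, you don't have to find the 'record' elements first; you can search for 'str' directly. Both iter and iterfind with a './/' path visit all descendants, not just direct children.

The ns dict maps a prefix to the namespace URI, so you don't have to spell out the URI in every tag: write the tag as 'prefix:tag' and pass the dict along. Note that iter accepts only a tag, so the dict has to be passed to iterfind or findall. This assumes the 'str' elements are in that namespace; if they are not, plain 'str' as in your code is right.

    for elem in root.iterfind('.//ns:str', ns):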
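A minimal sketch of a namespace-aware search, assuming a made-up document in which the 'str' elements carry the same namespace as 'record' (`SAMPLE_XML` is hypothetical). `Element.iter()` takes only a tag, so the prefix dict is passed to `iterfind()` here; the `{uri}tag` form is shown with `iter()` for comparison.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; assumes 'str' shares the namespace of 'record'.
SAMPLE_XML = """
<srw:response xmlns:srw="http://www.loc.gov/zing/srw/">
  <srw:record><srw:str name="dc.title">First</srw:str></srw:record>
  <srw:record><srw:str name="dc.title">Second</srw:str></srw:record>
</srw:response>
"""

ns = {"ns": "http://www.loc.gov/zing/srw/"}
root = ET.fromstring(SAMPLE_XML)

# './/' searches all descendants; the 'ns' prefix is resolved through the dict.
via_prefix = [e.text for e in root.iterfind(".//ns:str", ns)]

# The same search with iter(), which needs the fully qualified '{uri}tag' form.
via_braces = [e.text for e in root.iter("{http://www.loc.gov/zing/srw/}str")]

print(via_prefix, via_braces)
```

The prefix used in the dict ('ns') does not have to match the prefix used in the document ('srw'); only the URI matters.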

If there are 'str' tags that are contained in other tags that you don't want, then you have to first find 'record' tags.

        records = []
        # iterfind, not iter: only iterfind/findall accept the namespace dict
        for elem in root.iterfind('.//ns:record', ns):
            record = {
                'title': '',
                'source': '',
                'snippet': '',
                'link': '',
                'image': '',
                'adapter': 'CG'
            }

The record can also be initialized more compactly, as follows:

            record = dict.fromkeys(['title', 'source', 'snippet', 'link', 'image'], '')
            record['adapter'] = 'CG'
            # iterfind, not iter: only iterfind/findall accept the namespace dict
            for value in elem.iterfind('.//ns:str', ns):
                attr = value.attrib['name']
                if attr == 'dc.title':
                    record['title'] = value.text
                elif attr == 'authority_name':
                    record['source'] = value.text
                elif attr == 'dc.description':
                    record['snippet'] = value.text
                elif attr == 'dc.related.link':
                    record['link'] = value.text
                elif attr == 'cached_thumbnail':
                    img_part = value.text
                    record['image'] = "http://urlbase%s" % img_part
            records.append(record)
        return records

The code above reads the 'name' attribute of each 'str' element inside a 'record' element and uses it to decide which field of the record the element's text is stored in.
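Putting the pieces together, here is a runnable sketch under stated assumptions: the function takes the XML text as a parameter (the name `parse_records` is hypothetical), the 'str' elements are assumed to be in the srw namespace, and the sample document is made up. It also swaps the if/elif chain for a lookup table, which is an alternative and not part of the original code.

```python
import xml.etree.ElementTree as ET

NS = {"ns": "http://www.loc.gov/zing/srw/"}

# Maps the 'name' attribute of a str element to the record key it fills.
FIELD_BY_NAME = {
    "dc.title": "title",
    "authority_name": "source",
    "dc.description": "snippet",
    "dc.related.link": "link",
}


def parse_records(xml_text):
    """Return one dict per record element found in xml_text."""
    root = ET.fromstring(xml_text)
    records = []
    for elem in root.iterfind(".//ns:record", NS):
        record = dict.fromkeys(["title", "source", "snippet", "link", "image"], "")
        record["adapter"] = "CG"
        for value in elem.iterfind(".//ns:str", NS):
            name = value.get("name")
            if name in FIELD_BY_NAME:
                record[FIELD_BY_NAME[name]] = value.text
            elif name == "cached_thumbnail":
                # "http://urlbase" is the placeholder used in the question.
                record["image"] = "http://urlbase%s" % value.text
        records.append(record)
    return records


# Hypothetical sample; the real response will look different.
SAMPLE_XML = """
<srw:response xmlns:srw="http://www.loc.gov/zing/srw/">
  <srw:record>
    <srw:str name="dc.title">First</srw:str>
    <srw:str name="cached_thumbnail">/img/1.png</srw:str>
  </srw:record>
</srw:response>
"""

records = parse_records(SAMPLE_XML)
print(records)
```

Using `value.get("name")` instead of `value.attrib['name']` means a 'str' element without a 'name' attribute is skipped rather than raising a KeyError.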

If you want a generator, replace records.append(record) with yield record and delete both return records and records = []. This saves memory when there are many records, because they are produced one at a time instead of being collected in a list.
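A minimal sketch of the generator variant, reduced to the title field to keep it short. The function name `iter_records`, the sample XML, and the assumption that 'str' is in the srw namespace are all illustrative.

```python
import xml.etree.ElementTree as ET

NS = {"ns": "http://www.loc.gov/zing/srw/"}


def iter_records(xml_text):
    """Yield one dict per record instead of building a list."""
    root = ET.fromstring(xml_text)
    for elem in root.iterfind(".//ns:record", NS):
        record = {"title": "", "adapter": "CG"}
        for value in elem.iterfind(".//ns:str", NS):
            if value.get("name") == "dc.title":
                record["title"] = value.text
        yield record  # replaces records.append(record) and return records


# Hypothetical sample; the real response will look different.
SAMPLE_XML = """
<srw:response xmlns:srw="http://www.loc.gov/zing/srw/">
  <srw:record><srw:str name="dc.title">First</srw:str></srw:record>
  <srw:record><srw:str name="dc.title">Second</srw:str></srw:record>
</srw:response>
"""

gen = iter_records(SAMPLE_XML)
first = next(gen)      # only the first record has been built so far
remaining = list(gen)  # consume the rest
print(first, remaining)
```

Note that `ET.fromstring` still parses the whole document into memory; the generator only avoids holding every result dict at once.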

\$\endgroup\$
