Python XML - extracting information

Question

I am very new to Python, and also this is my first time trying to parse XML.
I am interested in information within str elements. I can identify that information using the str@name attribute value.

    def get_cg_resources(pref_label, count=10):
        r = request_that_has_the_xml
        ns = {'ns': "http://www.loc.gov/zing/srw/"}
        tree = ET.ElementTree(ET.fromstring(r.text))
        records = []
        for elem in tree.iter(tag='{http://www.loc.gov/zing/srw/}record'):
            record = {
                'title': '',
                'source': '',
                'snippet': '',
                'link': '',
                'image': '',
                'adapter': 'CG'
            }
            for value in elem.iter(tag='str'):
                attr = value.attrib['name']
                if(attr == 'dc.title'):
                    record['title'] = value.text
                elif(attr == 'authority_name'):
                    record['source'] = value.text
                elif(attr == 'dc.description'):
                    record['snippet'] = value.text
                elif(attr == 'dc.related.link'):
                    record['link'] = value.text
                elif(attr == 'cached_thumbnail'):
                    img_part = value.text
                    record['image'] = "http://urlbase%s" % img_part
            records.append(record)
        return records

Is this approach correct/efficient for extracting the information I need? Should I be searching for the str elements differently?

Any suggestions for improvements are welcome.

Is request_that_has_the_xml a global variable? Why isn't it a parameter? - Attilio, commented Mar 31, 2015 at 20:08
You can ignore that line; just know that it gives the XML string. - latusaki, commented Apr 1, 2015 at 8:04

Nizam Mohamed · Accepted Answer · 2015-04-01 14:41:47Z

    def get_cg_resources(pref_label, count=10):
        r = request_that_has_the_xml
        ns = {'ns': "http://www.loc.gov/zing/srw/"}
        tree = ET.ElementTree(ET.fromstring(r.text))

You don't need an ElementTree to extract data from XML; an Element is enough. ET.fromstring already returns the root Element:

    root = ET.fromstring(r.text)
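To illustrate the point, here is a minimal sketch showing that the Element returned by ET.fromstring can be searched directly. The sample XML is hypothetical and much simpler than the real response.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, not the asker's real response.
xml_text = "<response><record><str name='dc.title'>A title</str></record></response>"

root = ET.fromstring(xml_text)   # returns the root Element
tree = ET.ElementTree(root)      # wrapper, mainly useful for file-level work such as write()

# iter() is available on the Element itself, so the wrapper adds nothing here.
titles = [el.text for el in root.iter("str")]
same_root = tree.getroot() is root
```

The wrapper simply holds a reference to the same root Element, so wrapping it only adds a step when all you do is search.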

If 'str' elements occur only inside 'record' elements, you don't have to find the 'record' elements first; you can search for 'str' directly. The iter method walks the whole subtree recursively, not just the direct children.

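The ns dict maps a prefix to the namespace URI, so you don't have to spell out the URI in every tag; 'prefix:tag' plus the dict is enough. Note that the namespaces dict is accepted by find, findall, findtext and iterfind, but not by iter, which takes only a tag. To search the whole subtree with a prefix, use iterfind with a './/' path. This assumes the 'str' elements are in that namespace; the code in the question matches them without one.

    for elem in root.iterfind('.//ns:str', ns):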
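As a minimal sketch of a namespace-aware search: iterfind accepts a prefix map, while iter needs the fully qualified '{uri}tag' form. The sample document is hypothetical and assumes every element lives in the SRW namespace, which may not hold for the real response.

```python
import xml.etree.ElementTree as ET

SRW = "http://www.loc.gov/zing/srw/"
ns = {"ns": SRW}

# Hypothetical sample: the default namespace puts every element in SRW.
xml_text = (
    '<searchRetrieveResponse xmlns="http://www.loc.gov/zing/srw/">'
    '<record><str name="dc.title">First</str></record>'
    '<record><str name="dc.title">Second</str></record>'
    '</searchRetrieveResponse>'
)
root = ET.fromstring(xml_text)

# Prefix form: './/' searches all descendants, the dict resolves 'ns:'.
via_prefix = [el.text for el in root.iterfind(".//ns:str", ns)]

# Equivalent with iter(), which takes only a fully qualified tag name.
via_clark = [el.text for el in root.iter("{%s}str" % SRW)]
```

Both lists hold the same elements in document order; the prefix form is usually easier to read once more than one tag is involved.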

If there are 'str' elements inside other elements that you don't want, then you have to find the 'record' elements first.

    records = []
    for elem in root.iterfind('.//ns:record', ns):
        record = {
            'title': '',
            'source': '',
            'snippet': '',
            'link': '',
            'image': '',
            'adapter': 'CG'
        }

The record dict can be initialized more compactly as follows:

        record = dict.fromkeys(['title', 'source', 'snippet', 'link', 'image'], '')
        record['adapter'] = 'CG'
        for value in elem.iterfind('.//ns:str', ns):
            attr = value.attrib['name']
            if attr == 'dc.title':
                record['title'] = value.text
            elif attr == 'authority_name':
                record['source'] = value.text
            elif attr == 'dc.description':
                record['snippet'] = value.text
            elif attr == 'dc.related.link':
                record['link'] = value.text
            elif attr == 'cached_thumbnail':
                img_part = value.text
                record['image'] = "http://urlbase%s" % img_part
        records.append(record)
    return records

The above code extracts the text of each 'str' element contained in a 'record' element, using the value of the element's 'name' attribute to decide which key of the record it fills.

If you want a generator, simply replace records.append(record) with yield record and delete both records = [] and return records. This is more memory-efficient when the result is huge, because records are produced one at a time instead of being collected in a list.
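The generator variant might look like the sketch below. The function name iter_records, the reduced set of keys, and the sample XML are assumptions made for illustration; the sample also assumes the 'str' elements are in the SRW namespace.

```python
import xml.etree.ElementTree as ET

NS = {"ns": "http://www.loc.gov/zing/srw/"}


def iter_records(xml_text):
    """Yield one dict per record instead of building a full list."""
    root = ET.fromstring(xml_text)
    for rec in root.iterfind(".//ns:record", NS):
        record = dict.fromkeys(["title", "source"], "")
        for value in rec.iterfind(".//ns:str", NS):
            name = value.get("name")
            if name == "dc.title":
                record["title"] = value.text
            elif name == "authority_name":
                record["source"] = value.text
        yield record  # handed to the caller as soon as it is complete


# Hypothetical sample response with two records.
sample = (
    '<searchRetrieveResponse xmlns="http://www.loc.gov/zing/srw/">'
    '<record><str name="dc.title">First</str>'
    '<str name="authority_name">Example Authority</str></record>'
    '<record><str name="dc.title">Second</str></record>'
    '</searchRetrieveResponse>'
)

first = next(iter_records(sample))     # only the first record is built here
everything = list(iter_records(sample))
```

A caller that needs all records at once can still wrap the call in list(), so switching to a generator does not take anything away.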
