Parsing XML in Python

Question

The aim of this code is to return an object which contains all of the movies. Where attributes are not found, they need to return None rather than be undefined and raise an exception.

attrs is a list of all possible properties for that class. Each class loops through and tries to get the attribute. If the attribute is not there, setattr(self,attr,video.attrib.get(attr)) will return None.

The end result is an object which can be referenced like so:

Movielist.movies[0].Video.title would return "30 Minutes Or Less".

The XML structure is as follows:

Each movie is returned as an XML <Video>
Each movie has a <Media> subsection which may contain multiple elements
Within each <Media> subsection are <Part> subsections. There may also be multiple entries here.

I'm sure there must be a better / simpler way of doing this.

A sample of the Python source code follows, along with a sample of XML code returned / scanned by the Movielist class after that:

import httplib import xml.etree.ElementTree as ElementTree def getUrl(server,url): c=httplib.HTTPConnection(server.ip,server.port) c.request("GET",url) r=c.getresponse() return r.read() def getxml(server,serverPath,tag): responseData=getUrl(server,serverPath) e=ElementTree.XML(responseData) if not tag: return e else: return e.findall(tag) class Video(): class Media(): class Part(): def __init__(self,part): attrs=['key', 'duration', 'file', 'size', ] for attr in attrs: setattr(self,attr,part.attrib.get(attr)) def __init__(self,media): attrs=['id', 'duration', 'bitrate', 'width', 'height', 'aspectRatio', 'audioChannels', 'audioCodec', 'videoCodec', 'videoResolution', 'container', 'videoFrameRate', 'optimizedForStreaming', ] for attr in attrs: setattr(self,attr,media.attrib.get(attr)) xmlParts=media.findall("Part") self.parts=[] for part in xmlParts: p=self.Part(part) self.parts.append(p) def __init__(self,server,video): # Get the detail self.key=video.attrib.get('key') x=getxml(server,self.key,"Video") video=x[0] self.video=video attrs=['ratingKey', 'key', 'studio', 'type', 'title', 'contentRating', 'summary', 'rating', 'year', 'tagline', 'thumb', 'art', 'duration', 'originallyAvailableAt', 'addedAt', 'updatedAt', 'titleSort', 'viewCount', 'viewOffset', 'guid', 'lastViewedAt', 'index', ] for attr in attrs: setattr(self,attr,video.attrib.get(attr)) self.media=None for subitem in self.video: if subitem.tag=="Media": self.media=self.Media(subitem) class Server: def __init__(self,ip,port): self.ip=ip self.port=port def movielisturl(self,key): return "/library/sections/%s/all" % key class Movielist: def __init__(self,server,key): x=getxml(server,server.movielisturl(key),"Video") print x self.movies=[] for item in x: m=Video(server,item) self.movies.append(m) def main(): s=Server("localhost",32400) l=Movielist(s,4) for video in l.movies: print video.title if __name__ == "__main__": main()

XML Sample

<?xml version="1.0" ?> <MediaContainer art="/:/resources/movie-fanart.jpg" identifier="com.plexapp.plugins.library" mediaTagPrefix="/system/bundle/media/flags/" mediaTagVersion="1323514612" size="260" title1="Movies" viewGroup="movie" viewMode="65592"> <Video addedAt="1323265351" art="/library/metadata/242/art?t=1324432778" contentRating="R" duration="4980000" key="/library/metadata/242" originallyAvailableAt="2011-08-12" ratingKey="242" studio="Media Rights Capital" summary="30 Minutes or Less is a 2011 American action comedy crime film directed by Ruben Fleischer starring Jesse Eisenberg, Aziz Ansari, Danny McBride and Nick Swardson. It is produced by Columbia Pictures and funded by Media Rights Capital." thumb="/library/metadata/242/thumb?t=1324432778" title="30 Minutes or Less" type="movie" updatedAt="1324432778" year="2011"> <Media aspectRatio="2.35" audioChannels="6" audioCodec="dca" bitrate="1500" container="mkv" duration="4983000" height="534" id="242" optimizedForStreaming="0" videoCodec="h264" videoFrameRate="24p" videoResolution="720" width="1280"> <Part duration="4983000" file="/mnt/raid/Entertainment/Movies/30 Minutes Or Less (2011).mkv" key="/library/parts/249/file.mkv" size="3521040154"/> </Media> <Genre tag="Comedy"/> <Genre tag="Action"/> <Writer tag="Michael Diliberti"/> <Director tag="Ruben Fleischer"/> <Country tag="USA"/> <Role tag="Jesse Eisenberg"/> <Role tag="Danny McBride"/> <Role tag="Nick Swardson"/> </Video> <Video addedAt="1323265356" art="/library/metadata/375/art?t=1324432778" contentRating="PG-13" duration="7140000" key="/library/metadata/375" originallyAvailableAt="2010-06-11" rating="7.8" ratingKey="375" studio="Dune Entertainment" summary="The A-Team is an American action film based on the television series of the same name. It was released in cinemas in the United States on June 11, 2010, by 20th Century Fox. The film was directed by Joe Carnahan and produced by Stephen J. Cannell and the Scott brothers Ridley and Tony. The film has been in development since the mid 1990s, having gone through a number of writers and story ideas, and being put on hold a number of times. Producer Stephen J. Cannell wished to update the setting, perhaps using the first Gulf War as part of the backstory. The film stars Liam Neeson, Bradley Cooper, Quinton Jackson, and Sharlto Copley as The A-Team, former U.S. Army Rangers, imprisoned for a crime they did not commit. They escape and set out to clear their names. Jessica Biel, Patrick Wilson, and Brian Bloom fill supporting roles. The film received mixed reviews from critics and performed slightly below expectations at the box office. The reception from the cast of the original television series was also mixed." tagline="There Is No Plan B" thumb="/library/metadata/375/thumb?t=1324432778" title="The A-Team" titleSort="A-Team" type="movie" updatedAt="1324432778" year="2010"> <Media aspectRatio="2.35" audioChannels="6" audioCodec="dca" bitrate="1500" container="mkv" duration="8013000" height="544" id="375" optimizedForStreaming="0" videoCodec="h264" videoFrameRate="24p" videoResolution="720" width="1280"> <Part duration="8013000" file="/mnt/raid/Entertainment/Movies/The A-Team Extended 720P Bluray X264-Zmg.mkv" key="/library/parts/382/file.mkv" size="8531243656"/> </Media> <Genre tag="Thriller"/> <Genre tag="Action"/> <Writer tag="Skip Woods"/> <Writer tag="Joe Carnahan"/> <Director tag="Joe Carnahan"/> <Country tag="USA"/> <Role tag="Patrick Wilson"/> <Role tag="Liam Neeson"/> <Role tag="Sharlto Copley"/> </Video> </MediaContainer>

Winston Ewert · Accepted Answer · 2011-12-21 20:56:11Z

import httplib import xml.etree.ElementTree as ElementTree def getUrl(server,url): c=httplib.HTTPConnection(server.ip,server.port) c.request("GET",url) r=c.getresponse() return r.read()

Firstly, the function doesn't really match its name. getUrl makes me think it should return a URL, not take my url and download it. Secondly, there is function urllib.urlopen which does everything this function does minus the last line. And it takes a url as an argument, so you can combine the two parameters here.

def getxml(server,serverPath,tag): responseData=getUrl(server,serverPath) e=ElementTree.XML(responseData)

But some spaces around those =. I also think you shouldn't use single letter variables name (usually).

 if not tag:

When checking for None I recommend tag is not None just to be explicit.

 return e else: return e.findall(tag) class Video():

Either drop the parens here, or put object in them.

 class Media(): class Part():

I'm not a fan of nesting classes (usually). I'd separate them out.

 def __init__(self,part): attrs=['key', 'duration', 'file', 'size', ] for attr in attrs: setattr(self,attr,part.attrib.get(attr))

I would recommend against loading from XML in your classes' constructor. It ties your internal objects too much into the structure of the xml file. I'd suggest having external function which extract the data from the XML and then pass it into the constructor.

 def __init__(self,media):

Here's the reason why nesting classes is problematic. Its hard to tell what class I'm in now.

 attrs=['id', 'duration', 'bitrate', 'width', 'height', 'aspectRatio', 'audioChannels', 'audioCodec', 'videoCodec', 'videoResolution', 'container', 'videoFrameRate', 'optimizedForStreaming', ] for attr in attrs: setattr(self,attr,media.attrib.get(attr))

We see this piece of code exactly in the previous constructor. You should move it into a function.

 xmlParts=media.findall("Part") self.parts=[] for part in xmlParts: p=self.Part(part) self.parts.append(p)

Instead you can do self.parts = map(self.Part, media.findall("Part")). That will make this less complicated here.

 def __init__(self,server,video): # Get the detail self.key=video.attrib.get('key')

Do you really want to get None here if the key is missing?

 x=getxml(server,self.key,"Video") video=x[0]

Any reason not to combine those two lines? Your constructor fetches the data from the server. As with the xml parsing I think this is best done outside of the constructor.

 self.video=video

Why do you want to keep a reference the xml node? Shouldn't you just toss it once you've finished reading from it?

 attrs=['ratingKey', 'key', 'studio', 'type', 'title', 'contentRating', 'summary', 'rating', 'year', 'tagline', 'thumb', 'art', 'duration', 'originallyAvailableAt', 'addedAt', 'updatedAt', 'titleSort', 'viewCount', 'viewOffset', 'guid', 'lastViewedAt', 'index', ] for attr in attrs: setattr(self,attr,video.attrib.get(attr)) self.media=None for subitem in self.video: if subitem.tag=="Media": self.media=self.Media(subitem)

Is there a find function or something to make this simpler?

class Server:

The class name should probably be more specific then just Server

 def __init__(self,ip,port): self.ip=ip self.port=port def movielisturl(self,key): return "/library/sections/%s/all" % key

I'd probably have the server class define a method that returns the XML rather then just providing the URL. That the way the server can actually fetch the XML anyway it likes not just a http connection.

class Movielist: def __init__(self,server,key): x=getxml(server,server.movielisturl(key),"Video") print x self.movies=[] for item in x: m=Video(server,item) self.movies.append(m)

I wouldn't have a MovieList class. I'd just have a list of movies. I'd have a function that returns a list of movies from a particular server.

def main(): s=Server("localhost",32400) l=Movielist(s,4) for video in l.movies:

You are mixing video and movie, I recommend picking a terminology and sticking with it.

 print video.title if __name__ == "__main__": main()

San4ez · Accepted Answer · 2011-12-22 06:08:49Z

You may use lxml lib

from cStringIO import StringIO from lxml import etree xml = StringIO(''' YOUR XML ''') def main(): tree = etree.parse(xml) print tree.xpath("//Video/@title")

Outputs:

['30 Minutes or Less', 'The A-Team']

Stack Exchange Network

Parsing XML in Python

2 Answers 2

Hot Network Questions

Parsing XML in Python

2 Answers 2

Related

Hot Network Questions