Python Pandas - Parsing XML File



XML stands for "Extensible Markup Language" and it is similar to HTML in structure but serves a different purpose. XML is widely used as a data representation format due to its flexibility.

The Pandas library provides the tools for reading and writing XML data effectively. The read_xml() function of the Pandas library allows you to parse XML strings, files, or URLs directly into a Pandas DataFrame.

In this tutorial, we will learn about how to parse XML data using Pandas, including reading XML data, handling nested structures, extracting attributes, and customizing column names.

The Pandas read_xml() Function

The read_xml() function reads XML data from a string, file, or URL and converts it into a Pandas DataFrame. This function supports various parameters for handling complex XML structures and attributes.

Following is the syntax of the read_xml() function −

 pandas.read_xml(path_or_buffer, *, xpath='./*', namespaces=None, elems_only=False, attrs_only=False, ...) 

Key Parameters,

  • path_or_buffer: Accepts an XML string, file path, URL, or file-like object containing the XML data.

  • xpath: Specifies the XML path to parse specific nodes in the XML. Default is './*'.

  • namespaces: Dictionary of namespaces used in the XML file.

  • elems_only: If True, parses only child elements.

  • attrs_only: If True, parses only attributes.

  • iterparse: For memory-efficient parsing of very large XML files.

  • parser: Specifies the parser (lxml or etree). Default is 'lxml'

You can get more details about this method from the following tutorial pandas.read_xml().

Reading an XML String

The pandas.read_xml() function is used to read XML data directly into a Pandas DataFrame. The function can handle various XML structures.

Example

This example demonstrates how to parse an XML string representing contact information into a Pandas DataFrame. Each <contact> node contains elements such as name, company, and phone. Using pandas.read_xml() function we will extract this data into a DataFrame.

 from io import StringIO import pandas as pd # Create a string representing XML Data xml = """<contact-info> <contact1> <name>Tanmay </name> <company>TutorialsPoint</company> <phone>(011) 123-4567</phone> </contact1> <contact2> <name>Manisha </name> <company>TutorialsPoint</company> <phone>(011) 789-4567</phone> </contact2> </contact-info>""" # Parse the String represented XML data df = pd.read_xml(StringIO(xml)) print(df) 

Following is the output of the above code −

namecompanyphone
0TanmayTutorialsPoint(011) 123-4567
1ManishaTutorialsPoint(011) 789-4567

Parsing Nested XML Structures

For deeply nested or complex XML files, you can use the xpath and namespaces parameters to extract specific nodes.

Example

This example shows how to parse a nested XML structure representing a bookstore. Each <book> node has child elements like title, author, year, and price. By using the xpath parameter we can easily locate and extract these <book> nodes and their contents into a DataFrame.

 import pandas as pd from io import StringIO # Create a String representing XML data xml = """<?xml version="1.0" encoding="UTF-8"?> <bookstore> <book category="cooking"> <title lang="en">Everyday Italian</title> <author>Giada De Laurentiis</author> <year>2005</year> <price>30.00</price> </book> <book category="children"> <title lang="en">Harry Potter</title> <author>J K. Rowling</author> <year>2005</year> <price>29.99</price> </book> <book category="web"> <title lang="en">Learning XML</title> <author>Erik T. Ray</author> <year>2003</year> <price>39.95</price> </book> </bookstore>""" # Parse the XML data into a DataFrame df = pd.read_xml(StringIO(xml), xpath=".//book") # Display the Output DataFrame print('Output DataFrame from XML:') print(df) 

Following is the output of the above code −

 Output DataFrame from XML: 
categorytitleauthoryearprice
0cookingEveryday ItalianGiada De Laurentiis200530.00
1childrenHarry PotterJ K. Rowling200529.99
2webLearning XMLErik T. Ray200339.95

Reading XML Attributes Only

To extract only attributes of the XML data, you can set boolean value true to the attrs_only parameter of the pandas.read_xml() function.

Example

The following example demonstrates parsing only attributes of the XML data into the pandas object using the attrs_only parameter.

 import pandas as pd from io import StringIO # Create a String representing XML data xml = """<?xml version="1.0" encoding="UTF-8"?> <bookstore> <book category="cooking"> <title lang="en">Everyday Italian</title> <author>Giada De Laurentiis</author> <year>2005</year> <price>30.00</price> </book> <book category="children"> <title lang="en">Harry Potter</title> <author>J K. Rowling</author> <year>2005</year> <price>29.99</price> </book> <book category="web"> <title lang="en">Learning XML</title> <author>Erik T. Ray</author> <year>2003</year> <price>39.95</price> </book> </bookstore>""" # Parse the String represented XML data df = pd.read_xml(StringIO(xml), attrs_only=True) print(df) 

Following is the output of the above code −

 category 0 cooking 1 children 2 web 

Customizing Column Names While Reading XML

The names parameter of the Pandas read_xml() function allows you to customize the column names during the parsing process. This is useful for large files with generic or duplicate element names.

Example

The following example demonstrates how to customize column names while parsing XML data using Pandas.

 from io import StringIO import pandas as pd # Create a string representing XML Data xml = """<?xml version="1.0" encoding="UTF-8"?> <bookstore> <book category="cooking"> <title lang="en">Everyday Italian</title> <author>Giada De Laurentiis</author> <year>2005</year> <price>30.00</price> </book> <book category="children"> <title lang="en">Harry Potter</title> <author>J K. Rowling</author> <year>2005</year> <price>29.99</price> </book> <book category="web"> <title lang="en">Learning XML</title> <author>Erik T. Ray</author> <year>2003</year> <price>39.95</price> </book> </bookstore>""" # Parse the String represented XML data df = pd.read_xml(StringIO(xml), names=['Book_Category', 'Book_Name', 'Author', 'Published_year', 'Price']) print(df) 

Following is the output of the above code −

Book_Category
Book_NameAuthorPublished_yearPrice
0cookingEveryday ItalianGiada De Laurentiis200530.00
1childrenHarry PotterJ K. Rowling200529.99
2webLearning XMLErik T. Ray200339.95
Advertisements
close