How to extract data between two different xml tags

Question

I have looked but haven't been able to find anyone else with the same sort of problem I have.

I have an xml file like this:

<ID>1</ID><data>asdf</data><data2>asdf</data2><dataX>asdf</dataX><dateAccessed>somedate</dateAccessed><ID>2</ID><data>asdf</data><data2>asdf</data2><dataX>asdf</dataX><dateAccessed>somedate</dateAccessed><ID>3</ID><data>asdf</data><data2>asdf</data2><dataX>asdf</dataX><dateAccessed>somedate</dateAccessed><ID>4</ID><data>asdf</data><data2>asdf</data2><dataX>asdf</dataX><dateAccessed>somedate</dateAccessed>

Basically a whole bunch of data all on one line, no line breaks. I need to extract the info (preferably as-is, with tags intact) between a specific <ID> tag (e.g. <ID>2</ID>) and the very next </dateAccessed> tag. I have about 50 files to check for a particular ID and the related data that follows it. I get that this is not standard; there is no nesting.

I originally tried to do this using grep and sed, but I just get the whole file returned, which seems odd to me. Can't I just treat this like a text file?

EDIT:

I didn't realise the formatter removed text enclosed in < and >, so after re-reading my question this morning, I realised it was asking something completely different. TL;DR: I need what is between a specific value in ID tags and the next closing dateAccessed tag, not between matching opening and closing tags (i.e. not between ID and /ID).

So I can get something like this result:

<ID>2</ID><data>asdf</data><data2>asdf</data2><dataX>asdf</dataX><dateAccessed>somedate</dateAccessed>

I can't help but feel this is the "wrong question". If you're working with XML files you should really be using an XML parser (such as xmlstarlet). I appreciate this won't give you an unbalanced segment, and so is not a suitable answer to your question as asked. But trying to treat XML as text will almost certainly lead to unintended consequences down the road. It's not a good place to be. Really. (Comment by Chris Davies, Oct 18, 2017 at 7:14)

igal · Accepted Answer · 2018-07-11 22:23:26Z

As noted in the comments, your data isn't well-formed XML and it isn't completely clear what the structure of your document is, e.g. judging by your example data, it looks like you have no nested elements - is that really the case?

With that caveat in mind, here's a Python script that uses the BeautifulSoup4 parsing library to do what you want (i.e. it produces the desired output data for the given example input data):

#!/usr/bin/env python
# coding: ascii
"""extract.py

Extract everything between two XML tags
in a (possibly poorly formed) XML document."""

from bs4 import BeautifulSoup
import sys

# Set the opening tag name and value
opening_name = "ID"
opening_text = "2"

# Set the closing tag name
closing_name = "dateAccessed"

# Get the XML data from a file and instantiate a BeautifulSoup parser
# We add a root node because the input data is missing a root
with open(sys.argv[1], 'r') as xmlfile:
    xmldoc = "<root>" + xmlfile.read() + "</root>"
    soup = BeautifulSoup(xmldoc, 'xml')

# Iterate through the elements of the XML data and collect
# all of the elements in between the opening and closing tags
elements = []
match = False
for e in soup.find_all():
    if match is True:
        elements.append(str(e))
        if e.name==closing_name:
            break
    else:
        try:
            if e.name==opening_name and e.text==opening_text:
                match = True
                elements.append(str(e))
        except AttributeError:
            pass

# Output the results on a single line
print("".join(elements))

You would run it something like this:

python extract.py data.xml

For your given example data:

<ID>1</ID><data>asdf</data><data2>asdf</data2><dataX>asdf</dataX><dateAccessed>somedate</dateAccessed><ID>2</ID><data>asdf</data><data2>asdf</data2><dataX>asdf</dataX><dateAccessed>somedate</dateAccessed><ID>3</ID><data>asdf</data><data2>asdf</data2><dataX>asdf</dataX><dateAccessed>somedate</dateAccessed><ID>4</ID><data>asdf</data><data2>asdf</data2><dataX>asdf</dataX><dateAccessed>somedate</dateAccessed>

It produces the following output:

<ID>2</ID><data>asdf</data><data2>asdf</data2><dataX>asdf</dataX><dateAccessed>somedate</dateAccessed>
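For comparison, the same collect-until-closing-tag idea can be sketched with only the standard library's xml.etree.ElementTree. This is a minimal sketch, not the script above: the function name extract_record and the shortened sample data are made up for illustration, and it assumes that wrapping the input in a <root> element yields well-formed XML with no nesting.

```python
import xml.etree.ElementTree as ET


def extract_record(text, opening_name="ID", opening_text="2",
                   closing_name="dateAccessed"):
    """Return the elements from the matching opening element through the
    next closing_name element (inclusive), serialized on one line."""
    # The input has no root element, so wrap it in one (assumption: the
    # result is then well-formed XML).
    root = ET.fromstring("<root>" + text + "</root>")
    collected = []
    matching = False
    for child in root:  # direct children only; assumes a flat structure
        if (not matching and child.tag == opening_name
                and child.text == opening_text):
            matching = True
        if matching:
            collected.append(ET.tostring(child, encoding="unicode"))
            if child.tag == closing_name:
                break
    return "".join(collected)


# Shortened, hypothetical sample in the same shape as the question's data
sample = ("<ID>1</ID><data>asdf</data><dateAccessed>d1</dateAccessed>"
          "<ID>2</ID><data>qwer</data><dateAccessed>d2</dateAccessed>")
print(extract_record(sample))
```

If no ID element has the requested value, the function returns an empty string.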

Kusalananda · Accepted Answer · 2018-11-23 23:29:25Z

Assuming that the XML document actually has a root tag (your XML does not and is therefore not well formed), then you may use XMLstarlet like this:

xmlstarlet sel -t -m '//ID[. = 2]' \
    -c . -c './following-sibling::*[position()<5]' -nl file.xml

For the given data (modified to insert <root> at the start and </root> at the end), this would return

<ID>2</ID><data>asdf</data><data2>asdf</data2><dataX>asdf</dataX><dateAccessed>somedate</dateAccessed>

The XMLstarlet query selects any ID node whose content is 2 (-m '//ID[. = 2]'). For each of these nodes (only one in the given data), it returns a copy of the node itself (-c .) along with a copy of the following four sibling nodes (-c './following-sibling::*[position()<5]', i.e. the siblings at positions 1 to 4), ending the output by inserting a newline (-nl).

The <root> start and end tags could be inserted into the document itself, or be handed to XMLstarlet like so:

{ echo '<root>'; cat file.xml; echo '</root>'; } |
xmlstarlet sel -t -m '//ID[. = 2]' \
    -c . -c './following-sibling::*[position()<5]' -nl
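To make the selection logic concrete, here is a rough Python sketch of "the ID node plus a fixed number of following siblings". It is an approximation rather than a translation of the XPath: the function name id_with_siblings and the sample data are hypothetical, and it compares the ID text as a string, whereas the XPath test [. = 2] compares numerically.

```python
import xml.etree.ElementTree as ET


def id_with_siblings(text, wanted_id="2", sibling_count=4):
    """For each ID element whose text equals wanted_id, return that element
    plus the next sibling_count siblings, serialized as one string."""
    # Wrap the rootless input in a <root> element, as in the pipeline above.
    root = ET.fromstring("<root>" + text + "</root>")
    children = list(root)
    results = []
    for i, child in enumerate(children):
        if child.tag == "ID" and (child.text or "").strip() == wanted_id:
            # Slice covers the ID itself and the siblings that follow it.
            group = children[i:i + 1 + sibling_count]
            results.append(
                "".join(ET.tostring(e, encoding="unicode") for e in group))
    return results


# Hypothetical sample: three records of five elements each
sample = "".join(
    "<ID>{0}</ID><data>a{0}</data><data2>b{0}</data2>"
    "<dataX>c{0}</dataX><dateAccessed>d{0}</dateAccessed>".format(n)
    for n in (1, 2, 3))

for record in id_with_siblings(sample):
    print(record)
```

Like the XPath version, this relies on every record having exactly four elements after its ID; if records vary in length, stopping at the dateAccessed element is the safer rule.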

Community · Accepted Answer · 2020-06-11 14:16:50Z

Grep

grep -oE '<data>[^<]*</data>' yourxmlfile

Bash

tag='data'
tL="<$tag>" tR="</$tag>"
xml=$(< yourxmlfile)
while case $xml in *"$tL"* ) :;; * ) break;; esac; do
   t1=${xml#*"$tL"}
   t2=${t1%%"$tR"*}
   xml=${t1#*"$tR"}
   echo "${tL}${t2}${tR}"
done

Perl

perl -lne "print for/<$tag>.*?<\/$tag>/g" yourxmlfile

Sed

sed -e "
   s|<$tag>|\n&|
   s/.*\n//
   s|</$tag>|&\n|
   /\n/P;D
" yourxmlfile

Output

<data>asdf</data>
<data>asdf</data>
<data>asdf</data>
<data>asdf</data>
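The commands above match an opening tag and its own closing tag. A similar text-only approach, sketched here in Python with the re module, can also cover the span the edited question asks for, from a specific ID to the next closing dateAccessed tag. The function names and the sample string are hypothetical, and the sketch assumes the tags carry no attributes and the text really is as flat as the example.

```python
import re


def same_tag_matches(text, tag):
    """Return every <tag>...</tag> span, like the grep/perl one-liners."""
    pattern = re.compile(r"<{0}>.*?</{0}>".format(re.escape(tag)), re.DOTALL)
    return pattern.findall(text)


def id_to_closing(text, id_value, closing="dateAccessed"):
    """Return the span from <ID>id_value</ID> through the next closing tag,
    or None if that ID is not present."""
    pattern = re.compile(
        r"<ID>{}</ID>.*?</{}>".format(re.escape(id_value), re.escape(closing)),
        re.DOTALL)
    found = pattern.search(text)
    return found.group(0) if found else None


# Shortened, hypothetical sample in the same shape as the question's data
sample = ("<ID>1</ID><data>a</data><dateAccessed>x</dateAccessed>"
          "<ID>2</ID><data>b</data><dateAccessed>y</dateAccessed>")

print(same_tag_matches(sample, "data"))
print(id_to_closing(sample, "2"))
```

The non-greedy .*? is what stops the match at the first closing tag; a greedy .* would run to the last one in the file, which is one way to end up with nearly the whole line returned.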

Kamaraj · Accepted Answer · 2017-02-27 03:32:07Z

If you want to extract the ID value, and assuming ID always comes first on the line, you can use this:

awk -F"[<>]" '{print $3}' input.txt

If you want to search for a specific tag, try this awk command, changing the value of input=ID to the tag name you need:

awk -F"[<>]" '{for(i=1;i<=NF;i++)if($i~input){print $(i+1);next}}' input=ID input.txt
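To see what the awk field splitting produces, here is a small Python sketch of the same idea: split the line on < and > and read the field after the tag name. The helper names are hypothetical. Note that awk fields are 1-based, so awk's $3 is index 2 in the Python list.

```python
import re


def split_fields(line):
    """Split on '<' or '>' the way awk -F"[<>]" does."""
    return re.split(r"[<>]", line)


def first_value_after(line, name):
    """Return the field following the first field that matches name,
    mirroring the awk loop that prints $(i+1) and then stops."""
    fields = split_fields(line)
    for i, field in enumerate(fields[:-1]):
        if re.search(name, field):  # regex match, like awk's $i ~ input
            return fields[i + 1]
    return None


line = "<ID>1</ID><data>asdf</data>"
print(split_fields(line))
print(split_fields(line)[2])            # awk's $3
print(first_value_after(line, "data"))
```

Because the whole file is a single line, both awk commands report only the first match; they do not select a particular ID.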

user256118 · Accepted Answer · 2017-10-18 07:08:12Z

The provided XML has no line breaks. Why not insert \n between each >< pair, so that every element ends up on its own line?

Example: I created a file called stack containing the given XML.

Below is the sed operation that introduces the line breaks.

cat stack|sed -e 's/></>\n</g'
<ID>2</ID>
<data>asdf</data>
<data2>asdf</data2>
<dataX>asdf</dataX>
<dateAccessed>somedate</dateAccessed>

Now you can access the tags you want with ordinary line-based tools.
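As one possible follow-up step, here is a Python sketch that applies the same line-break trick and then keeps the lines from a given ID up to the next closing dateAccessed tag. The function name record_lines and the sample string are hypothetical, and it assumes the sequence >< only ever occurs between elements.

```python
def record_lines(text, wanted_id="2", closing="dateAccessed"):
    """Put one element per line, then return the lines from
    <ID>wanted_id</ID> through the first line ending in the closing tag."""
    # Same transformation as the sed command: a newline between '>' and '<'.
    lines = text.strip().replace("><", ">\n<").split("\n")
    start_marker = "<ID>{}</ID>".format(wanted_id)
    end_marker = "</{}>".format(closing)
    selected = []
    capturing = False
    for line in lines:
        if line == start_marker:
            capturing = True
        if capturing:
            selected.append(line)
            if line.endswith(end_marker):
                break
    return selected


# Shortened, hypothetical sample in the same shape as the question's data
sample = ("<ID>1</ID><data>a</data><dateAccessed>x</dateAccessed>"
          "<ID>2</ID><data>b</data><dateAccessed>y</dateAccessed>")

print("\n".join(record_lines(sample)))
```

Joining the returned lines with an empty string instead of "\n" gives the record back on a single line, tags intact.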
