2

I have looked but haven't been able to find anyone else with the same sort of problem I have.

I have an xml file like this:

<ID>1</ID><data>asdf</data><data2>asdf</data2><dataX>asdf</dataX><dateAccessed>somedate</dateAccessed><ID>2</ID><data>asdf</data><data2>asdf</data2><dataX>asdf</dataX><dateAccessed>somedate</dateAccessed><ID>3</ID><data>asdf</data><data2>asdf</data2><dataX>asdf</dataX><dateAccessed>somedate</dateAccessed><ID>4</ID><data>asdf</data><data2>asdf</data2><dataX>asdf</dataX><dateAccessed>somedate</dateAccessed> 

It is basically a whole bunch of data all on one line, with no line breaks. I need to extract the info (preferably as-is, with tags intact) between a specific <ID> tag (e.g. <ID>2) and the very next </dateAccessed> tag. I have about 50 files to check for a particular ID and the related data that follows it. I get that this is not standard; there is no nesting.

I originally tried to do this using grep and sed, but I just get the whole file returned, which seems odd to me. Can't I just treat this like a text file?

EDIT:

I didn't realise the formatter removed text that was enclosed in < and >, so after re-reading my question this morning, I realised it was asking something completely different. TL;DR: I need everything from a specific value in the ID tags up to the next closing dateAccessed tag, not what is between the same opening and closing tags (i.e. between ID and /ID).

So I can get something like this result:

<ID>2</ID><data>asdf</data><data2>asdf</data2><dataX>asdf</dataX><dateAccessed>somedate</dateAccessed> 
1
  • 4
    I can't help but feel this is the "wrong question". If you're working with XML files you should really be using an XML parser (such as xmlstarlet). I appreciate this won't give you an unbalanced segment, and so is not a suitable answer to your question as asked. But trying to treat XML as text will almost certainly lead to unintended consequences down the road. It's not a good place to be. Really.
    Commented Oct 18, 2017 at 7:14

5 Answers

1

As noted in the comments, your data isn't well-formed XML and it isn't completely clear what the structure of your document is, e.g. judging by your example data, it looks like you have no nested elements - is that really the case?

With that caveat in mind, here's a Python script that uses the BeautifulSoup4 parsing library to do what you want (i.e. it produces the desired output data for the given example input data):

#!/usr/bin/env python
# coding: ascii
"""extract.py

Extract everything between two XML tags in a
(possibly poorly formed) XML document."""

from bs4 import BeautifulSoup
import sys

# Set the opening tag name and value
opening_name = "ID"
opening_text = "2"

# Set the closing tag name
closing_name = "dateAccessed"

# Get the XML data from a file and instantiate a BeautifulSoup parser
# We add a root node because the input data is missing a root
with open(sys.argv[1], 'r') as xmlfile:
    xmldoc = "<root>" + xmlfile.read() + "</root>"
    soup = BeautifulSoup(xmldoc, 'xml')

# Iterate through the elements of the XML data and collect
# all of the elements inbetween the opening and closing tags
elements = []
match = False
for e in soup.find_all():
    if match is True:
        elements.append(str(e))
        if e.name==closing_name:
            break
    else:
        try:
            if e.name==opening_name and e.text==opening_text:
                match = True
                elements.append(str(e))
        except AttributeError:
            pass

# Output the results on a single line
print("".join(elements))

You would run it something like this:

python extract.py data.xml 

For your given example data:

<ID>1</ID><data>asdf</data><data2>asdf</data2><dataX>asdf</dataX><dateAccessed>somedate</dateAccessed><ID>2</ID><data>asdf</data><data2>asdf</data2><dataX>asdf</dataX><dateAccessed>somedate</dateAccessed><ID>3</ID><data>asdf</data><data2>asdf</data2><dataX>asdf</dataX><dateAccessed>somedate</dateAccessed><ID>4</ID><data>asdf</data><data2>asdf</data2><dataX>asdf</dataX><dateAccessed>somedate</dateAccessed> 

It produces the following output:

<ID>2</ID><data>asdf</data><data2>asdf</data2><dataX>asdf</dataX><dateAccessed>somedate</dateAccessed> 
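If installing BeautifulSoup is not an option, the same span can be pulled out with the standard library alone by treating the file as plain text. This is only a rough sketch, assuming (as in the example data) that elements are never nested and carry no attributes; the function name `extract_record` and the inline `sample` string are made up for illustration.

```python
import re

def extract_record(text, id_value, open_tag="ID", close_tag="dateAccessed"):
    """Return the text from <ID>id_value</ID> through the next
    </dateAccessed>, or None if no such span exists."""
    pattern = re.compile(
        r"<{0}>{1}</{0}>.*?</{2}>".format(
            re.escape(open_tag), re.escape(id_value), re.escape(close_tag)
        ),
        re.DOTALL,  # let the non-greedy .*? cross line breaks, if any
    )
    found = pattern.search(text)
    return found.group(0) if found else None

# Shortened stand-in for the one-line file from the question
record = ("<ID>{0}</ID><data>asdf</data><data2>asdf</data2>"
          "<dataX>asdf</dataX><dateAccessed>somedate</dateAccessed>")
sample = "".join(record.format(n) for n in (1, 2, 3, 4))

print(extract_record(sample, "2"))
```

The non-greedy `.*?` is what stops the match at the first closing dateAccessed tag after the chosen ID rather than the last one in the file.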
    1

    Assuming that the XML document actually has a root tag (your XML does not and is therefore not well formed), then you may use XMLstarlet like this:

    xmlstarlet sel -t -m '//ID[. = 2]' \
        -c . -c './following-sibling::*[position()<5]' -nl file.xml

    For the given data (modified to insert <root> at the start and </root> at the end), this would return

    <ID>2</ID><data>asdf</data><data2>asdf</data2><dataX>asdf</dataX><dateAccessed>somedate</dateAccessed> 

    The XMLstarlet query selects any ID node whose content is 2 (-m '//ID[. = 2]'). For each of these nodes (only one in the given data), it returns a copy of the node itself (-c .) along with a copy of the following four sibling nodes (-c './following-sibling::*[position()<5]'), ending the output by inserting a newline (-nl).

    The <root> start and end tags could be inserted into the document itself, or be handed to XMLstarlet like so:

    { echo '<root>'; cat file.xml; echo '</root>'; } |
    xmlstarlet sel -t -m '//ID[. = 2]' \
        -c . -c './following-sibling::*[position()<5]' -nl
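For comparison, roughly the same selection (the matching ID node plus its next four siblings, with a root element wrapped around the data) can be sketched with Python's standard xml.etree.ElementTree. The function name `id_with_following_siblings` and the inline `sample` are illustrative only, and the ID is compared as a string here rather than numerically as in the XPath expression.

```python
import xml.etree.ElementTree as ET

def id_with_following_siblings(fragment, id_value, count=4):
    """Return the serialized <ID> element whose text equals id_value,
    followed by its next `count` sibling elements, or None."""
    # The data has no root element, so wrap it in one before parsing
    root = ET.fromstring("<root>" + fragment + "</root>")
    children = list(root)
    for i, child in enumerate(children):
        if child.tag == "ID" and (child.text or "").strip() == id_value:
            picked = children[i:i + 1 + count]
            return "".join(ET.tostring(c, encoding="unicode") for c in picked)
    return None

# Shortened stand-in for the one-line file from the question
record = ("<ID>{0}</ID><data>asdf</data><data2>asdf</data2>"
          "<dataX>asdf</dataX><dateAccessed>somedate</dateAccessed>")
sample = "".join(record.format(n) for n in (1, 2, 3, 4))

print(id_with_following_siblings(sample, "2"))
```

Like the XPath version, this depends on every record having exactly four elements after the ID; if that count varies, stopping at the first dateAccessed sibling would be the safer rule.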
      -1

      Grep

      grep -oE '<data>[^<]*</data>' yourxmlfile 

      Bash

      tag='data'
      tL="<$tag>" tR="</$tag>"
      xml=$(< yourxmlfile)
      while case $xml in *"$tL"* ) :;; * ) break;; esac; do
          t1=${xml#*"$tL"}
          t2=${t1%%"$tR"*}
          xml=${t1#*"$tR"}
          echo "${tL}${t2}${tR}"
      done

      Perl

      perl -lne "print for/<$tag>.*?<\/$tag>/g" yourxmlfile 

      Sed

      sed -e "
         s|<$tag>|\n&|
         s/.*\n//
         s|</$tag>|&\n|
         /\n/P;D
      " yourxmlfile

      Output

      <data>asdf</data>
      <data>asdf</data>
      <data>asdf</data>
      <data>asdf</data>
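The grep variant above translates almost directly to Python's re module, which may be handy if the result needs further processing. A small sketch, assuming elements without attributes or nested children; `find_elements` and `sample` are names invented for this example.

```python
import re

def find_elements(text, tag):
    """Return every <tag>...</tag> element whose content contains no '<'."""
    pattern = r"<{0}>[^<]*</{0}>".format(re.escape(tag))
    return re.findall(pattern, text)

# Shortened stand-in for the one-line file from the question
record = ("<ID>{0}</ID><data>asdf</data><data2>asdf</data2>"
          "<dataX>asdf</dataX><dateAccessed>somedate</dateAccessed>")
sample = "".join(record.format(n) for n in (1, 2, 3, 4))

for element in find_elements(sample, "data"):
    print(element)
```

Because the pattern requires a literal `>` right after the tag name, searching for data does not pick up data2 or dataX.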
        -2

        If you want to extract the ID value, and assuming ID always comes as the first tag, you can use this:

        awk -F"[<>]" '{print $3}' input.txt 

        If you want to search for a specific tag, try this awk command. Change the value of input=ID to the tag you need:

        awk -F"[<>]" '{for(i=1;i<=NF;i++)if($i~input){print $(i+1);next}}' input=ID input.txt 
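To see what those awk commands are working with, here is a rough Python equivalent of splitting a line on < and >. It is only a sketch: `fields` and `first_value_after` are hypothetical helper names, and Python lists count from 0, so awk's $3 is index 2.

```python
import re

def fields(line):
    """Split a line on '<' or '>', like awk -F"[<>]"."""
    return re.split(r"[<>]", line)

def first_value_after(line, name):
    """Return the field following the first field that matches `name`
    (a regex, as with awk's ~ operator), or None if nothing matches.
    The last field is skipped because nothing follows it."""
    parts = fields(line)
    for i, part in enumerate(parts[:-1]):
        if re.search(name, part):
            return parts[i + 1]
    return None

# Shortened stand-in for the one-line file from the question
record = ("<ID>{0}</ID><data>asdf</data><data2>asdf</data2>"
          "<dataX>asdf</dataX><dateAccessed>somedate</dateAccessed>")
sample = "".join(record.format(n) for n in (1, 2, 3, 4))

print(fields(sample)[2])                # same field as awk's $3
print(first_value_after(sample, "ID"))
```

Both calls only ever report the first record on the line, which is also how the awk commands behave on a file with no line breaks.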
          -4

          The provided XML has no line breaks. Why not insert \n between each >< pair, so that every element sits on its own line?

          Example: I created a file called stack containing the given XML.

          Below is the sed operation that introduces the line breaks, followed by part of its output:

          cat stack|sed -e 's/></>\n</g'

          <ID>2</ID>
          <data>asdf</data>
          <data2>asdf</data2>
          <dataX>asdf</dataX>
          <dateAccessed>somedate</dateAccessed>

          Now you can access the tags you want.
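Once each element is on its own line, picking out one record is a matter of scanning from the wanted ID line to the next dateAccessed line. A minimal sketch of both steps in Python, assuming the same flat structure as the example; `split_tags`, `record_lines` and `sample` are names made up for this illustration.

```python
def split_tags(text):
    """Put each element on its own line, like sed 's/></>\\n</g'."""
    return text.replace("><", ">\n<").splitlines()

def record_lines(lines, id_value):
    """Collect lines from <ID>id_value</ID> through the next
    <dateAccessed> line (or to the end if none follows)."""
    start = "<ID>{}</ID>".format(id_value)
    picked = []
    collecting = False
    for line in lines:
        stripped = line.strip()
        if stripped == start:
            collecting = True
        if collecting:
            picked.append(stripped)
            if stripped.startswith("<dateAccessed>"):
                break
    return picked

# Shortened stand-in for the one-line file from the question
record = ("<ID>{0}</ID><data>asdf</data><data2>asdf</data2>"
          "<dataX>asdf</dataX><dateAccessed>somedate</dateAccessed>")
sample = "".join(record.format(n) for n in (1, 2, 3, 4))

print("".join(record_lines(split_tags(sample), "2")))
```

Joining the selected lines back together gives the single-line form asked for in the question.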
