
I want to extract the text between the opening and closing tags of an XML element using sed.

For example, in the XML below, I want to read the number 1234567.

<ns1:account>
  <ns2:name>Corporation</ns2:name>
  <address>
    <StrtNm>NewYork</StrtNm>
    <BldgNb>3</BldgNb>
    <PstCd>230300</PstCd>
    <Ctry>USA</Ctry>
  </address>
</ns1:account>
<ns3:details>
  <ns4:accnum>
    <ns5:info>
      <nd6:accnum>1234567</nd6:accnum>
    </ns5:info>
  </ns4:accnum>
</ns3:details>

I was able to do this with a combination of grep and sed, as below:

grep -oz '<.*details>\s*<.*accnum>\s*<.*info>\s*<.*accnum>[0-9]*</.*accnum>' test.xml |sed -n 's:.*<.*accnum>\(.*\)</.*accnum>.*:\1:p' 

However, I read that grep -oz performs poorly because it treats the entire file as a single line. So I tried two sed commands instead, but they only work if the file is formatted across multiple lines like the one shown above. They don't work if the XML arrives as a single line without pretty printing. This is what I tried:

sed -n '/.*details>/,/<\/.*accnum>/p' test.xml |sed -n 's:.*<.*accnum>\(.*\)<.*accnum>:\1:p' 

Challenges:

  1. The file can come with or without namespace prefixes in the elements.
  2. The file is fairly large, about 100 MB or more.
  3. The file contents can come as a properly formatted xml or as the entire xml as a single line.

I haven't tried awk yet because existing scripts in our application use the commands listed above, and I was hoping to get the same approach working.

  • Would it be reasonable to say that you want to extract the value for the nd6:accnum element?
    – Jeff Schaller
    Commented Jul 7, 2020 at 20:03
  • @JeffSchaller Yes, correct.
    Commented Jul 7, 2020 at 20:19
  • "The file contents can come as a properly formatted xml or as the entire xml as a single line". XML can be properly formatted as a single line. The two halves of your sentence are not opposites.
    Commented Jul 7, 2020 at 21:03

3 Answers


I've had to edit your XML to make it a well-formed document (adding the <root/> element and declaring the namespaces):

<?xml version="1.0"?>
<root xmlns:ns1="urn:ns1" xmlns:ns2="urn:ns2" xmlns:ns3="urn:ns3" xmlns:ns4="urn:ns4" xmlns:ns5="urn:ns5" xmlns:nd6="urn:nd6">
  <ns1:account>
    <ns2:name>Corporation</ns2:name>
    <address>
      <StrtNm>NewYork</StrtNm>
      <BldgNb>3</BldgNb>
      <PstCd>230300</PstCd>
      <Ctry>USA</Ctry>
    </address>
  </ns1:account>
  <ns3:details>
    <ns4:accnum>
      <ns5:info>
        <nd6:accnum>1234567</nd6:accnum>
      </ns5:info>
    </ns4:accnum>
  </ns3:details>
</root>

Having done that, I can use xmlstarlet to parse the XML file and extract precisely the element you need:

xmlstarlet sel -t -v '//nd6:accnum' -n x.xml
1234567

You can modify the XPath to be more precise, as necessary. For example /root/ns3:details/ns4:accnum/ns5:info/nd6:accnum would be an extreme option.

If you don't have xmlstarlet available I strongly recommend you install it. If the system is not yours to manage, make it a prerequisite of the project you're on. Trying to parse an XML file with sed and awk will work in the short-term but it's setting up technical debt down the road, particularly if you have little control over the precise layout of the XML document (whitespace, newlines, comments, etc.).
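Since the question says the elements may arrive with or without namespace prefixes, one option is to match on the local name instead of the prefixed name. The following is only a sketch, assuming xmlstarlet is installed: the function name accnum_values and the temporary sample file are placeholders, and the [not(*)] predicate is an assumption added here to skip the outer accnum wrapper, which contains only child elements.

```shell
# Sample document: a trimmed, well-formed version of the XML above.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<?xml version="1.0"?>
<root xmlns:ns3="urn:ns3" xmlns:ns4="urn:ns4" xmlns:ns5="urn:ns5" xmlns:nd6="urn:nd6">
  <ns3:details>
    <ns4:accnum>
      <ns5:info>
        <nd6:accnum>1234567</nd6:accnum>
      </ns5:info>
    </ns4:accnum>
  </ns3:details>
</root>
EOF

# Hypothetical helper: print the text of every leaf element whose local name
# is "accnum", whatever namespace prefix (if any) it carries.
accnum_values() {
    # local-name() ignores the prefix; [not(*)] keeps only elements that have
    # no child elements, which skips the outer <ns4:accnum> wrapper.
    xmlstarlet sel -t -v '//*[local-name()="accnum"][not(*)]' -n "$1"
}

if command -v xmlstarlet >/dev/null 2>&1; then
    accnum_values "$sample"
else
    echo "xmlstarlet is not installed" >&2
fi
```

Because the XPath never mentions a prefix, the same expression should also apply to a document that declares no namespaces at all.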


Using xidel and well-formed XML input (see @roaima's answer), we can run:

xidel -se '//nd6:accnum/text()' file.xml

where

• //nd6:accnum/text() is an XPath expression that looks for the element "nd6:accnum" anywhere in the document and selects its text.

This Perl one-liner prints the expected result:

perl -lne 'print "$1" if /<nd6:accnum>(\w+)</' file.xml
1234567
