Read a word in XML between elements using sed command

Question

I want to read a word between two XML elements using the sed command.

For example, in the XML below, I want to read the number 1234567.

<ns1:account>
  <ns2:name>Corporation</ns2:name>
  <address>
    <StrtNm>NewYork</StrtNm>
    <BldgNb>3</BldgNb>
    <PstCd>230300</PstCd>
    <Ctry>USA</Ctry>
  </address>
</ns1:account>
<ns3:details>
  <ns4:accnum>
    <ns5:info>
      <nd6:accnum>1234567</nd6:accnum>
    </ns5:info>
  </ns4:accnum>
</ns3:details>

I was able to do this using a combination of grep and sed commands as below,

grep -oz '<.*details>\s*<.*accnum>\s*<.*info>\s*<.*accnum>[0-9]*</.*accnum>' test.xml |sed -n 's:.*<.*accnum>\(.*\)</.*accnum>.*:\1:p'

However, I read that grep -oz performs poorly because it treats the entire file as a single line. So I tried two sed commands instead, but that only works if the file is pretty-printed like the one shown above. It doesn't work if the XML arrives as a single line. This is what I tried:

sed -n '/.*details>/,/<\/.*accnum>/p' test.xml |sed -n 's:.*<.*accnum>\(.*\)<.*accnum>:\1:p'

Challenges:

The file can come with or without namespace prefixes in the elements.
The file is pretty large, about 100 MB or more.
The file contents can come as a properly formatted xml or as the entire xml as a single line.

I haven't tried awk command yet since there are existing scripts in our application which use the commands listed above, and I was hoping to get the same working.

Would it be reasonable to say that you want to extract the value for the nd6:accnum element? - Jeff Schaller, commented Jul 7, 2020 at 20:03
"The file contents can come as a properly formatted xml or as the entire xml as a single line". XML can be properly formatted as a single line. The two halves of your sentence are not opposites. - Chris Davies, commented Jul 7, 2020 at 21:03

Chris Davies · Accepted Answer · 2020-07-07 20:55:54Z

I've had to edit your XML to make it a well-formed document (adding the <root/> element and declaring the namespaces):

<?xml version="1.0"?>
<root xmlns:ns1="urn:ns1" xmlns:ns2="urn:ns2" xmlns:ns3="urn:ns3" xmlns:ns4="urn:ns4" xmlns:ns5="urn:ns5" xmlns:nd6="urn:nd6">
  <ns1:account>
    <ns2:name>Corporation</ns2:name>
    <address>
      <StrtNm>NewYork</StrtNm>
      <BldgNb>3</BldgNb>
      <PstCd>230300</PstCd>
      <Ctry>USA</Ctry>
    </address>
  </ns1:account>
  <ns3:details>
    <ns4:accnum>
      <ns5:info>
        <nd6:accnum>1234567</nd6:accnum>
      </ns5:info>
    </ns4:accnum>
  </ns3:details>
</root>

Having done that, I can use xmlstarlet to parse the XML file and extract precisely the element you need:

xmlstarlet sel -t -v '//nd6:accnum' -n x.xml
1234567

You can modify the XPath to be more precise, as necessary. For example /root/ns3:details/ns4:accnum/ns5:info/nd6:accnum would be an extreme option.
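Since the question mentions that elements may arrive with or without namespace prefixes, one option is to match on the local name instead of the qualified name. The following is a minimal sketch rather than a definitive recipe: the file name sample.xml and its contents are made up for illustration, and the not(*) predicate rests on the assumption that the wanted accnum is the one without child elements.

```shell
# Hypothetical sample: the inner accnum has no prefix here, the outer one does.
cat > sample.xml <<'EOF'
<?xml version="1.0"?>
<root xmlns:ns4="urn:ns4">
  <ns4:accnum>
    <info>
      <accnum>1234567</accnum>
    </info>
  </ns4:accnum>
</root>
EOF

# local-name() ignores the prefix; not(*) keeps only elements with no child
# elements, which skips the outer wrapper that is also called accnum.
xmlstarlet sel -t -v '//*[local-name()="accnum" and not(*)]' -n sample.xml
```

With the sample above this should print 1234567, whether or not the inner element carries a prefix.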

If you don't have xmlstarlet available I strongly recommend you install it. If the system is not yours to manage, make it a prerequisite of the project you're on. Trying to parse an XML file with sed and awk will work in the short-term but it's setting up technical debt down the road, particularly if you have little control over the precise layout of the XML document (whitespace, newlines, comments, etc.).
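Where installing a parser really is not possible, the layout problem described in the question (single-line versus pretty-printed input) can be sidestepped by splitting the stream on `>` before sed sees it. This is only a stopgap sketch, not a parser, and it carries the technical debt mentioned above. It assumes the account number is purely numeric, sits directly between the tags with no surrounding whitespace, and that no comment or CDATA section contains similar text. The function name extract_accnum and the file name sample.xml are hypothetical.

```shell
# Print the numeric text of any <accnum> or <prefix:accnum> leaf element.
extract_accnum() {
  # tr turns every '>' into a newline, so each text node starts a line and
  # sed works on short lines even when the XML arrives as one huge line.
  # The sed pattern keeps lines of the form DIGITS</accnum or DIGITS</prefix:accnum.
  tr '>' '\n' < "$1" |
    sed -n 's|^\([0-9][0-9]*\)</\([A-Za-z0-9_.-]*:\)\{0,1\}accnum$|\1|p'
}

# Hypothetical single-line input, mimicking XML without pretty printing.
printf '%s\n' '<ns3:details><ns4:accnum><ns5:info><nd6:accnum>1234567</nd6:accnum></ns5:info></ns4:accnum></ns3:details>' > sample.xml

extract_accnum sample.xml
```

The outer accnum wrapper is skipped because its closing tag is never preceded by digits on the same line after the split.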

JJoao · Accepted Answer · 2020-07-07 21:24:34Z

Using xidel and well-formed XML input (see @roaima's answer), we can run:

xidel -se '//nd6:accnum/text()' file.xml

where

//nd6:accnum/text() is an XPath expression that looks for the element "nd6:accnum" anywhere in the document and selects its text.

LeadingEdger · Accepted Answer · 2020-07-08 03:20:55Z

This Perl one-liner prints the expected result:

perl -lne 'print "$1" if /<nd6:accnum>(\w+)</' file.xml
1234567
