Extract an attribute value from XML

Question

Using Bash,

File:

<?xml version="1.0" encoding="UTF-8"?>
<blah>
  <blah1 path="er" name="andy" remote="origin" branch="master" tag="true" />
  <blah1 path="er/er1" name="Roger" remote="origin" branch="childbranch" tag="true" />
  <blah1 path="er/er2" name="Steven" remote="origin" branch="master" tag="true" />
</blah>

I have tried the following:

grep -i 'name="andy" remote="origin" branch=".*\"' <filename>

But it returns the whole line:

<blah1 path="er" name="andy" remote="origin" branch="master" tag="true" />

I would like to match the line based on the following:

name="andy"

I just want it to return:

master

I guess I'll leave this here.
– JoL
Commented Jul 12, 2019 at 16:07

glenn jackman · Accepted Answer · 2019-07-11 20:53:25Z

45

Use an XML parser for parsing XML data. With xmlstarlet it just becomes an XPath exercise:

$ branch=$(xmlstarlet sel -t -v '//blah1[@name="andy"]/@branch' file.xml)
$ echo $branch
master


11
This is the better answer since it will continue to work even after someone decided to change the order of the attributes.
– Hermann
Commented Jul 11, 2019 at 21:05
4
@Hermann Or changes the whitespace, or adds another element with attributes name="andy" branch="foo", or changes the character encoding, or puts an escaped \" in the branch attribute, or or or... I agree; just use an XML parser!
– marcelm
Commented Jul 12, 2019 at 8:16
4
branch=$(xmllint --xpath 'string(//blah1[@name="andy"]/@branch)' file.xml) is the equivalent command with xmllint.
– David Conrad
Commented Jul 12, 2019 at 17:23
3
@DavidConrad make that an answer.
– RonJohn
Commented Jul 13, 2019 at 0:05
@RonJohn Done. I also decided to change it to an absolute XPath.
– David Conrad
Commented Jul 13, 2019 at 18:03


Freddy · Accepted Answer · 2019-07-11 20:19:09Z

18

With grep:

grep -Pio 'name="andy".*branch="\K[^"]*' file

-P enable Perl-compatible regular expressions (PCRE); this is a GNU grep extension
-i ignore case
-o print only the matched parts

In the regex, \K resets the start of the reported match: everything before \K must still match, but it is left out of the output. The effect is similar to a lookbehind.
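As a small illustration of what \K does, the following sketch compares the output of grep -o with and without it. It assumes a grep that supports -P (GNU grep built with PCRE); the sample line is copied from the question, and the variable names are made up for the example.

```shell
# Sample line from the question.
line='<blah1 path="er" name="andy" remote="origin" branch="master" tag="true" />'

# Without \K, -o prints everything the pattern matched.
with_prefix=$(printf '%s\n' "$line" | grep -Po 'branch="[^"]*')

# With \K, the text matched before it is dropped from the output.
value_only=$(printf '%s\n' "$line" | grep -Po 'branch="\K[^"]*')

printf '%s\n' "$with_prefix"   # branch="master
printf '%s\n' "$value_only"    # master
```

Both patterns match the same text; \K only changes which part of it is reported.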


Ah, using Grep, I tried to do this way, but I guess my knowledge was very limited and I kept getting frustrated :$
– John
Commented Jul 11, 2019 at 20:21
Wonderful solution, I learn every day.
– Edward
Commented Jul 11, 2019 at 20:28
4
Parsing XML using grep is asking for trouble. What if the order of the attributes changes? What if there's some other (non-blah1) element that has similar attributes? What if the branch name includes \"? Also, why -i? XML element and attribute names are case-sensitive. Now, all of these things are bugs waiting to surface at some point in the future. I recommend using the proper tool for the job; an XML parser.
– marcelm
Commented Jul 12, 2019 at 8:21
The -i is taken from OP and could be handy to handle the attribute values (Roger, Steven). If the branch name contained a ", it would have to be escaped as &quot; in the XML. Yes, you're right, XML may change, have line breaks, and so on; an XML parser is definitely the better answer, but OP asked for grep and it could be that he knows what he is doing.
– Freddy
Commented Jul 12, 2019 at 10:50


David Conrad · Accepted Answer · 2019-07-13 18:00:04Z

Use xmllint to extract the value of the attribute using XPath:

xmllint --xpath 'string(/blah/blah1[@name="andy"]/@branch)' file.xml

It's better to use an XML parser to process XML, since the order of the attributes can change and line breaks could be inserted, leaving the name and branch attributes on different lines of the file.
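To make the caveat concrete, here is a hedged sketch in which the andy element is split across two lines. The temporary file and its layout are invented for illustration; the xmllint call is the one from this answer and only runs if xmllint happens to be installed.

```shell
# Hypothetical variant of the question's file: same data, but the
# first element is broken across two lines (still valid XML).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<blah>
  <blah1 path="er" name="andy" remote="origin"
         branch="master" tag="true" />
  <blah1 path="er/er1" name="Roger" remote="origin" branch="childbranch" tag="true" />
</blah>
EOF

# A line-based pattern no longer sees name and branch together.
# grep -c exits non-zero when nothing matches, hence the "|| true".
line_hits=$(grep -c 'name="andy".*branch=' "$tmp" || true)
echo "lines matched by grep: $line_hits"

# An XML parser is unaffected by the line break.
if command -v xmllint >/dev/null 2>&1; then
    parsed=$(xmllint --xpath 'string(/blah/blah1[@name="andy"]/@branch)' "$tmp")
    echo "xmllint result: $parsed"
fi

rm -f "$tmp"
```

The grep count drops to zero even though the document still says the same thing, which is the failure mode described above.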

jesse_b · Accepted Answer · 2019-07-11 20:18:11Z

Using awk:

awk '/name="andy"/{ for (i=1;i<=NF;i++) { if ($i ~ "branch=") { sub(/branch=/, "", $i); gsub(/"/, "", $i); print $i } } }' input

This will find a line containing name="andy" and then loop through each field in that line. If the field contains branch= we will remove branch= and all double quotes from that field and print the remainder of the field.

sub(/branch=/, "", $i) looks for the first match of branch= in the field $i and replaces it with "" (nothing). Without the third argument, sub and gsub operate on the whole line ($0).

gsub is similar, except that it replaces globally (all occurrences instead of just the first).
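The difference between sub and gsub can be seen on a single field. This is a minimal sketch using a sample string in the shape of the question's branch attribute; the variable names are made up for the example.

```shell
# sub replaces only the first match, gsub replaces every match.
field='branch="master"'

first_only=$(printf '%s\n' "$field" | awk '{ sub(/"/, ""); print }')
all=$(printf '%s\n' "$field" | awk '{ gsub(/"/, ""); print }')

printf '%s\n' "$first_only"   # branch=master"
printf '%s\n' "$all"          # branch=master
```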

I wish I could rate this up but another answer is better as you mentioned. – John, Commented Jul 11, 2019 at 20:21
This is good but only works if branch is on the same line with name. – David Conrad, Commented Jul 12, 2019 at 16:28
@DavidConrad: Yes that is the requirement. If you notice, branch is on every line but OP only wants to return the value of branch that is on the same line as the name. – jesse_b, Commented Jul 12, 2019 at 16:39
That isn't exactly the requirement, though, that's just the way this file happens to look. XML allows whitespace, so if you break the lines on spaces it will still work with the highest-upvoted answer but it will break with awk. It's a caveat people using this solution should be aware of. That said, this is a good quick-and-dirty solution, and I upvoted you. – David Conrad, Commented Jul 12, 2019 at 17:15

Edward · Accepted Answer · 2019-07-11 20:12:00Z

I think this works:

$ grep -i 'name="andy" remote="origin" branch=".*\"' <filename> | awk -F' ' '{print $5}' | sed -E 's/branch=\"(.*)\"/\1/'
master

The awk part makes sure only branch="master" is returned, the sed part gives back what's between the double quotes with a reference (the \1 matches the part between the parentheses).
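The same idea (match the line, capture what is between the quotes, print only the capture) can also be sketched with sed alone. This is an illustrative variant, not the command from this answer; it assumes name appears before branch on the same line, and the variable names are made up.

```shell
# Sample line from the question.
line='<blah1 path="er" name="andy" remote="origin" branch="master" tag="true" />'

# -n suppresses the default output; the p flag prints only lines where
# the substitution succeeded. \1 is the text captured by the parentheses.
branch=$(printf '%s\n' "$line" |
    sed -n 's/.*name="andy".*branch="\([^"]*\)".*/\1/p')

printf '%s\n' "$branch"   # master
```

Like the pipeline above, this stops working if the attributes are reordered or split across lines.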

Now I know there are a lot of people out here with far more knowledge on the art that is awk and sed, so I'm prepared for some criticism :-)

But I am passing in the file though :$ Thanks a lot for the answer, I didn't think of using awk. I don't want to read each line, I kinda want to read the whole file and do this? Not possible? – John, Commented Jul 11, 2019 at 20:10
This works, but like any solution that doesn't treat the XML as XML, it will stop working if the order of attributes changes or line breaks are inserted. – David Conrad, Commented Jul 13, 2019 at 18:05

AnJo · Accepted Answer · 2019-11-20 12:54:46Z

If you don't have access to xmllint or xmlstarlet on your machine, make sure to transform your XML to one line before using grep, like this:

cat <filename> | tr -d '\n'

now you are sure that tags are not broken up on separate lines

| grep -Eo "<blah1[>\ ][^<]+name=\"andy\"[^>]+."

will cut out (like in xpath /blah1[@name="andy"])

<blah1 path="er" name="andy" remote="origin" branch="master" tag="true" />

now

| grep -oP "(?<=branch\=\")[^\"]*"

will return (like in xpath /@branch)

master

all together

cat <filename> | tr -d '\n'| grep -Eo "<blah1[>\ ][^<]+name=\"andy\"[^>]+." | grep -oP "(?<=branch\=\")[^\"]*"
