18

Using Bash,

File:

<?xml version="1.0" encoding="UTF-8"?>
<blah>
  <blah1 path="er" name="andy" remote="origin" branch="master" tag="true" />
  <blah1 path="er/er1" name="Roger" remote="origin" branch="childbranch" tag="true" />
  <blah1 path="er/er2" name="Steven" remote="origin" branch="master" tag="true" />
</blah>

I have tried the following:

grep -i 'name="andy" remote="origin" branch=".*\"' <filename> 

But it returns the whole line:

<blah1 path="er" name="andy" remote="origin" branch="master" tag="true" /> 

I would like to match the line based on the following:

name="andy" 

I just want it to return:

master 
1

6 Answers

45

Use an XML parser for parsing XML data. With xmlstarlet it just becomes an XPath exercise:

$ branch=$(xmlstarlet sel -t -v '//blah1[@name="andy"]/@branch' file.xml)
$ echo $branch
master
6
  • 11
    This is the better answer since it will continue to work even after someone decided to change the order of the attributes.
    – Hermann
    Commented Jul 11, 2019 at 21:05
  • 4
    @Hermann Or changes the whitespace, or adds another element with attributes name="andy" branch="foo", or changes the character encoding, or puts an escaped \" in the branch attribute, or or or... I agree; just use an XML parser!
    – marcelm
    Commented Jul 12, 2019 at 8:16
  • 4
    branch=$(xmllint --xpath 'string(//blah1[@name="andy"]/@branch)' file.xml) is the equivalent command with xmllint.
    Commented Jul 12, 2019 at 17:23
  • 3
    @DavidConrad make that an answer.
    – RonJohn
    Commented Jul 13, 2019 at 0:05
  • @RonJohn Done. I also decided to change it to an absolute XPath.
    Commented Jul 13, 2019 at 18:03
18

With grep:

grep -Pio 'name="andy".*branch="\K[^"]*' file 
  • -P enables Perl-compatible regular expressions (PCRE)
  • -i ignores case
  • -o prints only the matched parts

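In the regex, \K resets the start of the reported match: everything before \K must still match, but it is left out of the output. The effect is similar to a lookbehind, except that the part before \K may have variable length.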
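To make the effect of \K concrete, here is a minimal sketch that runs the same pattern with and without it. It assumes GNU grep built with PCRE support (-P); the temporary file and the variable names are only for illustration.

```shell
# Two sample lines copied from the question, written to a throwaway file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<blah1 path="er" name="andy" remote="origin" branch="master" tag="true" />
<blah1 path="er/er1" name="Roger" remote="origin" branch="childbranch" tag="true" />
EOF

# Without \K, -o prints everything the pattern matched.
full=$(grep -Po 'name="andy".*branch="[^"]*' "$sample")

# With \K, the text before it must still match but is dropped from the output.
branch=$(grep -Po 'name="andy".*branch="\K[^"]*' "$sample")

printf 'without \\K: %s\n' "$full"
printf 'with \\K:    %s\n' "$branch"

rm -f "$sample"
```

The first command prints the match starting at name="andy"; the second prints only the branch value.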

4
  • Ah, using Grep, I tried to do this way, but I guess my knowledge was very limited and I kept getting frustrated :$
    – John
    Commented Jul 11, 2019 at 20:21
  • Wonderful solution, I learn every day.
    – Edward
    Commented Jul 11, 2019 at 20:28
  • 4
    Parsing XML using grep is asking for trouble. What if the order of the attributes changes? What if there's some other (non-blah1) element that has similar attributes? What if the branch name includes \"? Also, why -i? XML element and attribute names are case-sensitive. Now, all of these things are bugs waiting to surface at some point in the future. I recommend using the proper tool for the job; an XML parser.
    – marcelm
    Commented Jul 12, 2019 at 8:21
  • The -i is taken from the OP and could be handy for matching the attribute values (Roger, Steven). If the branch name contained a ", it should have been escaped as &quot;. Yes, you're right: XML may change, have line breaks, and so on. An XML parser is definitely the better answer, but the OP asked for grep, and it could be that he knows what he is doing.
    – Freddy
    Commented Jul 12, 2019 at 10:50
12

Use xmllint to extract the value of the attribute using XPath:

xmllint --xpath 'string(/blah/blah1[@name="andy"]/@branch)' file.xml 

It's better to use an XML parser to process XML, since the order of the attributes can change and line breaks could be inserted, leaving the name and branch attributes on different lines of the file.
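The following sketch illustrates that caveat. It assumes a reformatted variant of the question's file (attribute order changed, attributes split across lines; this variant is made up for the demonstration) and compares a line-based grep with the XPath query. The xmllint call is skipped if the tool is not installed.

```shell
# Hypothetical variant of the file: same data, different layout.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<blah>
  <blah1 branch="master"
         tag="true"
         name="andy" path="er" remote="origin" />
  <blah1 path="er/er1" name="Roger" remote="origin" branch="childbranch" tag="true" />
</blah>
EOF

# Line-based matching finds nothing: name and branch now sit on different lines.
line_based=$(grep -o 'name="andy".*branch="[^"]*' "$sample" || true)

# The XPath query does not depend on attribute order or line breaks.
if command -v xmllint >/dev/null 2>&1; then
    xpath_based=$(xmllint --xpath 'string(/blah/blah1[@name="andy"]/@branch)' "$sample")
else
    xpath_based="(xmllint not installed)"
fi

printf 'grep:    "%s"\n' "$line_based"
printf 'xmllint: "%s"\n' "$xpath_based"

rm -f "$sample"
```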

    3

    Using awk:

    awk '/name="andy"/{ for (i=1;i<=NF;i++) { if ($i ~ "branch=") { sub(/branch=/, "", $i); gsub(/"/, "", $i); print $i } } }' input 

    This will find a line containing name="andy" and then loop through each field in that line. If the field contains branch=, we remove branch= and all double quotes from that field and print what remains.

    sub(/branch=/, "", $i) looks for a match of branch= in field $i and replaces it with "" (nothing).

    gsub is similar, except that it replaces globally (all occurrences instead of just the first occurrence).
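As a small illustration of the difference between sub and gsub, the sketch below applies each to the same field. The variable names are only for the example.

```shell
field='branch="master"'

# sub() replaces only the first match, so the closing quote is left behind.
first_only=$(printf '%s\n' "$field" | awk '{ sub(/"/, ""); print }')

# gsub() replaces every match on the line.
all=$(printf '%s\n' "$field" | awk '{ gsub(/"/, ""); print }')

echo "$first_only"   # branch=master"
echo "$all"          # branch=master
```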

    5
    • Thank you so much, I will google to understand sub and gsub
      – John
      Commented Jul 11, 2019 at 20:16
    • I wish I could rate this up but another answer is better as you mentioned.
      – John
      Commented Jul 11, 2019 at 20:21
    • This is good but only works if branch is on the same line with name.
      Commented Jul 12, 2019 at 16:28
    • @DavidConrad: Yes that is the requirement. If you notice, branch is on every line but OP only wants to return the value of branch that is on the same line as the name.
      – jesse_b
      Commented Jul 12, 2019 at 16:39
    • That isn't exactly the requirement, though, that's just the way this file happens to look. XML allows whitespace, so if you break the lines on spaces it will still work with the highest-upvoted answer but it will break with awk. It's a caveat people using this solution should be aware of. That said, this is a good quick-and-dirty solution, and I upvoted you.
      Commented Jul 12, 2019 at 17:15
    1

    I think this works:

    $ grep -i 'name="andy" remote="origin" branch=".*\"' <filename> | awk -F' ' '{print $5}' | sed -E 's/branch=\"(.*)\"/\1/'
    master

    The awk part makes sure only branch="master" is returned; the sed part gives back what's between the double quotes using a backreference (\1 refers to the part captured by the parentheses).
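The sketch below shows the backreference on its own and why the awk step matters here: (.*) is greedy, so if sed sees more than one attribute it captures up to the last double quote. The [^"]* variant is an alternative offered as an assumption, not part of the command above.

```shell
# One attribute: \1 is exactly the text between the quotes.
field='branch="master"'
value=$(printf '%s\n' "$field" | sed -E 's/branch="(.*)"/\1/')

# Two attributes: the greedy (.*) runs on to the last quote.
two='branch="master" tag="true"'
greedy=$(printf '%s\n' "$two" | sed -E 's/branch="(.*)"/\1/')

# Bounding the capture with [^"]* stops at the first closing quote.
bounded=$(printf '%s\n' "$two" | sed -E 's/branch="([^"]*)".*/\1/')

echo "$value"     # master
echo "$greedy"    # master" tag="true
echo "$bounded"   # master
```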

    Now I know there are a lot of people out here with far more knowledge on the art that is awk and sed, so I'm prepared for some criticism :-)

    3
    • But I am passing in the file though :$ Thanks a lot for the answer, I didn't think of using awk. I don't want to read each line, I kinda want to read the whole file and do this? Not possible?
      – John
      Commented Jul 11, 2019 at 20:10
    • Editing my answer to show you how to pipe it through.
      – Edward
      Commented Jul 11, 2019 at 20:11
    • This works, but like any solution that doesn't treat the XML as XML, it will stop working if the order of attributes changes or line breaks are inserted.
      Commented Jul 13, 2019 at 18:05
    0

    If you don't have access to xmllint or xmlstarlet on your machine, make sure to collapse your XML onto one line before using grep, like this:

    cat <filename> | tr -d '\n' 

    now you are sure that tags are not broken up on separate lines

    | grep -Eo "<blah1[>\ ][^<]+name=\"andy\"[^>]+." 

    will cut out (like in xpath /blah1[@name="andy"])

    <blah1 path="er" name="andy" remote="origin" branch="master" tag="true" /> 

    now

    | grep -oP "(?<=branch\=\")[^\"]*" 

    will return (like in xpath /@branch)

    master

    all together

    cat <filename> | tr -d '\n'| grep -Eo "<blah1[>\ ][^<]+name=\"andy\"[^>]+." | grep -oP "(?<=branch\=\")[^\"]*" 
