3

I've got to replace some attribute content in an XML tag, depending on a parameter $1.

For example, the input looks like this:

<ThreadGroup guiclass="ThreadGroupGui" testclass="ThreadGroup" testname="OO CSS DPM PRI" enabled="true">
<ThreadGroup guiclass="ThreadGroupGui" testclass="ThreadGroup" testname="AA CSS DPM PRI" enabled="true">

If the testname attribute does not contain $1, then replace the enabled value with false; otherwise (testname does contain $1), replace the enabled value with true.

NOTE: the tags may carry more attributes than shown in this example.

I thought about sed but maybe other tools can do it better?

5
  • 2
    Use an XML parser. xmlstarlet for example. Or one of the many XML libraries for languages like perl or python, or almost any other language you might care to write in.
    – cas
    Commented May 12, 2016 at 0:38
  • I used it before, but in a dev environment. My SYS team doesn't want to add software in test or prod if it's not necessary. Here, if sed can do it, there's no need for xmlstarlet (even if it's a great tool!)
    – buzz2buzz
    Commented May 12, 2016 at 5:33
  • 1
    sed can in some very specific and simple cases make some simple changes to a text stream that happens to contain XML. It can't, in the general case, reliably edit XML. Neither can any other regular-expression based method or tool. The only way to do it reliably is to use an XML parser. If you can't convince the sysad team to install any extra tools, what language is used for your main production code? There will probably be an XML parser for that....if your code is producing XML output or uses XML data then you probably even have it already installed.
    – cas
    Commented May 12, 2016 at 5:42
  • In case of some specific code points, yes, we've got Java XML libraries. Here it's sys scope. As I said, if I can make it work with sed, there's no need to use xmlstarlet or another tool. In my case, I only edit the content of a tag. I don't need to check inner or outer tags, so it's nothing 'difficult'. So I personally think that sed is enough. In conclusion: yes, an XML parser is the best, but here sed might be OK. If our needs increase in complexity (XML or HTML complexity, I mean), I will do my best to get an XML parser installed on our servers.
    – buzz2buzz
    Commented May 12, 2016 at 5:50
  • 2
    Please - don't make it work in sed. That's about on a par with putting screws in with a hammer. It sort of works, but the result is ugly and not as robust. And a screwdriver isn't exactly a hard thing to acquire.
    – Sobrique
    Commented May 13, 2016 at 12:35

6 Answers

4

No one has said it yet, so I will. PLEASE don't parse XML using regular expressions. XML is a contextual language, and regular expressions aren't. This means you create brittle code that might one day break messily.

For more examples, see: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

PLEASE use a parser. They exist in many languages - personally, I like perl, and your task goes a bit like this:

#!/usr/bin/env perl
use strict;
use warnings;

#parser library
use XML::Twig;

#ingest data
my $twig = XML::Twig -> parse (\*DATA);

#iterate all tags <ThreadGroup>
foreach my $group ( $twig -> get_xpath('//ThreadGroup') ) {
   #check testname regex match
   if ( $group -> att('testname') =~ /AA/ ) {
      #set enabled
      $group -> set_att('enabled', 'true');
   }
   else {
      #set disabled
      $group -> set_att('enabled', 'false');
   }
}

#pretty print options vary, see man page.
$twig -> set_pretty_print('indented_a');
$twig -> print;

__DATA__
<xml>
<ThreadGroup guiclass="ThreadGroupGui" testclass="ThreadGroup" testname="OO CSS DPM PRI" enabled="true" />
<ThreadGroup guiclass="ThreadGroupGui" testclass="ThreadGroup" testname="AA CSS DPM PRI" enabled="true" />
</xml>

And yes - it is necessary to use an XML parser, because regular expressions cannot do it safely. There are a bunch of things in XML that are semantically identical (attribute ordering, line feeds, unary tags, etc.) but look different to a regex. A parser won't be caught out by them.

The above can be cut down to a one liner if you prefer:

perl -MXML::Twig -e 'XML::Twig -> new ( twig_handlers => { ThreadGroup => sub { $_ -> set_att("enabled", $_ -> att("testname") =~ /AA/ ? "true" : "false" ) } } ) -> parsefile_inplace("yourfile")' 

Your sysadmin team should thank you for doing this (that's not to say they will) because any solution based on regular expressions might break one day, for no apparent reason.

As a most trivial example - your XML is semantically identical to:

<xml>
<ThreadGroup enabled="true" guiclass="ThreadGroupGui" testclass="ThreadGroup" testname="OO CSS DPM PRI" />
<ThreadGroup enabled="true" guiclass="ThreadGroupGui" testclass="ThreadGroup" testname="AA CSS DPM PRI" />
</xml>

Or:

<xml>
<ThreadGroup enabled="true" guiclass="ThreadGroupGui" testclass="ThreadGroup" testname="OO CSS DPM PRI"/>
<ThreadGroup enabled="true" guiclass="ThreadGroupGui" testclass="ThreadGroup" testname="AA CSS DPM PRI"/>
</xml>

Or:

<xml><ThreadGroup enabled="true" guiclass="ThreadGroupGui" testclass="ThreadGroup" testname="OO CSS DPM PRI"/><ThreadGroup enabled="true" guiclass="ThreadGroupGui" testclass="ThreadGroup" testname="AA CSS DPM PRI"/></xml>

Or:

<xml
><ThreadGroup
enabled="true"
guiclass="ThreadGroupGui"
testclass="ThreadGroup"
testname="OO CSS DPM PRI"
/><ThreadGroup
enabled="true"
guiclass="ThreadGroupGui"
testclass="ThreadGroup"
testname="AA CSS DPM PRI"
/></xml>

And that's before we get into attribute ordering, possible tag nesting, or other substrings that 'match' in places you don't expect.
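For anyone without Perl modules to hand, the same approach can be sketched with Python's standard-library xml.etree.ElementTree. This is only an illustration under assumptions: the thread mentions Python XML libraries in passing but gives no Python code, the function name set_enabled is made up here, and the match is a plain substring test rather than a regex. It feeds two of the layouts above through the same function to show that a parser treats them alike.

```python
import xml.etree.ElementTree as ET


def set_enabled(xml_text, needle):
    """Parse xml_text and set enabled= on every ThreadGroup element.

    enabled becomes "true" when testname contains needle as a plain
    substring (not a regex), and "false" otherwise. Returns the root.
    set_enabled is a hypothetical helper name, not part of any library.
    """
    root = ET.fromstring(xml_text)
    # root.iter() walks the whole tree, so nesting depth does not matter.
    for group in root.iter("ThreadGroup"):
        name = group.get("testname", "")
        group.set("enabled", "true" if needle in name else "false")
    return root


# Two layouts of the same data: one tag per line, and everything on one line
# with the attributes in a different order.
one_per_line = """<xml>
<ThreadGroup guiclass="ThreadGroupGui" testclass="ThreadGroup" testname="OO CSS DPM PRI" enabled="true" />
<ThreadGroup guiclass="ThreadGroupGui" testclass="ThreadGroup" testname="AA CSS DPM PRI" enabled="true" />
</xml>"""

single_line = (
    '<xml><ThreadGroup enabled="true" testname="OO CSS DPM PRI"/>'
    '<ThreadGroup enabled="true" testname="AA CSS DPM PRI"/></xml>'
)

for text in (one_per_line, single_line):
    root = set_enabled(text, "AA")
    print([(g.get("testname"), g.get("enabled")) for g in root.iter("ThreadGroup")])
```

Both layouts print the same list, with the OO group set to false and the AA group set to true. To write the result back out, ET.tostring(root, encoding="unicode") serializes the tree, though it will not preserve the original whitespace inside tags.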

3
  • 2
    excellent answer, the page you linked to is one of the most famous stackoverflow posts ever, but I never really understood why parsing HTML/XML with regex is so bad, but two lines into your explanation I was like oh snap, yeah, wow DON'T parse with regex. Also as you have pointed out there are parsers out there that actually make the job easier than trying to do it with e.g. sed
    Commented May 13, 2016 at 12:50
  • Thanks for your reply. I'm aware of that stuff. If one day it fails, I'll push as hard as I can for an XML parser; for the moment I can't. Please don't blame me, it's not my decision
    – buzz2buzz
    Commented May 13, 2016 at 12:54
  • 1
    I don't blame you - we all get put in positions we'd rather not be in. What I'm trying to point out is that this is a seriously messy thing to be doing, and is creating technical debt. The analogy I like is that of using a hammer instead of a screwdriver - it might technically work, but you wouldn't do it because it's messy and only a halfway solution. You'd go buy a screwdriver instead. They're cheap and readily available (just like XML::Twig and xmlstarlet are).
    – Sobrique
    Commented May 13, 2016 at 12:58
1

Using XMLStarlet:

#!/bin/sh
# Reads XML on standard input; $1 is the substring to look for in testname.
xml ed -u "//ThreadGroup[contains(@testname, '$1')]/@enabled" -v "true" \
       -u "//ThreadGroup[not(contains(@testname, '$1'))]/@enabled" -v "false"

Assuming your XML is valid (I added a <SomeTag> root tag, and properly delimited the empty <ThreadGroup> node with />. I also set the enabled attributes to "hello" so that the script actually does something):

$ cat data.xml
<SomeTag>
  <ThreadGroup guiclass="ThreadGroupGui" testclass="ThreadGroup" testname="OO CSS DPM PRI" enabled="hello"/>
  <ThreadGroup guiclass="ThreadGroupGui" testclass="ThreadGroup" testname="AA CSS DPM PRI" enabled="hello"/>
</SomeTag>

$ sh script.sh "OO" <data.xml
<?xml version="1.0"?>
<SomeTag>
  <ThreadGroup guiclass="ThreadGroupGui" testclass="ThreadGroup" testname="OO CSS DPM PRI" enabled="true"/>
  <ThreadGroup guiclass="ThreadGroupGui" testclass="ThreadGroup" testname="AA CSS DPM PRI" enabled="false"/>
</SomeTag>
    0

    If I understood the logic correctly, this sed command searches for the given $1 parameter inside of the testname value; if it's present, then search and replace the enabled value from false to true. If it's not (!) present, then replace the enabled value from true to false.

    sed '/ testname="[^"]*'"$1"'[^"]*"/ s/ enabled="false"/ enabled="true"/; / testname="[^"]*'"$1"'[^"]*"/!s/ enabled="true"/ enabled="false"/' input > output 

    I tried to help the regex matching by giving leading spaces before the attribute names (both testname and enabled), and by using the [^"] character class.
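To make the limits of a line-oriented approach concrete, here is a rough Python imitation of the two substitutions above. It is a sketch, not a port: the helper name sed_like is invented for this example, and it escapes the search string with re.escape, whereas the sed command treats $1 as a regex. The interesting case is two tags sharing one input line.

```python
import re


def sed_like(line, needle):
    """Roughly mimic the two sed substitutions on a single input line.

    sed_like is a hypothetical name. Unlike the sed command, the needle is
    escaped here so that it is matched literally.
    """
    pattern = r' testname="[^"]*' + re.escape(needle) + r'[^"]*"'
    if re.search(pattern, line):
        # Like s/ enabled="false"/ enabled="true"/ without the g flag:
        # only the first occurrence on the line is replaced.
        return re.sub(r' enabled="false"', ' enabled="true"', line, count=1)
    # Like the negated address: the line has no testname containing needle.
    return re.sub(r' enabled="true"', ' enabled="false"', line, count=1)


# One tag per line behaves as intended.
print(sed_like('<ThreadGroup testname="OO CSS DPM PRI" enabled="true">', "AA"))

# Two tags on the same line: the line matches because of the AA tag,
# so the OO tag is never switched to false.
two_tags_one_line = (
    '<ThreadGroup testname="OO CSS DPM PRI" enabled="true"/>'
    '<ThreadGroup testname="AA CSS DPM PRI" enabled="true"/>'
)
print(sed_like(two_tags_one_line, "AA"))
```

The second call returns the line unchanged, leaving the OO group enabled, because the decision is made once per line rather than once per tag.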

    7
    • 2
      note that sed is line-based, and trying to parse HTML/XML in a line-based fashion is doomed -- test this on your actual input before relying on it!
      – Jeff Schaller
      Commented May 11, 2016 at 14:31
    • I will try your solution and I agree that this approach is not the best. If I could, I would use another tool using Xpath or something
      – buzz2buzz
      Commented May 12, 2016 at 5:37
    • @buzz2buzz This is about as good as it gets with using sed to edit XML but it assumes that the enabled attribute doesn't exist in any earlier tag on the same line, and/or that the testname attribute doesn't exist in another tag anywhere on the same line, and/or that there is only one XML tag per input line. It wouldn't be hard to think of a few more common scenarios where the simple approach of using regexps to edit XML can easily fail.....and the more of those cases you try to cover, the more complicated and fragile (and less readable and modifiable) the sed script becomes.
      – cas
      Commented May 12, 2016 at 6:02
    • I see. Don't worry if the job fails due to this kind of error, I'll push for the appropriate tool
      – buzz2buzz
      Commented May 12, 2016 at 6:24
    • There are lot of reasons why this is brittle. That's not a fault of this answer, but because regular expressions simply cannot handle XML correctly.
      – Sobrique
      Commented May 13, 2016 at 12:30
    0

    Given a reasonably recent GNU awk (gawk), and assuming each testname/enabled pair sits on its own line, separate from any other pair (which is not a given for XML in general):

     awk 'match($0,/ testname="([^"]+)"/,a) {sub(/ enabled="[^"]+"/, " enabled=\"" (a[1]~/AA/?"true":"false") "\"")} 1' <input 

    Explanation:

    match($0,/ testname="([^"]+)"/,a) returns nonzero if input line contains a substring like testname="abc" AND as a side effect puts the match and (single) submatch in array a.

    {sub(/ enabled="[^"]+"/, " enabled=\"" (a[1]~/AA/?"true":"false") "\"")} executes if the match returned nonzero and looks for a substring like enabled="def" and if found replaces it with a string in the form enabled="ghi" where ghi is true or false depending on whether the submatch from the match (which is the value of testname) itself matches the regular expression AA which for ordinary characters does so if those characters occur as a substring.

    If the string you are looking for contains any regexp special character(s) such as /.?*+[](){}\^$| but you want them matched as actual characters NOT as a regexp, you must either backslash them or (probably easier) instead use index(a[1],"AA") which 'succeeds' if AA matches as an exact substring (not regexp) -- but any backslash or doublequote in the string literal must still be backslashed.

    1 matches and (by default) prints each line, after the change above if any.
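The regexp-versus-exact-substring distinction is not specific to awk. As a small illustration in Python (the helper names below are made up for this sketch), the search string "A.A" matches a name containing "AXA" when treated as a regex, but not when treated literally:

```python
import re


def matches_as_regex(needle, name):
    """Treat needle as a regular expression (like awk's ~ operator)."""
    return re.search(needle, name) is not None


def matches_literally(needle, name):
    """Treat needle as an exact substring (like awk's index() != 0)."""
    return needle in name


testname = "AXA CSS DPM PRI"

# As a regex, "." matches any character, so "A.A" matches "AXA".
print(matches_as_regex("A.A", testname))
# As a literal substring, "A.A" does not occur in the name.
print(matches_literally("A.A", testname))
# re.escape() backslashes the special characters, giving a literal regex match.
print(matches_as_regex(re.escape("A.A"), testname))
```

If the search string comes from a script parameter, the literal test (or an escaped regex) is usually the safer default.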

    If you don't have gawk but do have perl it can do the same thing with slightly different syntax:

     perl -ne 'if(/ testname="([^"]+)"/){ $x=$1=~/AA/?"true":"false"; s/ enabled="[^"]+"/ enabled="$x"/ };print' input 

    plus it can replace the original file with -i, but I rarely see systems with perl but not gawk.

    PS: your posted XML consists only of opening tags, so if repeated it will be indefinitely nested, which is probably not what you want, and if there aren't unshown closing tags it isn't even well-formed.

      0
      #!/bin/sh
      sed -r '
      h
      s/.*testname="([^"]*)".*/\1/
      /\b'"$1"'\b/{
      g
      s/enabled="[^"]*"/enabled="true"/
      b
      }
      g
      s/enabled="[^"]*"/enabled="false"/
      ' file
        -2

        Sorry, try this:

        cat xml | sed 's/\(testname\=\".*AA.*\"\s\)enabled="\(true\|false\)"/\1enabled=\"true\"/gi' 
        2
        • 1
          This works only if it's true and then goes false. I need the other way too, sorry
          – buzz2buzz
          Commented May 11, 2016 at 12:22
        • try this, I've tested it and it works fine
          Commented May 11, 2016 at 13:26
