Consider my humble hello.html file, edited with mighty ed:

    $ ed hello.html
    28
    ,p
    <title>Hello world!</title>

What's your general approach to edit inside that title HTML tag (bonus if you can edit inside any HTML tag)?

I tried a regular expression that matches inside the tag:

    s/>.*/>My new title/p
    <title>My new title
    u
    .
    <title>Hello world!</title>

But, sadly, you can see that I chopped my tag (and it would be way too much work to type out that </title> bit every time!).

For further education, I browsed through Software Tools in Pascal to page 174 (see https://archive.org/details/softwaretoolsinp00kern/page/174/mode/1up?view=theater) and discovered the & special character, which helpfully reaches the middle of the sentence:

    s/world/& again/p
    <title>Hello world again!</title>

But, that's not quite right, since I want to substitute the middle, not just reach the middle.

  • For a quick change, I would do something like s/>[^>]*</>My new title</, but you'd have to provide some representative input and output for us to say what might work.
    – muru
    Commented Feb 6, 2024 at 6:29

3 Answers

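You can use [^<] instead of . so that the pattern matches any character other than < rather than any character at all. The match then stops at the next < instead of running to the end of the line.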

    28
    ed> ,n
    1	<title>Hello world!</title>
    ed> s/>[^<]*/>new title/
    ed> ,n
    1	<title>new title</title>
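The difference between the two patterns can be checked non-interactively. The sketch below uses sed rather than ed, on the assumption that sed is available; both take the same s/regex/replacement/ form with basic regular expressions. The variable names are only for illustration.

```shell
line='<title>Hello world!</title>'

# .* is greedy: it runs to the end of the line and swallows </title>.
greedy=$(printf '%s\n' "$line" | sed 's/>.*/>new title/')

# [^<]* stops at the next <, so the closing tag survives.
bounded=$(printf '%s\n' "$line" | sed 's/>[^<]*/>new title/')

printf '%s\n' "$greedy" "$bounded"
```

The first result is the chopped tag from the question; the second keeps </title> intact.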

Another approach could be to insert newlines before and after each < or > so the thing you want to change is on its own line which you can change with c:

    28
    ed> ,n
    1	<title>Hello world!</title>
    ed> s/[<>]/\
    &\
    /g
    ed> ,n
    1	
    2	<
    3	title
    4	>
    5	Hello world!
    6	<
    7	/title
    8	>
    9	
    ed> 5c
    new title
    .
    ed> ,n
    1	
    2	<
    3	title
    4	>
    5	new title
    6	<
    7	/title
    8	>
    9	
    ed> 1,9j
    ed> ,n
    1	<title>new title</title>

A better way is to use an HTML-aware parser to edit the content. My preferred tool is xmlstarlet because, although it's an XML parser/editor, it can also handle HTML.

Create a sample page:

    cat >my.html <<'EOF'
    <html>
    <title>Hello world!</title>
    <body><p>Thank you for reading my page</p></body>
    </html>
    EOF

Replace Hello world! with Hello everyone!:

    xmlstarlet format --html my.html 2>/dev/null |
        xmlstarlet edit --omit-decl --update '//title' --value 'Hello everyone!'

    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
    <html>
      <head>
        <title>Hello everyone!</title>
      </head>
      <body>
        <p>Thank you for reading my page</p>
      </body>
    </html>

Output is written to stdout, and the usual approach here is to write it to a temporary file and then replace the original. This isn't perfect but it's probably acceptable:

    file=my.html
    (
        [ "${file#/}" = "$file" ] && file="./$file"
        xmlstarlet format --html "$file" 2>/dev/null |
            xmlstarlet edit --omit-decl --update '//title' --value 'Hello everyone!' >"$file.tmp" &&
            cp -p -- "$file" "$file.old" &&
            mv -f -- "$file.tmp" "$file"
    )

Note that if $file starts with - you will get errors from xmlstarlet and you cannot use -- to separate it from true options. What we do here is to check whether the file name is absolute, and if not then we prepend ./. You can omit the cp line if you do not need to save a copy of the original content.
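The absolute-path check can be tried on its own. This is a minimal sketch that wraps the same test in a helper; the function name safe_path and the sample file names are hypothetical, not part of the answer.

```shell
# Hypothetical helper: prefix relative names with ./ so a leading -
# can never be mistaken for an option.
safe_path() {
    file=$1
    # ${file#/} strips one leading slash; if nothing changed, the name is relative.
    [ "${file#/}" = "$file" ] && file="./$file"
    printf '%s\n' "$file"
}

dash=$(safe_path '-n.html')       # relative name starting with -
abs=$(safe_path '/tmp/my.html')   # absolute name, left alone
rel=$(safe_path 'my.html')        # ordinary relative name

printf '%s\n' "$dash" "$abs" "$rel"
```

Each call runs in a command substitution, so the assignment to file inside the function does not leak into the calling shell.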

  • xmlstarlet is great, with good documentation, but I find it hard to learn. Commented Feb 6, 2024 at 11:38
  • @reinierpost, there are a number of posts here on U&L and others on StackOverflow that show examples of its use. It is fiddly and the documentation isn't always as clear as it could be (I think it lacks real-world examples) but it can help particularly when you cannot guarantee the shape of the file structure. Commented Feb 6, 2024 at 11:46

You shouldn't use a regex to parse HTML. See https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

If you want to do it with ed, the command below will do it for the HTML tag you give it, though it might be better to use sed. It works because the s command accepts any delimiter character: it doesn't have to be s/old/new/; it can be s|old|new| or s!old!new!. With | as the delimiter, the / in </title> needs no escaping.

    $ ed hello.html
    28
    ,p
    <title>Hello world!</title>
    s|<title>.*</title>|<title>foo</title>|
    ,p
    <title>foo</title>

The / characters may be uniformly replaced by any other single character within any given s command. The / character (or whatever other character is used in its stead) can appear in the regexp or replacement only if it is preceded by a \ character.

From https://www.gnu.org/software/sed/manual/html_node/The-_0022s_0022-Command.html
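To see the delimiter rule in action, here is a small sketch using sed, assuming it is installed. The first two commands are the same substitution written with / and with | as the delimiter. The third is not from the answer: it is a variation that captures the opening and closing tags with \( \) and puts them back with \1 and \2, so the tag name need not be typed out.

```shell
line='<title>Hello world!</title>'

# With / as the delimiter, the / in </title> must be escaped as \/.
slash=$(printf '%s\n' "$line" | sed 's/<title>.*<\/title>/<title>foo<\/title>/')

# With | as the delimiter, no escaping is needed.
pipe=$(printf '%s\n' "$line" | sed 's|<title>.*</title>|<title>foo</title>|')

# Variation (an assumption, not the answer's method): keep whatever tags
# surround the text by capturing them and replacing only the middle.
anytag=$(printf '%s\n' '<h1 class="x">Old</h1>' |
    sed 's|\(<[^>]*>\)[^<]*\(</[^>]*>\)|\1foo\2|')

printf '%s\n' "$slash" "$pipe" "$anytag"
```

Like the other regular expressions in this thread, the variation only handles a simple element on a single line with no nested tags.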
