I'm looking for a way to edit HTML files from the command line similar to sed or awk, but using path expressions similar to jq or pup. In particular, newlines, white space and other formatting details shouldn't matter.
So I'd like to say something like "delete everything between <body> and the first <p> tag following it, and replace it with this text" or "replace every <b>...</b> with <p font-style=italic>...</p>, keeping the text in between". The rest of the document should remain unchanged.
A library for, say, Perl, Python or Haskell that lets me do this in a few lines would also be fine (but I'd prefer a command-line tool).
Background: I want to use this to clean up lots of epub files with awkward formatting, bad language tags etc.
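As a rough illustration of the second request ("replace every <b>...</b> with an italic <p>"), here is a minimal Python sketch using only the standard library's xml.etree.ElementTree. It assumes the input is already well-formed XML, as the XHTML inside an epub usually is, and that the elements carry no namespace prefix. The function name bold_to_italic_paragraphs is made up for this example, and the style attribute is my assumption about what <p font-style=italic> was meant to express.

```python
import xml.etree.ElementTree as ET


def bold_to_italic_paragraphs(xhtml: str) -> str:
    """Rename every <b> element to <p style="font-style: italic">.

    Hypothetical helper: assumes well-formed, un-namespaced XML input.
    """
    root = ET.fromstring(xhtml)
    # Collect matches first so the tree is not modified while it is being walked.
    for elem in list(root.iter("b")):
        elem.tag = "p"                            # children and text are kept
        elem.set("style", "font-style: italic")   # assumed intent of the question
    return ET.tostring(root, encoding="unicode")


sample = "<html><body><h1>Title</h1>\n<b>one</b> and <b>two</b></body></html>"
result = bold_to_italic_paragraphs(sample)
print(result)
```

Because the parser works on the element tree, line breaks and indentation in the input do not affect matching. The output is re-serialized, though, so it is equivalent rather than byte-identical: ElementTree drops the DOCTYPE and comments by default, and named HTML entities such as &nbsp; make it fail unless they are declared. For namespaced XHTML the tag would have to be written as "{http://www.w3.org/1999/xhtml}b".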
tidy to format HTML to one tag per line, and perl to process the tags. There are a few applications which do a more sophisticated job, but the majority have technical problems glossed over by their developers. Someone is certain to recommend one of those.
jq or pup can't do it, or I don't know how: not by redirecting output to a new file, and not in any other way I can think of. Testing xslt output by first sending it to a new file, and then overwriting the old file is trivial, but that wasn't the question.
xmlstarlet to edit HTML files that are not strict XML. Let's call your file index.html; the command invocation to generate strict XML from HTML is then xmlstarlet fo -H index.html 2>/dev/null. However, the rest of your question is hard in general, but specific requirements should be possible. See unix.stackexchange.com/a/645582/100397 for one example.
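One such specific requirement, "delete everything between <body> and the first <p> and replace it with some text", could be sketched in Python once the file is strict XML (for instance the output of the xmlstarlet fo -H invocation mentioned above). This is only a sketch under assumptions: the function name replace_body_lead is hypothetical, "first <p>" is taken to mean the first <p> that is a direct child of <body>, and the converted document is assumed to use plain html/body/p tags without a namespace.

```python
import xml.etree.ElementTree as ET


def replace_body_lead(xml_text: str, new_text: str) -> str:
    """Replace everything between <body> and its first <p> child with new_text.

    Hypothetical helper: assumes strict, un-namespaced XML with <html> as root.
    """
    root = ET.fromstring(xml_text)
    body = root if root.tag == "body" else root.find("body")
    if body is None:
        return xml_text  # no <body>: leave the document alone
    children = list(body)
    first_p = next((i for i, c in enumerate(children) if c.tag == "p"), None)
    if first_p is None:
        return xml_text  # no direct <p> child: leave the document alone
    # Removing an element also removes the text that trails it (its "tail").
    for child in children[:first_p]:
        body.remove(child)
    # body.text is the text between <body> and its first remaining child.
    body.text = new_text
    return ET.tostring(root, encoding="unicode")


page = (
    "<html><head><title>t</title></head><body>\n"
    "<div>junk</div> stray <h1>x</h1>\n"
    "<p>first</p><p>second</p></body></html>"
)
cleaned = replace_body_lead(page, "Intro")
print(cleaned)
```

To change a file in place, the result would still have to be written to a temporary file and then moved over the original, as described in the comment about redirecting output.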