
I'm looking for a way to edit HTML files from the command line similar to sed or awk, but using path expressions similar to jq or pup. In particular, newlines, white space and other formatting details shouldn't matter.

So I'd like to say something like "delete everything between <body> and the first <p> tag following it, and replace it with this text" or "replace every <b>...</b> with <p font-style=italic>...</p>, keeping the text in between". The rest of the document should remain unchanged.

A library for, say, Perl, Python or Haskell where I can do that easily with a few lines would also be fine (but I'd prefer a command-line tool).

Background: I want to use this to clean up lots of epub files with awkward formatting, bad language tags etc.

  • I use tidy to format HTML to one tag per line, and perl to process the tags. There are a few applications which do a more sophisticated job, but the majority have technical problems glossed over by their developers. Someone is certain to recommend one of those. (Mar 7, 2017 at 9:11)
  • Cleaning up formatting is one thing, and can be done from the command line with utilities like js-beautify (which has a Python script in its repo), or pandoc can do this. But to replace elements in a way which can handle unusual blank characters really needs a full HTML parser. I don't know of any way that you could limit the commands to one-line shell statements either; you would need to write a script. (Mar 7, 2017 at 9:42)
  • Redirect output to a new file; if all is OK, then copy on top of the original.
    – X Tian, Mar 7, 2017 at 17:05
  • Yes, xslt looks like it can do that, I'm still looking into it. My comments referred to "why don't you use jq or pup?" jq or pup can't do it, or I don't know how: not by redirecting output to a new file, and not in any other way I can think of. Testing xslt output by first sending it to a new file, and then overwriting the old file, is trivial, but that wasn't the question.
    – dirkt, Mar 7, 2017 at 17:39
  • Here in 2022 you can use xmlstarlet to edit HTML files that are not strict XML. Let's call your file index.html; the command invocation to generate strict XML from HTML is then xmlstarlet fo -H index.html 2>/dev/null. However, the rest of your question is hard in generality, but specific requirements should be possible. See unix.stackexchange.com/a/645582/100397 for one example. (Apr 12, 2022 at 13:12)

1 Answer


I don't know of anything that would do what you want, and it would be a lot of work to build something. For starters you'd have to build a compiler, using yacc or some such, to parse your commands and then pass them on to other code to actually do the transformations.

XSLT might work, but I doubt it. It's sitting on top of XML, and HTML is too irregular a markup language to fit inside that rigid syntax, especially if you start dumping CSS on top of it.

I'd go with the Perl HTML::Parser library (or maybe one of its friends in the HTML module tree, if they have a specialized tool for a common task of yours). It parses HTML documents into a little in-memory tree, and then you can manipulate that and dump it back out. I use it all the time to do things such as: getting rid of all iframe tags and their contents; getting rid of all HTML tags while printing out something close to the intended formatting in plain text; and building really complex screen-scraper engines.
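For instance, here is a minimal sketch of the first of those tasks (stripping every iframe and its contents), assuming HTML::TreeBuilder from the HTML-Tree distribution, which builds its element tree on top of HTML::Parser; the file name in.html is a placeholder:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TreeBuilder;   # tree-building front end to HTML::Parser

    # Parse the whole document into an element tree.
    my $tree = HTML::TreeBuilder->new_from_file('in.html');

    # Detach and destroy every <iframe> element together with its contents.
    $_->delete for $tree->look_down(_tag => 'iframe');

    # Serialise the modified tree back to HTML: default entities, two-space
    # indent, and no optional end tags omitted.
    print $tree->as_HTML('<>&', '  ', {});

    $tree->delete;   # free the tree's memory when done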

It's really simple to use and does all of the heavy lifting for you. See the CPAN page for examples. The distribution also comes with more examples to do things like strip out certain tags/elements and/or attributes.
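As a similar sketch of the asker's second example, turning every <b>...</b> into an italic element while keeping the text in between, again assuming HTML::TreeBuilder; the question's <p font-style=italic> is expressed here as a style attribute, and book.html is a placeholder name:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TreeBuilder;

    my $tree = HTML::TreeBuilder->new_from_file('book.html');

    # Rename every <b> element to <p> and mark it italic via a style attribute;
    # the children (text and any nested tags) are left untouched.
    for my $b ($tree->look_down(_tag => 'b')) {
        $b->tag('p');                               # change the tag name in place
        $b->attr('style', 'font-style: italic');    # add the attribute
    }

    print $tree->as_HTML('<>&', '  ', {});
    $tree->delete;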

Remember that back in the Stone Age Perl ruled the Web and was mostly concerned with slinging HTML around so the Perl Monks have been sharpening their HTML tools for decades now.
