TL;DR
Please, never ever use sed for this task!
Every time you use sed for HTML or XML, you kill a kitty.

Use xmlstarlet (a proper XML parser) and its friend XPath, like this:

    xmlstarlet ed \
        -L \
        -N w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" \
        -d '//w:rPr' file.xml

A bit of explanation:
- -L: edit the file in place, like sed -i
- -N: set the XML namespace, if needed
- -d: remove nodes matching the XPath expression
Check xmlstarlet edit --help
Pure XQuery solution:

    $ cat XQuery
    declare namespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    copy $input := doc("/dev/stdin")
    modify delete node $input//w:rPr
    return $input
    $ basex XQuery < file.xml

Using XQuery and xidel (with limited XQuery capabilities):

    xidel --xml --xquery '
        declare namespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
        x:replace-nodes(//w:rPr, ())
    ' file.xml

theory :
According to compiler theory, XML/HTML can't be parsed with regular expressions, which are based on finite state machines. Because XML/HTML is hierarchically nested, you need a pushdown automaton, for example a parser for an LALR grammar generated with a tool like YACC.
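To make the nesting argument concrete, here is a minimal Python sketch. It is an illustration, not part of the original answer: the sample string is made up, and it uses the standard library's `re` and `xml.etree.ElementTree`. A non-greedy regex stops at the first closing tag it meets, while a parser keeps track of depth.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample: a <b> element nested inside another <b>.
xml = "<a><b><b>inner</b>tail</b></a>"

# The regex has no notion of depth: it pairs the outer opening tag
# with the inner closing tag and cuts the outer element short.
match = re.search(r"<b>(.*?)</b>", xml)
regex_result = match.group(0)

# The parser tracks nesting, so the outer <b> comes back whole.
root = ET.fromstring(xml)
outer_b = root.find("b")
parser_result = ET.tostring(outer_b, encoding="unicode")

print(regex_result)
print(parser_result)
```

Making the regex greedy does not help either: it then over-matches as soon as two sibling elements appear in the same document.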
realLife©®™ everyday tool in a shell :
You can use one of the following :
- xmllint: often installed by default with libxml2, XPath1
- xmlstarlet: can edit, select, transform... Not installed by default, XPath1
- xpath: installed via Perl's module XML::XPath, XPath1
- BaseX: not installed by default, package basex, full XQuery 3.1
- xidel: XPath3, partial XQuery 3 (no Update)
- saxon-lint: my own project, wrapper over @Michael Kay's Saxon-HE Java library, XPath3
Or you can use high-level languages and proper libraries. I think of:
- python's lxml (from lxml import etree)
- perl's XML::LibXML, XML::XPath, XML::Twig::XPath, HTML::TreeBuilder::XPath
- ruby's nokogiri
- php's DOMXpath
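
For the high-level-language route, the same deletion of w:rPr nodes could look roughly like the sketch below. It is only an illustration under a few assumptions: it uses Python's standard-library `xml.etree.ElementTree` rather than lxml so that it runs without extra packages, the helper name `strip_rpr` and the sample document are hypothetical, and it assumes the removed elements have no trailing text to preserve.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def strip_rpr(xml_text: str) -> str:
    """Return xml_text with every w:rPr element removed (hypothetical helper)."""
    # Keep the conventional "w" prefix when serializing.
    ET.register_namespace("w", W)
    root = ET.fromstring(xml_text)
    # ElementTree elements have no parent pointer, so walk every
    # potential parent and drop matching children from it.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == f"{{{W}}}rPr":
                # Note: remove() also discards the child's tail text.
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")

# Made-up sample input, shaped like a tiny WordprocessingML document.
sample = (
    f'<w:document xmlns:w="{W}"><w:body><w:p><w:r>'
    '<w:rPr><w:b/></w:rPr><w:t>Hello</w:t>'
    '</w:r></w:p></w:body></w:document>'
)
cleaned = strip_rpr(sample)
print(cleaned)
```

With lxml the loop would likely be shorter, since its elements expose `getparent()` and full XPath with namespace maps; the standard library is used here only to keep the sketch dependency-free.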
Check: Using regular expressions with HTML tags
