How to remove nodes from XML file as command line with namespace?

Question

I have an XML file that contains the tag <w:rPr> several times.

It is used like this

 <w:rPr> TO REMOVE </w:rPr>

However, the content between the tags sometimes differs. Is there a way to use sed or some other tool to delete everything between <w:rPr> and </w:rPr>, along with both tags?

The relevant namespace

xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"

And the file itself (formatted, valid XML)

    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <root xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
      <w:lvl w:ilvl="0">
        <w:rPr> TO REMOVE </w:rPr>
        <w:rPx>
          <w:rFonts w:ascii="Symbol" w:hAnsi="Symbol" w:hint="default"/>
        </w:rPx>
      </w:lvl>
    </root>

Can't be done. Too many characters.
– Felix, commented Jan 8, 2020 at 15:50

Gilles Quénot · Accepted Answer · 2023-04-24 02:07:52Z

TL;DR

Please, never ever use sed for this task!

Every time you use sed on HTML or XML, you kill a kitty.

It's a task for xmlstarlet (a proper XML parser) and its friend XPath, like this:

    xmlstarlet ed \
        -L \
        -N w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" \
        -d '//w:rPr' file.xml

A bit of explanation:

-L: edit the file in place, like sed -i
-N: bind a namespace prefix to its URI, if needed
-d: delete the nodes matching the XPath expression

Check xmlstarlet edit --help
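To see why the -N binding matters, here is a minimal Python sketch using the standard library's xml.etree.ElementTree (not xmlstarlet itself). It assumes a made-up one-line document that binds the same namespace URI to the prefix x instead of w. The prefix in the query is only a local alias; the match happens on the URI.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical input: same namespace URI, but bound to the prefix "x" instead of "w".
doc = f'<root xmlns:x="{W}"><x:lvl><x:rPr>TO REMOVE</x:rPr><x:rPx/></x:lvl></root>'
root = ET.fromstring(doc)

# The prefix "w" is our own choice here; only the URI has to match the document.
ns = {"w": W}
matches = root.findall(".//w:rPr", ns)

print(len(matches))    # number of elements found
print(matches[0].tag)  # the parser's internal name, in {URI}localname form
```

The query //w:rPr without any namespace binding has nothing to resolve the w prefix against, which is the gap -N fills on the xmlstarlet command line.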

Using `basex`

Pure XQuery solution:

    $ cat XQuery
    declare namespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    copy $input := doc("/dev/stdin")
    modify delete node $input//w:rPr
    return $input
    $ basex XQuery < file.xml

Using `XQuery` and `xidel`:

With limited XQuery capabilities.

    xidel --xml --xquery '
        declare namespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
        x:replace-nodes(//w:rPr, ())
    ' file.xml

Theory:

According to compiler theory, XML/HTML can't be parsed with regular expressions, which are based on finite state machines. Because of the hierarchical structure of XML/HTML, you need a pushdown automaton and have to manipulate an LALR grammar using a tool like YACC.
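A small Python sketch of what goes wrong in practice, assuming a made-up input in which a b element is nested inside another b. The regex is the kind of pattern a sed one-liner would use; the parser side uses the standard library's xml.etree.ElementTree.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical input: a <b> element nested inside another <b>.
xml = "<a><b>outer<b>inner</b>tail</b><c/></a>"

# The non-greedy regex stops at the FIRST closing tag, so it cuts the
# outer element short and leaves a dangling </b> behind.
by_regex = re.sub(r"<b>.*?</b>", "", xml)
print(by_regex)  # <a>tail</b><c/></a>

# A parser tracks nesting, so it knows where each element really ends.
root = ET.fromstring(xml)
for b in root.findall("b"):  # direct <b> children of <a>
    root.remove(b)           # removes the whole subtree, nested <b> included
by_parser = ET.tostring(root, encoding="unicode")
print(by_parser)
```

The regex result is no longer well-formed XML, while the parser result still is. A finite state machine cannot count how many tags are currently open.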

realLife©®™ everyday tool in a shell:

You can use one of the following:

xmllint: often installed by default with libxml2, XPath 1
xmlstarlet: can edit, select, transform... Not installed by default, XPath 1
xpath: installed via Perl's module XML::XPath, XPath 1
BaseX: not installed by default, package basex, full XQuery 3.1
xidel: XPath 3, partial XQuery 3 (no Update)
saxon-lint: my own project, a wrapper over Michael Kay's Saxon-HE Java library, XPath 3

Or you can use high-level languages and proper libraries. I'm thinking of:

Python's lxml (from lxml import etree)

Perl's XML::LibXML, XML::XPath, XML::Twig::XPath, HTML::TreeBuilder::XPath

Ruby's nokogiri

PHP's DOMXPath
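As a rough illustration of the high-level-language route, here is a minimal Python sketch for the file in the question. It uses the standard library's xml.etree.ElementTree rather than lxml, so that it runs without extra packages. The helper name remove_rpr and the inline sample string are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
ET.register_namespace("w", W)  # keep the "w:" prefix when serializing


def remove_rpr(root):
    """Delete every w:rPr element below root and return how many were removed."""
    removed = 0
    # ".." selects the parent, which ElementTree needs in order to remove a child.
    for parent in root.findall(".//w:rPr/..", NS):
        for child in parent.findall("w:rPr", NS):
            parent.remove(child)
            removed += 1
    return removed


# Sample data copied from the question (XML declaration omitted).
sample = """<root xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:lvl w:ilvl="0">
    <w:rPr> TO REMOVE </w:rPr>
    <w:rPx>
      <w:rFonts w:ascii="Symbol" w:hAnsi="Symbol" w:hint="default"/>
    </w:rPx>
  </w:lvl>
</root>"""

root = ET.fromstring(sample)
count = remove_rpr(root)
result = ET.tostring(root, encoding="unicode")
print(count)
print(result)
```

For a file on disk, ET.parse("file.xml") followed by tree.write("file.xml", encoding="UTF-8", xml_declaration=True) should work the same way. Note that ElementTree writes its own XML declaration, so details such as standalone="yes" are not carried over.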

Check: Using regular expressions with HTML tags

Would it help to provide you with the complete file? Tbh I don't know anything about XML. Hence my thought of using sed.
– Felix, commented Jan 8, 2020 at 15:39
