7

I have an XML file that contains the <w:rPr> element several times.

It is used like this:

 <w:rPr> TO REMOVE </w:rPr> 

However, the content between the tags varies. Is there a way to use sed or something else to delete everything between <w:rPr> and </w:rPr>, along with both tags?

The relevant namespace:

xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" 

And the file itself (formatted, valid XML):

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<root xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:lvl w:ilvl="0">
    <w:rPr> TO REMOVE </w:rPr>
    <w:rPx>
      <w:rFonts w:ascii="Symbol" w:hAnsi="Symbol" w:hint="default"/>
    </w:rPx>
  </w:lvl>
</root>
1
  • can't be done. Too many characters.
    – Felix
    Commented Jan 8, 2020 at 15:50

1 Answer

15

TL;DR

Please, never ever use sed for this task!

Every time you use sed on HTML or XML, you kill a kitty.

It's a task for xmlstarlet (a proper XML parser) and its friend XPath, like this:

xmlstarlet ed \
    -L \
    -N w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" \
    -d '//w:rPr' file.xml

A bit of explanation:

  • -L edits the file in place, like sed -i
  • -N sets the XML namespace, if needed
  • -d removes the nodes matching the XPath expression

Check xmlstarlet edit --help
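
If xmlstarlet is not available, the same deletion can be sketched with Python's standard library. This is only an illustration and not part of the commands above: the function name remove_elements is made up, and because xml.etree.ElementTree keeps no parent pointers, the sketch visits every element and drops its matching children.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def remove_elements(xml_text: str, local_name: str = "rPr", ns: str = W_NS) -> str:
    """Return xml_text with every {ns}local_name element and its content removed.

    Hypothetical helper for illustration; it mirrors `-d '//w:rPr'`.
    """
    # Keep the "w" prefix on output instead of an auto-generated one.
    ET.register_namespace("w", ns)
    root = ET.fromstring(xml_text)
    target = f"{{{ns}}}{local_name}"
    # ElementTree has no parent pointers, so visit each potential parent
    # and remove its matching direct children.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == target:
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = """<root xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:lvl w:ilvl="0">
    <w:rPr> TO REMOVE </w:rPr>
    <w:rPx><w:rFonts w:ascii="Symbol"/></w:rPx>
  </w:lvl>
</root>"""
    print(remove_elements(sample))
```

One caveat of this sketch: ElementTree stores the text that follows an element in that element's tail, so removing the element also drops that text. In the sample file the tail is only whitespace.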

Using BaseX:

Pure XQuery solution:

$ cat XQuery
declare namespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
copy $input := doc("/dev/stdin")
modify delete node $input//w:rPr
return $input
$ basex XQuery < file.xml

Using XQuery and xidel:

With limited XQuery capabilities.

xidel --xml --xquery '
    declare namespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    x:replace-nodes(//w:rPr, ())
' file.xml

Theory:

According to compiler theory, XML/HTML can't be parsed with regexes based on a finite-state machine. Because XML/HTML is hierarchical, you need a pushdown automaton and an LALR grammar handled by a tool like YACC.
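
A small Python sketch of the problem, under the assumption of a made-up input with one element nested inside another of the same name (the file in the question has no such nesting). The function names are hypothetical. The non-greedy regex stops at the first closing tag it finds, while the parser removes the whole subtree.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical input: an <a> element nested inside another <a>.
NESTED = "<r><a>outer <a>inner</a> tail</a><b>keep</b></r>"


def strip_with_regex(text: str) -> str:
    """Naive approach: delete from each <a> to the nearest </a>."""
    return re.sub(r"<a>.*?</a>", "", text, flags=re.DOTALL)


def strip_with_parser(text: str) -> str:
    """Parse the tree, then remove every <a> element from its parent."""
    root = ET.fromstring(text)
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "a":
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")


def is_well_formed(text: str) -> bool:
    """Report whether the text parses as XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(strip_with_regex(NESTED), is_well_formed(strip_with_regex(NESTED)))
    print(strip_with_parser(NESTED), is_well_formed(strip_with_parser(NESTED)))
```

On this input the regex leaves a stray </a> behind, so its output no longer parses, whereas the parser-based version stays well-formed.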

realLife©®™ everyday tools in a shell:

You can use one of the following:

  • xmllint: often installed by default with libxml2, XPath 1
  • xmlstarlet: can edit, select, transform... Not installed by default, XPath 1
  • xpath: installed via Perl's module XML::XPath, XPath 1
  • BaseX: not installed by default, package basex, full XQuery 3.1
  • xidel: XPath 3, partial XQuery 3 (no Update)
  • saxon-lint: my own project, a wrapper over @Michael Kay's Saxon-HE Java library, XPath 3

Or you can use a high-level language with a proper library. I'm thinking of:

Python's lxml (from lxml import etree)

Perl's XML::LibXML, XML::XPath, XML::Twig::XPath, HTML::TreeBuilder::XPath

, check this example

PHP's DOMXPath, check this example


Check: Using regular expressions with HTML tags


4
  • Would it help to provide you with the complete file? Tbh I don't know anything about xml. Hence my thought of using sed.
    – Felix
    Commented Jan 8, 2020 at 15:39
  • Sure it will help! :)
    Commented Jan 8, 2020 at 15:40
  • 1
    Schroedinger's kitty?
    – bu5hman
    Commented Jan 8, 2020 at 17:12
  • 1
    Thank you so much! This has been helpful and saved me!!
    – Felix
    Commented Jan 8, 2020 at 20:27
