4

I want to remove all Placemarks from a KML file that contain the element <tessellate>. The following block should be wholly removed:

<Placemark> <styleUrl>#m_ylw-pushpin330</styleUrl> <LineString> <tessellate>1</tessellate> <coordinates> 0.0000000000000,0.0000000000000,0 0.0000000000000,0.0000000000000,0 </coordinates> </LineString> </Placemark> 

I have tried some non-greedy perl regex with no luck (a lot of stuff is removed together with the first <Placemark>):

sed -r ':a; N; $!ba; s/\n\t*//g' myplaces.kml | perl -pe 's|<Placemark>.*?<tessellate>.*?</Placemark>||g' 

I believe an XML parser is the way to go, but I read the documentation for xmlstarlet and got nowhere. So any solutions in xmlstarlet, python, etc. are also welcome!

2

3 Answers

8

With xmlstarlet:

xmlstarlet ed -d '//Placemark[.//tessellate]' < myplaces.kml 

Because KML declares a default namespace, you have to bind that namespace to a prefix first (see the xmlstarlet documentation):

xmlstarlet ed -N 'ns=http://www.opengis.net/kml/2.2' -d '//ns:Placemark[.//ns:tessellate]' 
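To see why the prefix matters, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. It is only an illustration and not part of the xmlstarlet answer; the sample document and the prefix `k` are made up for the demonstration. An unqualified path finds nothing once a default namespace is declared, while the qualified path matches.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"

# Hypothetical sample: a tiny KML-like document with a default namespace.
SAMPLE = (
    '<kml xmlns="%s"><Document>'
    '<Placemark><LineString><tessellate>1</tessellate></LineString></Placemark>'
    '</Document></kml>' % KML_NS
)

root = ET.fromstring(SAMPLE)

# Without a namespace the path looks for elements in no namespace at all.
unqualified = root.findall(".//Placemark")

# Binding the namespace URI to a prefix (here "k") makes the path match.
qualified = root.findall(".//k:Placemark", {"k": KML_NS})

print(len(unqualified), len(qualified))
```

The same reasoning applies to the `-N 'ns=...'` option above: the prefix name is arbitrary, only the namespace URI has to match the one declared in the file.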

With perl, you'd need to process the file as a whole (not line by line) and add the s flag to s///. Even then, a non-greedy match would still run from the first <Placemark> up to the first </Placemark> that occurs after the next <tessellate>, swallowing any Placemarks in between. So you'd need to write something like:

perl -0777 -pe 's|(<Placemark>.*?</Placemark>)| $1 =~ /<tessellate>/?"":$1|gse' 
1
  • Using xmlstarlet is the best answer; it works like a charm on complex XML as well as in cases where the selection needs to be based on an attribute value. Also, if you are not able to install xmlstarlet using yum etc., see this link -- pkgs.org/download/xmlstarlet. I was able to download the Linux package and run it as a standalone utility, without needing sudo/root access to install new packages.
    – Ccy
    Commented Oct 31, 2019 at 17:57
4

Given this test file:

start <Placemark> <tessellate>1</tessellate> </Placemark> middle1 <Placemark> </Placemark> middle2 <Placemark> <tessellate>1</tessellate> </Placemark> end 

If you run perl -0 -pe 's|<Placemark>.*?<tessellate>.*?</Placemark>||gs' as you suggested, it will remove too much:

start middle1 end 

This is because the regex only looks forward. It finds a start tag, takes everything up to the first tessellate tag, and then continues to the next end tag. Unfortunately, it does not care if it consumes more start tags along the way...

If you want to do it with regexes you have to process each block on its own: perl -0 -pe 's|<Placemark>.*?</Placemark>|$&=~/<tessellate>/?"":$&|gse'

This should give the desired result.
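The difference between the two substitutions can be reproduced in Python with the re module. This is a rough sketch rather than a translation the answer itself provides; it reuses the test file above as a string, and the variable names are only illustrative.

```python
import re

# The test file from above, as a single string.
text = (
    "start <Placemark> <tessellate>1</tessellate> </Placemark> "
    "middle1 <Placemark> </Placemark> "
    "middle2 <Placemark> <tessellate>1</tessellate> </Placemark> end"
)

# Naive pattern: a match may start at one Placemark and end in a later one.
naive = re.sub(r"<Placemark>.*?<tessellate>.*?</Placemark>", "", text, flags=re.S)

# Per-block pattern: match each Placemark on its own, then decide in a
# callback whether to drop it or keep it unchanged.
def drop_if_tessellated(match):
    block = match.group(0)
    return "" if "<tessellate>" in block else block

per_block = re.sub(r"<Placemark>.*?</Placemark>", drop_if_tessellated, text, flags=re.S)

print(naive.split())
print(per_block.split())
```

The callback plays the role of the /e flag in the perl one-liner: the replacement is computed from each matched block instead of being a fixed string.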

1
  • Just adding the desired result output: start middle1 <Placemark> </Placemark> middle2 end
    Commented Jan 27, 2016 at 6:36
4

Using Python (2.7) with standard modules:

file test.xml:

<Container> <Placemark> <KeepMe/> </Placemark> <Placemark> <styleUrl>#m_ylw-pushpin330</styleUrl> <LineString> <tessellate>1</tessellate> <coordinates> 0.0000000000000,0.0000000000000,0 0.0000000000000,0.0000000000000,0 </coordinates> </LineString> </Placemark> </Container> 

And the program:

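#!/usr/bin/env python
from __future__ import print_function  # works on 2.x and 3.x
from lxml import etree

file_name = 'test.xml'
root = etree.parse(file_name)
# Collect the matches first so that removing elements does not disturb the iteration.
for element in list(root.iterfind('.//Placemark')):
    if element.find('.//tessellate') is not None:
        element.getparent().remove(element)
# tostring() returns bytes; decode so Python 3 prints text rather than b'...'.
print(etree.tostring(root).decode())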

gives as output:

<Container> <Placemark> <KeepMe/> </Placemark> </Container> 
1
  • You mentioned standard modules, but lxml is not standard. Did you mean ElementTree?
    – iruvar
    Commented Dec 12, 2014 at 20:46
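For readers who want to stay with the standard library, as the comment suggests, the following is a minimal sketch using xml.etree.ElementTree instead of lxml. It is an assumption about what a stdlib-only version could look like, not the answer author's code. ElementTree elements have no getparent(), so the sketch walks every element as a potential parent; the function name `drop_tessellated` and its `ns` parameter are hypothetical, and the sample is a shortened version of test.xml.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical version of test.xml from the answer above.
SAMPLE = """<Container>
  <Placemark><KeepMe/></Placemark>
  <Placemark><LineString><tessellate>1</tessellate></LineString></Placemark>
</Container>"""

def drop_tessellated(root, ns=""):
    """Remove every Placemark that has a tessellate descendant.

    ns is an optional namespace URI (for real KML files, pass
    "http://www.opengis.net/kml/2.2"); empty means no namespace.
    """
    prefix = "{%s}" % ns if ns else ""
    placemark_tag = prefix + "Placemark"
    tessellate_path = ".//" + prefix + "tessellate"
    # Snapshot both levels with list() so removal is safe while looping.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == placemark_tag and child.find(tessellate_path) is not None:
                parent.remove(child)
    return root

root = drop_tessellated(ET.fromstring(SAMPLE))
print(ET.tostring(root, encoding="unicode"))
```

For a file on disk, ET.parse(path).getroot() would supply the root element in place of ET.fromstring(SAMPLE).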
