How to extract multiple text lines within an XML file by using grep and/or sed

Question

I am trying to extract the text that sits between each <w:t> tag and its closing </w:t> tag, but I'm only getting the text within the last pair of tags and not the others. How can I do this?

This is the code I've been trying to use:

grep '<w:t>' word/document.xml | sed 's/.*<w:t>\(.*\)<\/w:t>.*/\1/' | cat > brev.txt

As you can see, I'm grepping the document.xml file within the word directory, finding the tag, and writing the result to a file named brev.txt, but it's not working entirely. How do I get all the lines and not just the last line with the tag?

The document.xml file is a one-line text file, if that makes any difference.

I also tried another command, which gave me everything from the first <w:t> tag up to the last </w:t> tag, including everything in between, so there was a lot of extra text in the output. The code for that was:

grep -o '<w:t>.*</w:t>' word/document.xml | sed 's/\(<w:t>\|<\/w:t>\)//g' > brev.txt

Sample file (formatted for readability; the original is a single line)

<?xml version="1.0" encoding="UTF-8" standalone="yes"?> <w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:cx="http://schemas.microsoft.com/office/drawing/2014/chartex" xmlns:cx1="http://schemas.microsoft.com/office/drawing/2015/9/8/chartex" xmlns:cx2="http://schemas.microsoft.com/office/drawing/2015/10/21/chartex" xmlns:cx3="http://schemas.microsoft.com/office/drawing/2016/5/9/chartex" xmlns:cx4="http://schemas.microsoft.com/office/drawing/2016/5/10/chartex" xmlns:cx5="http://schemas.microsoft.com/office/drawing/2016/5/11/chartex" xmlns:cx6="http://schemas.microsoft.com/office/drawing/2016/5/12/chartex" xmlns:cx7="http://schemas.microsoft.com/office/drawing/2016/5/13/chartex" xmlns:cx8="http://schemas.microsoft.com/office/drawing/2016/5/14/chartex" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:aink="http://schemas.microsoft.com/office/drawing/2016/ink" xmlns:am3d="http://schemas.microsoft.com/office/drawing/2017/model3d" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:w16cid="http://schemas.microsoft.com/office/word/2016/wordml/cid" xmlns:w16se="http://schemas.microsoft.com/office/word/2015/wordml/symex" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" 
xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 w15 w16se w16cid wp14"> <w:body> <w:p w14:paraId="35B527D8" w14:textId="4CF0BDCB" w:rsidR="0068138C" w:rsidRDefault="00BF1E48"> <w:r> <w:t>Here’s a Word document. It has several sentences.</w:t> </w:r> </w:p> <w:p w14:paraId="4AADFADF" w14:textId="4F49E2CE" w:rsidR="00BF1E48" w:rsidRDefault="00BF1E48"> <w:r> <w:t>Most are short.</w:t> </w:r> </w:p> <w:p w14:paraId="608ED30C" w14:textId="2163C420" w:rsidR="00BF1E48" w:rsidRDefault="00BF1E48"> <w:r> <w:t>All are in English.</w:t> </w:r> </w:p> <w:p w14:paraId="0B67C683" w14:textId="77777777" w:rsidR="00BF1E48" w:rsidRDefault="00BF1E48"> <w:bookmarkStart w:id="0" w:name="_GoBack"/> <w:bookmarkEnd w:id="0"/> </w:p> <w:sectPr w:rsidR="00BF1E48"> <w:pgSz w:w="11906" w:h="16838"/> <w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="708" w:footer="708" w:gutter="0"/> <w:cols w:space="708"/> <w:docGrid w:linePitch="360"/> </w:sectPr> </w:body> </w:document>

Having an example file to test solutions on would be most helpful. — Kusalananda, Commented Nov 6, 2020 at 16:52

Chris Davies · Accepted Answer · 2020-11-06 17:33:36Z

Use an XML parser to parse XML. Using the sample document I added to your question,

xmlstarlet sel -t -v '//w:t' -n word/document.xml >brev.txt
cat brev.txt

Here’s a Word document. It has several sentences.
Most are short.
All are in English.

If you really cannot get hold of an XML parser, but you have GNU grep, you could use this pattern, although it's the wrong way to approach the problem:

grep -oP '(?<=<w:t>).*?(?=</w:t>)' word/document.xml
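
To see why the non-greedy `.*?` in that pattern matters, here is a minimal sketch. It assumes GNU grep built with `-P` support, and the `sample` string is a made-up one-line fragment, not the real document.xml. It compares a greedy match with the lazy lookaround pattern above:

```shell
#!/bin/sh
# Hypothetical one-line sample with two <w:t> elements (not the real document.xml).
sample='<w:r><w:t>Most are short.</w:t></w:r><w:r><w:t>All are in English.</w:t></w:r>'

# Greedy .* runs from the first <w:t> to the last </w:t>, so there is one long match.
greedy=$(printf '%s\n' "$sample" | grep -o '<w:t>.*</w:t>')

# Lazy .*? stops at the nearest </w:t>; the lookarounds keep the tags out of the output.
lazy=$(printf '%s\n' "$sample" | grep -oP '(?<=<w:t>).*?(?=</w:t>)')

printf 'greedy:\n%s\n' "$greedy"
printf 'lazy:\n%s\n' "$lazy"
```

The greedy variant returns a single line that still contains the markup between the two elements, which matches the "lot of extra text" described in the question. The lazy variant returns one line per element.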

Thank you for editing my question and making it better. The sample file was a great idea, thank you. Thank you for your suggestion! Why is it the wrong way to approach the problem? — Sad student, Commented Nov 6, 2020 at 17:32
XML doesn't define a physical file layout. At the moment it's all one line, but the XML file would still be perfectly valid if the component <w:t was on a different line to its corresponding >…</w:t>. In such a situation the XML parser xmlstarlet would continue to work but the grep would fail (silently) — Chris Davies, Commented Nov 6, 2020 at 22:55
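
The silent failure described in that comment can be reproduced with a small sketch. It again assumes GNU grep with `-P`, and the `split` fragment is a hypothetical example in which the start tag is broken across two lines, which XML allows:

```shell
#!/bin/sh
# Hypothetical fragment: the start tag "<w:t" and its ">" sit on different lines.
# This is still well-formed XML, because whitespace is allowed before ">".
split=$(printf '<w:r><w:t\n>Most are short.</w:t></w:r>\n')

# grep matches line by line, so no single line contains the literal "<w:t>".
status=0
matches=$(printf '%s\n' "$split" | grep -oP '(?<=<w:t>).*?(?=</w:t>)') || status=$?

# No error message is printed; the only hint is the empty output and exit status 1.
printf 'matches: "%s" (grep exit status %s)\n' "$matches" "$status"
```

grep prints nothing and exits with status 1 (no match), so a pipeline that ignores the exit status would simply write an empty brev.txt.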
