
I am trying to extract the text between each <w:t> opening tag and its closing </w:t> tag, but I'm only getting the text within the last pair of tags and not the others. How can I do this?

This is the code I've been trying to use:

grep '<w:t>' word/document.xml | sed 's/.*<w:t>\(.*\)<\/w:t>.*/\1/' | cat > brev.txt

As you can see, I'm grepping the document.xml file in the word directory for the tag and writing the result to a file named brev.txt, but it's not entirely working. How do I get all the lines, and not just the last one with the tag?

The document.xml file is a one-line text file, if that makes any difference.

I also tried another command, which gave me everything from the first <w:t> tag up to the last </w:t> tag, including everything in between, so there was a lot of extra text. The code for that was:

grep -o '<w:t>.*</w:t>' word/document.xml | sed 's/\(<w:t>\|<\/w:t>\)//g' > brev.txt 

Sample file (formatted for readability; the original is a single line)

<?xml version="1.0" encoding="UTF-8" standalone="yes"?> <w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:cx="http://schemas.microsoft.com/office/drawing/2014/chartex" xmlns:cx1="http://schemas.microsoft.com/office/drawing/2015/9/8/chartex" xmlns:cx2="http://schemas.microsoft.com/office/drawing/2015/10/21/chartex" xmlns:cx3="http://schemas.microsoft.com/office/drawing/2016/5/9/chartex" xmlns:cx4="http://schemas.microsoft.com/office/drawing/2016/5/10/chartex" xmlns:cx5="http://schemas.microsoft.com/office/drawing/2016/5/11/chartex" xmlns:cx6="http://schemas.microsoft.com/office/drawing/2016/5/12/chartex" xmlns:cx7="http://schemas.microsoft.com/office/drawing/2016/5/13/chartex" xmlns:cx8="http://schemas.microsoft.com/office/drawing/2016/5/14/chartex" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:aink="http://schemas.microsoft.com/office/drawing/2016/ink" xmlns:am3d="http://schemas.microsoft.com/office/drawing/2017/model3d" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:w16cid="http://schemas.microsoft.com/office/word/2016/wordml/cid" xmlns:w16se="http://schemas.microsoft.com/office/word/2015/wordml/symex" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" 
xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 w15 w16se w16cid wp14"> <w:body> <w:p w14:paraId="35B527D8" w14:textId="4CF0BDCB" w:rsidR="0068138C" w:rsidRDefault="00BF1E48"> <w:r> <w:t>Here’s a Word document. It has several sentences.</w:t> </w:r> </w:p> <w:p w14:paraId="4AADFADF" w14:textId="4F49E2CE" w:rsidR="00BF1E48" w:rsidRDefault="00BF1E48"> <w:r> <w:t>Most are short.</w:t> </w:r> </w:p> <w:p w14:paraId="608ED30C" w14:textId="2163C420" w:rsidR="00BF1E48" w:rsidRDefault="00BF1E48"> <w:r> <w:t>All are in English.</w:t> </w:r> </w:p> <w:p w14:paraId="0B67C683" w14:textId="77777777" w:rsidR="00BF1E48" w:rsidRDefault="00BF1E48"> <w:bookmarkStart w:id="0" w:name="_GoBack"/> <w:bookmarkEnd w:id="0"/> </w:p> <w:sectPr w:rsidR="00BF1E48"> <w:pgSz w:w="11906" w:h="16838"/> <w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="708" w:footer="708" w:gutter="0"/> <w:cols w:space="708"/> <w:docGrid w:linePitch="360"/> </w:sectPr> </w:body> </w:document> 
  • Having an example file to test solutions on would be most helpful.
    – Kusalananda
    Commented Nov 6, 2020 at 16:52

1 Answer


Use an XML parser to parse XML. Using the sample document I added to your question,

xmlstarlet sel -t -v '//w:t' -n word/document.xml >brev.txt
cat brev.txt
Here’s a Word document. It has several sentences.
Most are short.
All are in English.

If you really cannot get hold of an XML parser but you have GNU grep, you could use this pattern, although it's the wrong way to approach the problem:

grep -oP '(?<=<w:t>).*?(?=</w:t>)' word/document.xml 
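To see why the two attempts in the question behaved as they did, here is a minimal sketch on a made-up one-line sample. The sample variable and its contents are illustrative only, not taken from the real document. It assumes a grep that supports -o (GNU and BSD grep do; it is not in POSIX). The last command avoids -P by using a negated character class, which only holds up while the text between the tags contains no "<" and the opening tag is exactly <w:t>.

```shell
# Hypothetical one-line sample with two <w:t> elements.
sample='<w:r><w:t>Most are short.</w:t></w:r><w:r><w:t>All are in English.</w:t></w:r>'

# First attempt: the leading .* is greedy, so it swallows everything up to
# the LAST <w:t>, and only the final element's text is captured.
last=$(printf '%s\n' "$sample" | sed 's/.*<w:t>\(.*\)<\/w:t>.*/\1/')

# Second attempt: the greedy .* runs from the first <w:t> to the last </w:t>,
# giving a single oversized match.
greedy=$(printf '%s\n' "$sample" | grep -o '<w:t>.*</w:t>')

# [^<]* cannot run past the next tag, so each element matches on its own;
# sed then strips the surrounding tags.
each=$(printf '%s\n' "$sample" | grep -o '<w:t>[^<]*</w:t>' | sed 's/<[^>]*>//g')

printf 'last:   %s\n' "$last"
printf 'greedy: %s\n' "$greedy"
printf 'each:\n%s\n' "$each"
```

The non-greedy .*? in the -P pattern above serves the same purpose as [^<]* here: it stops at the first closing tag rather than the last one.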
  • Thank you for editing my question and making it better. The sample file was a great idea, thank you. Thank you for your suggestion! Why is it the wrong way to approach the problem?
    Commented Nov 6, 2020 at 17:32
  • XML doesn't define a physical file layout. At the moment it's all one line, but the XML file would still be perfectly valid if the component <w:t was on a different line to its corresponding >…</w:t>. In such a situation the XML parser xmlstarlet would continue to work but the grep would fail (silently).
    Commented Nov 6, 2020 at 22:55
  • Thank you for your time.
    Commented Nov 6, 2020 at 23:24
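The silent failure described in the comments can be reproduced with a small sketch. The fragment below is a hypothetical but still well-formed piece of XML in which the ">" of the opening tag sits on the next line. The demonstration uses a plain grep -o pattern so that it does not depend on -P support; the -P pattern from the answer has the same limitation, because grep matches one line at a time.

```shell
# Hypothetical fragment: the opening tag <w:t ...> is split across two lines,
# which is still valid XML.
fragment='<w:r><w:t
>Most are short.</w:t></w:r>'

# No single line contains the literal string "<w:t>", so grep finds nothing.
# It prints no error; the only hint is exit status 1.
if out=$(printf '%s\n' "$fragment" | grep -o '<w:t>[^<]*</w:t>'); then
    status=0
else
    status=$?
fi

printf 'matches: [%s], grep exit status: %s\n' "$out" "$status"
```

A script that redirects the output to brev.txt without checking the exit status would simply end up with an empty file, which is why a parser-based approach is the safer choice.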
