Extract string followed by specific word/symbol

Question

I have two lines, shown below, in my input file input.txt. I need to extract claimStartDate from the first line and claimEndDate from the second line.

    <ProfessionalClaim paymentIndicator="P" claimProcessedDateTime="20180409120000102" claimEndDate="2018-04-02" claimStartDate="2018-04-02" sourceSystemId="abcd" claimActionCode="00">
    <ProfessionalClaim paymentIndicator="P" claimProcessedDateTime="20180430120000281" claimEndDate="2018-04-17" claimStartDate="2018-04-17" sourceSystemId="abcd" claimActionCode="00">

    rm input.txt
    awk '/<ProfessionalClaim/' test.xml | head -1 > input.txt
    awk '/<ProfessionalClaim/' test.xml | tail -1 >> input.txt
    awk '{match($0, "claimStartDate=\"([^\"]+)\"", start); print start[1]} \
         {match($0, "claimEndDate=\"([^\"]+)\"", end); print end[1]}' input.txt

F_LINE=<ProfessionalClaim paymentIndicator="P" claimProcessedDateTime="20180409120000102" claimEndDate="2018-04-02" claimStartDate="2018-04-02" sourceSystemId="abcd" claimActionCode="00"> L_LINE=<ProfessionalClaim paymentIndicator="P" claimProcessedDateTime="20180430120000281" claimEndDate="2018-04-17" claimStartDate="2018-04-17" sourceSystemId="abcd" claimActionCode="00"> — Velava Shanmugam, Commented Jan 24, 2019 at 7:23
Are these lines in a text file that you want to use as the input? Are there multiple F_LINE and L_LINE entries? What should your output look like? Please edit your question and add this information. Use the code button to present file contents and commands better. Thanks! — finswimmer, Commented Jan 24, 2019 at 7:35
I pulled these two lines from an XML file and use them as input to get claimStartDate from F_LINE and claimEndDate from L_LINE. I have changed the question now. Please let me know if you need any more details. Thanks! — Velava Shanmugam, Commented Jan 24, 2019 at 7:38
It would be appropriate and more efficient to use an XML parser (like XMLStarlet or a Perl/Python XML parser module) on the original XML document. You have not shown how these lines are part of the original document or how you parse them out. — Kusalananda, Commented Jan 24, 2019 at 7:41

finswimmer · Accepted Answer · 2019-01-24 17:59:51Z

    $ awk '/F_LINE/ {match($0, "claimStartDate=\"([^\"]+)\"", start); print start[1]} \
           /L_LINE/ {match($0, "claimEndDate=\"([^\"]+)\"", end); print end[1]}' input.txt
    2018-04-02
    2018-04-17

EDIT due to your new information:

    $ awk 'NR==1 {match($0, "claimStartDate=\"([^\"]+)\"", start); print start[1]} \
           NR==2 {match($0, "claimEndDate=\"([^\"]+)\"", end); print end[1]}' input.txt
    2018-04-02
    2018-04-17
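
The three-argument form of match() used above is a GNU awk extension, so these commands may fail under other awk implementations such as mawk. A minimal portable sketch follows, assuming the same two-line input.txt and that attribute values never contain a double quote. The helper name attr is hypothetical and not part of the answer.

```shell
# Recreate the two-line sample input from the question.
cat > input.txt <<'EOF'
<ProfessionalClaim paymentIndicator="P" claimProcessedDateTime="20180409120000102" claimEndDate="2018-04-02" claimStartDate="2018-04-02" sourceSystemId="abcd" claimActionCode="00">
<ProfessionalClaim paymentIndicator="P" claimProcessedDateTime="20180430120000281" claimEndDate="2018-04-17" claimStartDate="2018-04-17" sourceSystemId="abcd" claimActionCode="00">
EOF

# attr(line, name) is a hypothetical helper: it returns the value of the
# attribute called name, or an empty string if the attribute is absent.
# It relies only on POSIX match(), RSTART and RLENGTH.
result=$(awk '
function attr(line, name,    prefix) {
    prefix = name "=\""
    if (!match(line, prefix "[^\"]*\""))
        return ""
    # Skip the prefix and the closing quote.
    return substr(line, RSTART + length(prefix), RLENGTH - length(prefix) - 1)
}
NR == 1 { print attr($0, "claimStartDate") }
NR == 2 { print attr($0, "claimEndDate") }
' input.txt)

printf '%s\n' "$result"
```

With the sample input this prints 2018-04-02 followed by 2018-04-17, the same result as the GNU awk version.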

You can also do this all in one run:

    $ grep "<ProfessionalClaim" text.xml \
        | sed -n '1p;$p' \
        | awk 'NR==1 {match($0, "claimStartDate=\"([^\"]+)\"", start); print start[1]} \
               NR==2 {match($0, "claimEndDate=\"([^\"]+)\"", end); print end[1]}'

grep finds all lines containing <ProfessionalClaim in text.xml
sed keeps only the first and the last of those lines
awk prints claimStartDate from the first line and claimEndDate from the second line
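
The three stages can also be folded into a single awk process that remembers the first and the last matching line. This is only a sketch: it assumes one ProfessionalClaim element per line, uses portable awk instead of GNU awk's three-argument match(), and works on a hypothetical sample.xml with shortened attributes. The helper attr is an illustrative name.

```shell
# Hypothetical sample file with three claims, one element per line.
cat > sample.xml <<'EOF'
<root>
<ProfessionalClaim claimEndDate="2018-04-02" claimStartDate="2018-04-01"/>
<ProfessionalClaim claimEndDate="2018-04-10" claimStartDate="2018-04-09"/>
<ProfessionalClaim claimEndDate="2018-04-17" claimStartDate="2018-04-16"/>
</root>
EOF

dates=$(awk '
# Return the value of attribute name in line, or "" if it is absent.
function attr(line, name,    prefix) {
    prefix = name "=\""
    if (!match(line, prefix "[^\"]*\""))
        return ""
    return substr(line, RSTART + length(prefix), RLENGTH - length(prefix) - 1)
}
/<ProfessionalClaim/ {
    if (!seen++) first = $0   # remember the first matching line
    last = $0                 # keep overwriting, so this ends as the last one
}
END {
    if (seen) {
        print attr(first, "claimStartDate")
        print attr(last, "claimEndDate")
    }
}
' sample.xml)

printf '%s\n' "$dates"
```

For this sample the output is 2018-04-01 and 2018-04-17. If only one line matches, first and last are the same line, so both values come from it.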

My inputs are in two string variables, F_LINE and L_LINE. What is input.txt here? — Velava Shanmugam, Commented Jan 24, 2019 at 8:34
As you haven't specified how you pulled the two lines, I assumed in my example that they are in a new file called input.txt. If this is not the case, please add more information to your original post: how you extracted them and where you are starting from now (show some code, what language you are using, ...) — finswimmer, Commented Jan 24, 2019 at 8:45
Earlier I was writing those two lines into separate variables called F_LINE and L_LINE: {<ProfessionalClaim paymentIndicator="P" claimProcessedDateTime="20180409120000102" claimEndDate="2018-04-02" claimStartDate="2018-04-02" sourceSystemId="abcd" claimActionCode="00">} {<ProfessionalClaim paymentIndicator="P" claimProcessedDateTime="20180430120000281" claimEndDate="2018-04-17" claimStartDate="2018-04-17" sourceSystemId="abcd" claimActionCode="00">} — Velava Shanmugam, Commented Jan 24, 2019 at 16:25
I need only the claimStartDate from the first line and the claimEndDate from the second line. — Velava Shanmugam, Commented Jan 24, 2019 at 16:34
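
If the two lines really are held in shell variables rather than in a file, the values can be pulled out with parameter expansion alone. The sketch below assumes a POSIX shell, that F_LINE and L_LINE hold the two lines exactly as shown in the comments, and that values contain no double quote. The function name get_attr is hypothetical.

```shell
F_LINE='<ProfessionalClaim paymentIndicator="P" claimProcessedDateTime="20180409120000102" claimEndDate="2018-04-02" claimStartDate="2018-04-02" sourceSystemId="abcd" claimActionCode="00">'
L_LINE='<ProfessionalClaim paymentIndicator="P" claimProcessedDateTime="20180430120000281" claimEndDate="2018-04-17" claimStartDate="2018-04-17" sourceSystemId="abcd" claimActionCode="00">'

# get_attr LINE NAME: print the value of attribute NAME found in LINE.
# Returns non-zero and prints nothing if the attribute is missing.
get_attr() {
    case $1 in
        *"$2=\""*) ;;        # attribute present, carry on
        *) return 1 ;;       # attribute missing
    esac
    value=${1#*"$2=\""}      # strip everything up to and including NAME="
    value=${value%%\"*}      # strip from the closing quote onwards
    printf '%s\n' "$value"
}

claim_start=$(get_attr "$F_LINE" claimStartDate)
claim_end=$(get_attr "$L_LINE" claimEndDate)

printf '%s\n' "$claim_start" "$claim_end"
```

This avoids starting awk or sed at all, at the cost of being plain text matching with no awareness of XML.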
Thanks a lot, it's working fine! I also need to take one other field from the first and last lines (ClaimProcessedDateTime). I am using the command below for that, but for some reason paid_stop is not getting populated: grep "<ProfessionalClaim" test.xml \ | sed -n '1p;$p' \ | awk 'NR==1 {match($0, "claimProcessedDateTime=\"([^\"]+)\"", start); print "paid_start " start[1]} \ NR==2 {match($0, "ClaimProcessedDateTime=\"([^\"]+)\"", end); print "paid_stop " end[1]}' — Velava Shanmugam, Commented Jan 24, 2019 at 19:05
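
Judging only from the command in this comment, a likely cause is the spelling of the second pattern: it looks for ClaimProcessedDateTime with a capital C, while the attribute in the data is claimProcessedDateTime. Regular expression matching in awk and sed is case-sensitive by default, so the second match() finds nothing and end[1] stays empty. Changing the pattern in the NR==2 rule to claimProcessedDateTime should fix it. The effect can be reproduced with a small sketch, where extract is a hypothetical helper built on sed:

```shell
line='<ProfessionalClaim paymentIndicator="P" claimProcessedDateTime="20180430120000281" claimEndDate="2018-04-17" claimStartDate="2018-04-17" sourceSystemId="abcd" claimActionCode="00">'

# extract NAME: print the value of attribute NAME from $line, if present.
extract() {
    printf '%s\n' "$line" | sed -n "s/.*$1=\"\([^\"]*\)\".*/\1/p"
}

wrong=$(extract ClaimProcessedDateTime)   # capital C: no match, empty result
right=$(extract claimProcessedDateTime)   # matches the attribute as written

printf 'paid_stop %s\n' "$wrong"
printf 'paid_stop %s\n' "$right"
```

The first printf shows an empty value after paid_stop, and the second shows 20180430120000281.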

Kusalananda · Accepted Answer · 2023-01-28 11:23:38Z

Assuming some XML input document like the following:

    <?xml version="1.0"?>
    <root>
      <ProfessionalClaim paymentIndicator="P" claimProcessedDateTime="20180409120000102" claimEndDate="2018-04-02" claimStartDate="2018-04-02" sourceSystemId="abcd" claimActionCode="00"/>
      <ProfessionalClaim paymentIndicator="P" claimProcessedDateTime="20180430120000281" claimEndDate="2018-04-17" claimStartDate="2018-04-17" sourceSystemId="abcd" claimActionCode="00"/>
      <ProfessionalClaim paymentIndicator="P" claimProcessedDateTime="20180430120000281" claimEndDate="2018-04-18" claimStartDate="2018-04-18" sourceSystemId="abcd" claimActionCode="00"/>
      <ProfessionalClaim paymentIndicator="P" claimProcessedDateTime="20180430120000281" claimEndDate="2018-04-19" claimStartDate="2018-04-19" sourceSystemId="abcd" claimActionCode="00"/>
    </root>

... we may use xmlstarlet to extract the claimStartDate attribute's value from each ProfessionalClaim node that has another ProfessionalClaim node following it, together with that next ProfessionalClaim node's claimEndDate attribute's value:

    xmlstarlet select --template \
        --match '//ProfessionalClaim[following-sibling::ProfessionalClaim/@claimEndDate]' \
        --value-of 'concat(@claimStartDate, " ", following-sibling::ProfessionalClaim/@claimEndDate)' \
        -nl input.txt

This first matches each ProfessionalClaim node that is followed by another ProfessionalClaim node.

For each such node, the value of the claimStartDate attribute is concatenated with the value of the claimEndDate attribute of the following ProfessionalClaim node, with a single space character as delimiter.

Given my example document above, this would generate

    2018-04-02 2018-04-17
    2018-04-17 2018-04-18
    2018-04-18 2018-04-19
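
If only the claimStartDate of the first ProfessionalClaim node and the claimEndDate of the last one are wanted, as in the original question, the XPath expressions can select those two nodes directly. This is a sketch under assumptions: claims.xml is a hypothetical file holding an abridged copy of the example document, first_last_dates is an illustrative function name, and the xmlstarlet executable is assumed to be installed under that name.

```shell
# Abridged copy of the example document, written to a hypothetical file.
cat > claims.xml <<'EOF'
<?xml version="1.0"?>
<root>
  <ProfessionalClaim claimEndDate="2018-04-02" claimStartDate="2018-04-02"/>
  <ProfessionalClaim claimEndDate="2018-04-17" claimStartDate="2018-04-17"/>
  <ProfessionalClaim claimEndDate="2018-04-18" claimStartDate="2018-04-18"/>
  <ProfessionalClaim claimEndDate="2018-04-19" claimStartDate="2018-04-19"/>
</root>
EOF

# first_last_dates FILE: print the first node's claimStartDate and the
# last node's claimEndDate, one per line.
first_last_dates() {
    xmlstarlet select --template \
        --value-of '(//ProfessionalClaim)[1]/@claimStartDate' --nl \
        --value-of '(//ProfessionalClaim)[last()]/@claimEndDate' --nl \
        "$1"
}

# Only run the example when xmlstarlet is actually available.
if command -v xmlstarlet >/dev/null 2>&1; then
    first_last_dates claims.xml
fi
```

The parentheses around //ProfessionalClaim make [1] and [last()] apply to the whole set of matching nodes in document order rather than to each parent separately.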
