
Using Linux server

I have a URL which generates the below data:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="/seriessnapshot.xsl"?>
<timeSeries>
<series parentPath="uat.fft.client.CB1201C.AP714628.fusion.ebond-fusion-nucleus-app.ebond-fusion-order-app" id="OpenFin Memory(MB)" latestItemTimestamp="1588874010094" datetime="2020/05/07 19:53:30" latestItemValue="101"/>
<series parentPath="uat.fft.client.CB1201C.AP714628.fusion.ebond-fusion-nucleus-app.ebond-fusion-rfq-credit" id="OpenFin Memory(MB)" latestItemTimestamp="1588874010094" datetime="2020/05/07 19:53:30" latestItemValue="101"/>
<series parentPath="uat.fft.client.CB1201C.AP714628.fusion.ebond-fusion-nucleus-app.ebond-fusion-risk-app" id="OpenFin Memory(MB)" latestItemTimestamp="1588874010094" datetime="2020/05/07 19:53:30" latestItemValue="96"/>
<series parentPath="uat.fft.client.CB3ERWE.AP717938.fusion.ebond-fusion-nucleus-app.ebond-fusion-order-app" id="OpenFin Memory(MB)" latestItemTimestamp="1588860133654" datetime="2020/05/07 16:02:13" latestItemValue="101"/>
<series parentPath="uat.fft.client.CB3ERWE.AP717938.fusion.ebond-fusion-nucleus-app.ebond-fusion-rfq-credit" id="OpenFin Memory(MB)" latestItemTimestamp="1588860133654" datetime="2020/05/07 16:02:13" latestItemValue="103"/>
<series parentPath="uat.fft.client.CB3ERWE.AP717938.fusion.ebond-fusion-nucleus-app.ebond-fusion-risk-app" id="OpenFin Memory(MB)" latestItemTimestamp="1588860133654" datetime="2020/05/07 16:02:13" latestItemValue="99"/>
<series parentPath="uat.fft.client.GA2ADAZ.AP718017.fusion.ebond-fusion-nucleus-app.ebond-fusion-rfq-credit" id="OpenFin Memory(MB)" latestItemTimestamp="1588874018986" datetime="2020/05/07 19:53:38" latestItemValue="107"/>
<series parentPath="uat.fft.client.GA2BASV.AP722002.fusion.ebond-fusion-nucleus-app.ebond-fusion-rfq-credit" id="OpenFin Memory(MB)" latestItemTimestamp="1588866113043" datetime="2020/05/07 17:41:53" latestItemValue="110"/>
<series parentPath="uat.fft.client.GA2BHUH.AP717267.fusion.ebond-fusion-nucleus-app.ebond-fusion-order-app" id="OpenFin Memory(MB)" latestItemTimestamp="1588864837395" datetime="2020/05/07 17:20:37" latestItemValue="102"/>
<series parentPath="uat.fft.client.GA2BHUH.AP717267.fusion.ebond-fusion-nucleus-app.ebond-fusion-rfq-credit" id="OpenFin Memory(MB)" latestItemTimestamp="1588864837395" datetime="2020/05/07 17:20:37" latestItemValue="126"/>
<series parentPath="uat.fft.client.GA2BHUH.AP717267.fusion.ebond-fusion-nucleus-app.ebond-fusion-risk-app" id="OpenFin Memory(MB)" latestItemTimestamp="1588864837395" datetime="2020/05/07 17:20:37" latestItemValue="114"/>
<series parentPath="uat.fft.client.GA2CRAD.AP718024.fusion.ebond-fusion-nucleus-app.ebond-fusion-rfq-sales" id="OpenFin Memory(MB)" latestItemTimestamp="1588862905103" datetime="2020/05/07 16:48:25" latestItemValue="102"/>
<series parentPath="uat.fft.client.GA2MRAH.AP711671.fusion.ebond-fusion-nucleus-app.ebond-fusion-quote-app" id="OpenFin Memory(MB)" latestItemTimestamp="1588867209058" datetime="2020/05/07 18:00:09" latestItemValue="103"/>
<series parentPath="uat.fft.client.GA2MRAH.AP711671.fusion.ebond-fusion-nucleus-app.ebond-fusion-rfq-credit" id="OpenFin Memory(MB)" latestItemTimestamp="1588867209058" datetime="2020/05/07 18:00:09" latestItemValue="116"/>
<series parentPath="uat.fft.client.GA2MRAH.AP711671.fusion.ebond-fusion-nucleus-app.ebond-fusion-risk-app" id="OpenFin Memory(MB)" latestItemTimestamp="1588867209058" datetime="2020/05/07 18:00:09" latestItemValue="113"/>
<series parentPath="uat.fft.client.GA2OUGB.AP721570.fusion.ebond-fusion-nucleus-app.ebond-fusion-order-app" id="OpenFin Memory(MB)" latestItemTimestamp="1588866117341" datetime="2020/05/07 17:41:57" latestItemValue="104"/>
<series parentPath="uat.fft.client.GA2OUGB.AP721570.fusion.ebond-fusion-nucleus-app.ebond-fusion-rfq-credit" id="OpenFin Memory(MB)" latestItemTimestamp="1588866117341" datetime="2020/05/07 17:41:57" latestItemValue="112"/>
<series parentPath="uat.fft.client.GA2OUGB.AP721570.fusion.ebond-fusion-nucleus-app.ebond-fusion-risk-app" id="OpenFin Memory(MB)" latestItemTimestamp="1588866117341" datetime="2020/05/07 17:41:57" latestItemValue="116"/>
<series parentPath="uat.fft.client.GA2PASH.AP722622.fusion.ebond-fusion-nucleus-app.ebond-fusion-rfq-credit" id="OpenFin Memory(MB)" latestItemTimestamp="1588850319464" datetime="2020/05/07 13:18:39" latestItemValue="103"/>
<series parentPath="uat.fft.client.GA2SA7H.AP721875.fusion.ebond-fusion-nucleus-app.ebond-fusion-quote-app" id="OpenFin Memory(MB)" latestItemTimestamp="1588866495109" datetime="2020/05/07 17:48:15" latestItemValue="110"/>
<series parentPath="uat.fft.client.GA2SA7H.AP721875.fusion.ebond-fusion-nucleus-app.ebond-fusion-rfq-credit" id="OpenFin Memory(MB)" latestItemTimestamp="1588866495109" datetime="2020/05/07 17:48:15" latestItemValue="102"/>
<series parentPath="uat.fft.client.GA2SA7H.AP721875.fusion.ebond-fusion-nucleus-app.ebond-fusion-risk-app" id="OpenFin Memory(MB)" latestItemTimestamp="1588866495109" datetime="2020/05/07 17:48:15" latestItemValue="123"/>
<series parentPath="uat.fft.client.ga2cria.AP716960.fusion.ebond-fusion-nucleus-app.ebond-fusion-quote-app" id="OpenFin Memory(MB)" latestItemTimestamp="1588874030265" datetime="2020/05/07 19:53:50" latestItemValue="130"/>
<series parentPath="uat.fft.client.ga2cria.AP716960.fusion.ebond-fusion-nucleus-app.ebond-fusion-rfq-credit" id="OpenFin Memory(MB)" latestItemTimestamp="1588874030265" datetime="2020/05/07 19:53:50" latestItemValue="125"/>
<series parentPath="uat.fft.client.ga2cria.AP716960.fusion.ebond-fusion-nucleus-app.ebond-fusion-risk-app" id="OpenFin Memory(MB)" latestItemTimestamp="1588874030265" datetime="2020/05/07 19:53:50" latestItemValue="107"/>
<series parentPath="uat.fft.client.ga2fasa.AP715911.fusion.ebond-fusion-nucleus-app.ebond-fusion-order-app" id="OpenFin Memory(MB)" latestItemTimestamp="1588873964945" datetime="2020/05/07 19:52:44" latestItemValue="101"/>
<series parentPath="uat.fft.client.ga2fasa.AP715911.fusion.ebond-fusion-nucleus-app.ebond-fusion-risk-app" id="OpenFin Memory(MB)" latestItemTimestamp="1588873964945" datetime="2020/05/07 19:52:44" latestItemValue="113"/>

As you can see, each AP number is different, and the two letters at the start can also vary: on other URLs the prefix could be BR, AD or CS instead of AP. I need a command that extracts just these IDs into a list. It must accept any two letters followed by any number, and it must remove duplicate IDs from the list.

So far I have created the below shell script:

echo 'ID,User,HostName,Application,DateTime,Value'
curl -k -s 'https://testurl'

I now need to pipe the output of that command into awk, sed or grep to extract all the AP numbers, remove the duplicates and produce a list of the unique values. Please keep in mind that the letters AP are also dynamic: they could be any two letters of the alphabet.

I tried the below awk command:

/usr/bin/curl -k -s https://example.com:18080/seriessnapshot?substringSearch=OpenFin%20Memory | awk -F".AP" '{print $2}' | awk -F. '{print $1}' | sort | uniq 

However this only returns the below:

717958 717961 717962 717977 717980 717982 717996 718397 718685 718954 719045 719051 719257 719262 719265 719432 719488 719821 719905 719906 720203 720455 720467 720911 721548 721569 721732 721737 

It does not return the two letters at the start of the number, which could be any two letters. How can I tell awk to accept any two letters followed by any number and print that whole field?

  • edit your question to show the exact expected output given your posted sample input.
    – Ed Morton
    Commented May 8, 2020 at 0:46
  • Which parts of that XML document are the "AP numbers"? Looking at the first result set line, is it AP714628?
    Commented May 21, 2020 at 20:40

3 Answers


@Nikhil, your original question didn't mention XML formatting; it had only the raw content. Since you know the data is contained in XML, a more robust approach is to follow the other answer that recommends xmlstarlet, or another tool built for parsing XML.

If you are sure that the format won't change and you want a quick-and-dirty awk solution, would this work for you:

awk -F"." '{print $5}' | sort --unique 

Here is a link to this code and some sample data from your question: Try it online!
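
A minimal sketch of that field-based approach, assuming each series element sits on its own line and the ID is always the fifth dot-separated field. The /<series/ pattern skips the stylesheet line, and sort -u is the more portable spelling of sort --unique. The extract_ids name and the sample data are made up for illustration; the BR line is hypothetical and only there to show that a different prefix comes through unchanged.

```shell
# Sample input: the stylesheet line, two lines from the question,
# and one hypothetical line with a BR prefix.
sample='<?xml-stylesheet type="text/xsl" href="/seriessnapshot.xsl"?>
<series parentPath="uat.fft.client.CB1201C.AP714628.fusion.ebond-fusion-nucleus-app.ebond-fusion-order-app" id="OpenFin Memory(MB)"/>
<series parentPath="uat.fft.client.CB1201C.AP714628.fusion.ebond-fusion-nucleus-app.ebond-fusion-risk-app" id="OpenFin Memory(MB)"/>
<series parentPath="uat.fft.client.GA2ADAZ.BR718017.fusion.ebond-fusion-nucleus-app.ebond-fusion-rfq-credit" id="OpenFin Memory(MB)"/>'

# Print field 5 only for lines that contain a <series element,
# then sort and drop duplicates.
extract_ids() {
    awk -F. '/<series/ { print $5 }' | sort -u
}

printf '%s\n' "$sample" | extract_ids
```

Because it prints the whole field rather than matching on AP, the prefix can be any letters. It does break if the number of dot-separated components before the ID ever changes.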

  • Hi Spuck, I have just edited the output of the URL in my question. I had to change the characters from 26-31 to 44-51. However, it is also returning some unwanted text, i.e. "iessnaps", which I assume is coming from the line "<?xml-stylesheet type="text/xsl" href="/seriessnapshot.xsl"?>". I am aware I could just use a sed command to replace it with nothing, but is there an easier way to incorporate this into your suggestion?
    – Nikhil
    Commented May 7, 2020 at 18:03
  • @Nikhil, I've updated my answer to match this important detail.
    – spuck
    Commented May 7, 2020 at 19:00
  • Any idea how to do this using awk to return the AP? After further testing I found out that there are also numbers with different starting two letters. So I was thinking that, for this to work in all environments, it would be best to just use awk and return the column with the numbers rather than narrowing it down to AP, as there are many others such as PR and AD.
    – Nikhil
    Commented May 14, 2020 at 9:40
  • @Nikhil I'm not 100% sure what you're asking, but I will take a stab at it by editing my answer.
    – spuck
    Commented May 14, 2020 at 20:06
  • Maybe just add a filter like /series/ to the awk line, and use sort -u (more portable than --unique)?
    Commented May 22, 2020 at 6:12
curl "URL" | grep -E -o 'AP[[:digit:]]+' | sort -u 

This would fetch the data, extract all AP-numbers, and sort them while also removing duplicates.

This assumes that the AP-numbers only occur in the data as you have shown and not in any other irrelevant position (the XML document that you showed is truncated at the end).
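
If the two-letter prefix varies, the same grep idea can be generalized. The following is only a sketch, assuming every ID is exactly two uppercase letters followed by digits and forms a complete dot-delimited component of parentPath. Requiring the dots on both sides keeps user names such as CB1201C or GA2ADAZ from matching. The extract_ids name is made up, and the BR line in the sample is hypothetical.

```shell
# Sample input: one line from the question plus a hypothetical BR line.
sample='<series parentPath="uat.fft.client.CB1201C.AP714628.fusion.ebond-fusion-nucleus-app.ebond-fusion-order-app" id="OpenFin Memory(MB)" latestItemTimestamp="1588874010094" datetime="2020/05/07 19:53:30" latestItemValue="101"/>
<series parentPath="uat.fft.client.GA2ADAZ.BR718017.fusion.ebond-fusion-nucleus-app.ebond-fusion-rfq-credit" id="OpenFin Memory(MB)" latestItemTimestamp="1588874018986" datetime="2020/05/07 19:53:38" latestItemValue="107"/>'

# Match ".XX123." (dot, two uppercase letters, digits, dot),
# strip the dots, then sort and drop duplicates.
extract_ids() {
    grep -E -o '\.[[:upper:]]{2}[[:digit:]]+\.' | tr -d '.' | sort -u
}

printf '%s\n' "$sample" | extract_ids
```

In the real pipeline the printf would be replaced by the curl command.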


For a slightly safer parsing of the XML, use xmlstarlet:

curl "URL" | xmlstarlet sel -t -v '/timeSeries/series/@parentPath[contains(.,"AP")]' | grep -E -o 'AP[[:digit:]]+' | sort -u 

This parses out all the long values of the parentPath attributes, and passes those through grep and sort.
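
If the prefix is not always AP, the grep stage of that pipeline could be swapped for a positional cut. A sketch of just that last stage, assuming the parentPath values arrive one per line (as the xmlstarlet stage prints them) and the ID is always the fifth dot-separated component. The ids_from_paths name is made up, and the BR path is hypothetical.

```shell
# Stand-in for the output of the xmlstarlet stage: one parentPath per line.
paths='uat.fft.client.CB1201C.AP714628.fusion.ebond-fusion-nucleus-app.ebond-fusion-order-app
uat.fft.client.CB1201C.AP714628.fusion.ebond-fusion-nucleus-app.ebond-fusion-rfq-credit
uat.fft.client.GA2ADAZ.BR718017.fusion.ebond-fusion-nucleus-app.ebond-fusion-rfq-credit'

# Keep the fifth dot-separated component, whatever its prefix,
# then sort and drop duplicates.
ids_from_paths() {
    cut -d. -f5 | sort -u
}

printf '%s\n' "$paths" | ids_from_paths
```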


Doing it all with xmlstarlet is also possible. Here I'm assuming you want all the AP-numbers that correspond to the attribute values that contain the string credit:

curl "URL" |
xmlstarlet sel -t \
    -m '/timeSeries/series/@parentPath[contains(., "credit")]' \
    -v 'concat("AP", substring-before(substring-after(., "AP"), "."))' -nl

If looking for just the string credit will miss some AP-numbers, then just extract them all and make the resulting list unique with sort -u as before:

curl "URL" |
xmlstarlet sel -t \
    -m '/timeSeries/series/@parentPath' \
    -v 'concat("AP", substring-before(substring-after(., "AP"), "."))' -nl |
sort -u
  • Any idea how to do this using awk to return the AP? After further testing I found out that there are also numbers with different starting two letters. So I was thinking that, for this to work in all environments, it would be best to just use awk and return the column with the numbers rather than narrowing it down to AP, as there are many others such as PR and AD.
    – Nikhil
    Commented May 14, 2020 at 9:40
  • @Nikhil You shouldn't really accept an answer if it doesn't solve your issue. Also, we can only see the data that you include in your actual question, so if the data there is not representative of what you have, then we can't really do much about it. Please update your question. Also, I will never use awk to parse XML; it is definitely not the correct tool for that job.
    – Kusalananda
    Commented May 14, 2020 at 9:43
  • Apologies, I am new to Stack Exchange. I accepted the answers because at the time they gave the right outcome for me. However, after further testing I found out that the AP part was not static and could be any letters, which is why I was asking about the awk command. I have made changes to my question in the meantime, ta.
    – Nikhil
    Commented May 14, 2020 at 9:55

This is assuming the file format sticks to your listing...

cat sample.txt | awk -F".AP" '{print $2}' | awk -F. '{print $1}' | sort | uniq 
  • You can replace the "cat sample.txt" with your command.
    – myT250
    Commented May 7, 2020 at 18:36
  • I did as you suggested, but my output returns the numbers with the AP part missing.
    – Nikhil
    Commented May 7, 2020 at 21:36
  • Hi, any idea why it only returns the numbers? I also need it to return the AP part. However, the AP part could also be different letters, so I need awk to return the whole column, as the whole string is dynamic: on different URLs it could be different letters before the numbers.
    – Nikhil
    Commented May 14, 2020 at 9:45
  • Please note that the approach is dangerous because awk interprets multi-character field separators, as in -F".AP", as a regular expression, meaning that the . will mean "any character". So, if the parentPath were something like uat.fft.client.CBAP01C.AP934123, it would already match the BAP and use that as FS. The AP part is missing because it is absorbed into the FS and is therefore no longer part of the field, so you need to prepend it in the print statement, as in print "AP" $2. Also, your awk command produces empty lines, as it prints something for every line, even those without AP.
    – AdminBee
    Commented May 19, 2020 at 9:22
