
I'm using this command:

xmllint --xpath 'substring-after(string(//item/link), "_")' rss.xml 

This gives the desired output, but only for the first link element. How can I change it so the expression is applied to every link?

I'm open to using any utility, so long as the sample input is accepted and one expression can be used to get the desired output.

Sample Input:

<rss version="2.0">
  <channel>
    <title>Malicious IPs | By Last Bad Event | Project Honey Pot</title>
    <link><![CDATA[http://www.projecthoneypot.org/list_of_ips.php]]></link>
    <description/>
    <copyright>Copyright 2021 Unspam Technologies, Inc</copyright>
    <language>en-us</language>
    <lastBuildDate>July 03 2021 07:15:12 PM</lastBuildDate>
    <image>
      <url>http://www.projecthoneypot.org/images/small_phpot_logo.jpg</url>
      <title>Project Honey Pot | Distribute Spammer Tracking System</title>
      <link>http://www.projecthoneypot.org</link>
    </image>
    <item>
      <title>92.204.241.167 | C</title>
      <link>http://www.projecthoneypot.org/ip_92.204.241.167</link>
      <description>Event: Bad Event | Total: 3,061 | First: 2021-03-27 | Last: 2021-07-03</description>
      <pubDate>July 03 2021 07:15:12 PM</pubDate>
    </item>
    <item>
      <title>181.24.239.244</title>
      <link>http://www.projecthoneypot.org/ip_181.24.239.244</link>
      <description>Event: Bad Event | Total: 1 | First: 2021-07-03 | Last: 2021-07-03</description>
      <pubDate>July 03 2021 07:15:12 PM</pubDate>
    </item>
    <item>
      <title>193.243.195.66 | S</title>
      <link>http://www.projecthoneypot.org/ip_193.243.195.66</link>
      <description>Event: Bad Event | Total: 4 | First: 2021-06-12 | Last: 2021-07-03</description>
      <pubDate>July 03 2021 07:15:12 PM</pubDate>
    </item>
  </channel>
</rss>

Desired Output:

92.204.241.167 181.24.239.244 193.243.195.66 

Present Output:

92.204.241.167 
  • Are you open to using xmlstarlet rather than xmllint?
    Commented Jul 4, 2021 at 21:51
  • @steeldriver I'm open to it, but I'd like to keep it all in one XPath expression if possible. I don't want to use xmlstarlet b/c then I'd need xmlstarlet sel -t -m "EXP1" -v "EXP2"
    – T145
    Commented Jul 4, 2021 at 23:57
  • Try xmllint --xpath '//item/link' rss.xml | sed 's/\(.*_\)\(.*\)\(<.*$\)/\2/g'
    – fpmurphy
    Commented Jul 5, 2021 at 0:59

3 Answers


Using xmlstarlet:

xmlstarlet sel -t -m '//item/link' -v 'substring-after(., "_")' -nl rss.xml 

This first matches (-m) all //item/link nodes, then outputs the value (-v) of the part of each matched node's text that follows the first underscore character. The final -nl outputs a newline character after each resulting string.

The second expression (substring-after()) will be evaluated for each node in the set matched by the first.


    You actually can't achieve this using XPath 1.0 alone. You can't return a sequence of strings, because there is no such data type in XPath 1.0, and you can't return a single string that concatenates the various substrings because you would still need the sequence of substrings as an intermediate result, and again, there is no such data type. So you either need to move to XPath 2.0+, or you need some assistance from a host language that executes multiple XPath expressions - which is what the xmlstarlet solution from @Kusalananda is doing.
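To make the limitation concrete, here is a minimal Python sketch using only the standard library. It is not what xmllint or xmlstarlet do internally; the helper names `substring_after` and `xpath1_string` are hypothetical stand-ins that imitate the XPath 1.0 functions, and the feed is a trimmed copy of the sample input. It shows why converting a node-set to a string keeps only the first node, and how a host-language loop gets around that.

```python
import xml.etree.ElementTree as ET

# Trimmed-down copy of the sample feed from the question.
RSS = """<rss version="2.0"><channel>
<link>http://www.projecthoneypot.org/list_of_ips.php</link>
<item><link>http://www.projecthoneypot.org/ip_92.204.241.167</link></item>
<item><link>http://www.projecthoneypot.org/ip_181.24.239.244</link></item>
<item><link>http://www.projecthoneypot.org/ip_193.243.195.66</link></item>
</channel></rss>"""


def substring_after(text, sep):
    """Imitate XPath substring-after(): text after the first sep, '' if absent."""
    return text.partition(sep)[2]


def xpath1_string(nodes):
    """Imitate XPath 1.0 string() on a node-set: only the first node is used."""
    return "".join(nodes[0].itertext()) if nodes else ""


root = ET.fromstring(RSS)
links = root.findall(".//item/link")  # ElementTree's limited XPath subset

# One expression over the whole node-set: collapses to the first link.
first_only = substring_after(xpath1_string(links), "_")

# Host language evaluates the string expression once per matched node.
per_node = [substring_after("".join(link.itertext()), "_") for link in links]

print(first_only)
print("\n".join(per_node))
```

The first print corresponds to the "Present Output" in the question; the second corresponds to the "Desired Output", one address per line.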

    You're on the command line, however, so there's a very wide choice of tools available: you could use XQuery just as easily as XPath, and you're certainly not restricted to the ancient XPath 1.0 version. For example, with Saxon you could run:

    java net.sf.saxon.Query -qs:'//item/link!substring-after(., "_")' -s:rss.xml 

    This uses the "bang" operator, available in XPath 3.0 and XQuery 3.0, which applies the expression on the right to every item selected by the expression on the left.
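As a rough analogy only, the behaviour of the "bang" operator can be sketched in Python. The function name `simple_map` is hypothetical and this is not how Saxon implements it; the sketch only illustrates that the right-hand expression runs once per left-hand item, in order, and that the results form one flat sequence.

```python
def simple_map(items, expr):
    """Rough analogue of XPath 3.0 `E1 ! E2`.

    Calls expr once per item, in order, and flattens the results into one
    list, since XPath sequences never nest.
    """
    out = []
    for item in items:
        result = expr(item)
        if isinstance(result, (list, tuple)):
            out.extend(result)
        else:
            out.append(result)
    return out


# The left-hand side can be any items, here plain strings rather than nodes.
urls = [
    "http://www.projecthoneypot.org/ip_92.204.241.167",
    "http://www.projecthoneypot.org/ip_181.24.239.244",
    "http://www.projecthoneypot.org/ip_193.243.195.66",
]
ips = simple_map(urls, lambda u: u.partition("_")[2])
print("\n".join(ips))

# An expression that yields several values per item is flattened, not nested.
parts = simple_map(["a_b", "c_d"], lambda s: s.split("_"))
print(parts)
```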

    • Like I mentioned in the comments, I would use xmlstarlet if there were only one expression. Do you know of any other similar CLI commands and examples?
      – T145
      Commented Jul 5, 2021 at 13:21
    • There also appears to be a Linux Saxon release: saxonica.com/download/c.xml
      – T145
      Commented Jul 5, 2021 at 14:55

    My Xidel is another tool to run modern XPath expressions:

    xidel rss.xml --xpath "//item/link/substring-after(., '_')" 
    • Hello, welcome to Stack Overflow. The tool looks interesting, but could you be more specific (with an example perhaps) about how it solves the exact problem the OP was asking about? Perhaps break down your example to show what it does?
      Commented Jul 21, 2021 at 14:54
    • Well, it is basically the same query as in the accepted answer (both / and ! can be used here interchangeably, because the left side is a list of nodes). Since Xidel and Saxon implement the same XPath standards, they can run the same queries and should give the same outputs. But the advantage is that Xidel is written in Pascal rather than Java, so it starts faster.
      – BeniBela
      Commented Jul 21, 2021 at 16:44
