
I am trying to parse a fairly simple web page for information in a shell script. The web page I'm working with now is generated here. For example, I would like to pull the information on the internet service provider into a shell variable. It may make sense to use one of the programs xmllint, XMLStarlet, or xpath for this purpose. I am quite familiar with shell scripting, but I am new to XPath syntax and the utilities that implement it, so I would appreciate a few pointers in the right direction.

Here's the beginnings of the shell script:

HTMLISPInformation="$(curl --user-agent "Mozilla/5.0" http://aruljohn.com/details.php)"
# ISP="$(<XPath magic goes here.>)"

For your convenience, here is a utility for dynamically testing XPath syntax online:

http://www.bit-101.com/xpath/

  • Take a look at this.
    – rae1
    Commented Dec 26, 2012 at 20:04

5 Answers


Quick and dirty solution...

xmllint --html --xpath "//table/tbody/tr[6]/td[2]" page.html 

You can find the XPath of a node using Chrome's Developer Tools: inspect the node, right-click it in the Elements panel, and select Copy XPath.
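To get the value into a shell variable, as the question asks, you can pipe curl straight into xmllint. A minimal sketch, assuming the ISP really sits in the sixth row and second column; note that Chrome inserts a tbody element into the DOM even when the source HTML has none, so a copied XPath may need tbody removed before xmllint will match:

```shell
# Fetch the page and extract one table cell; "-" tells xmllint to read stdin.
# 2>/dev/null hides the HTML parser's warnings about malformed markup.
# The indices tr[6]/td[2] are an assumption carried over from the command above.
ISP="$(curl -s --user-agent "Mozilla/5.0" http://aruljohn.com/details.php \
    | xmllint --html --xpath '//table/tr[6]/td[2]/text()' - 2>/dev/null)"
printf '%s\n' "$ISP"
```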

I wouldn't rely on this too much, though: positional XPath expressions like this one break as soon as the page layout changes.

All the information on that page can be found elsewhere: run whois on your own IP, for instance...
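A sketch of that alternative, with two caveats: ifconfig.me is an assumed public-IP echo service, and the organisation field name varies by regional registry, so the grep pattern is best-effort:

```shell
# Look up your own public IP (ifconfig.me is an assumption; any echo
# service works), then pull the organisation line from the whois record.
# Registries disagree on field names (OrgName, org-name, netname, ...).
ip="$(curl -s ifconfig.me)"
whois "$ip" | grep -iE '^(OrgName|org-name|netname):'
```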


You could use my Xidel. Extracting values from HTML pages on the command line is its main purpose. Although it is not a standard tool, it is a single, dependency-free binary and can be installed and run without being root.

It can directly read the value from the webpage without involving other programs.

With XPath:

xidel http://aruljohn.com/details.php -e '//td[text()="Internet Provider"]/following-sibling::td' 

Or with pattern-matching:

xidel http://aruljohn.com/details.php -e '<td>Internet Provider</td><td>{.}</td>' --hide-variable-names 
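Either form slots straight into the question's script; a sketch of capturing the result into a variable (assumes xidel is on the PATH):

```shell
# --silent suppresses xidel's status output so only the extracted
# value lands in the variable; the query matches the cell that follows
# the "Internet Provider" label rather than relying on row position.
ISP="$(xidel --silent http://aruljohn.com/details.php \
    -e '//td[text()="Internet Provider"]/following-sibling::td')"
printf 'ISP: %s\n' "$ISP"
```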

Consider using PhantomJS. It is a headless WebKit browser that lets you execute JavaScript/CoffeeScript against a web page. I think it could help you solve your issue.

pjscrape is a useful web-scraping tool built on PhantomJS.
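The simplest use here would be to dump the fully rendered HTML and then filter it with the usual shell tools. A sketch (the temporary script path is arbitrary):

```shell
# Write a tiny PhantomJS script that loads the page and prints the
# rendered DOM, then run it and save the output for further parsing.
cat > /tmp/dump.js <<'EOF'
var page = require('webpage').create();
page.open('http://aruljohn.com/details.php', function (status) {
    if (status === 'success') { console.log(page.content); }
    phantom.exit(status === 'success' ? 0 : 1);
});
EOF
phantomjs /tmp/dump.js > /tmp/rendered.html
```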

  • Thank you. I will take a look at it for my personal use. However, the task I hope to accomplish is to be done on a server on which I am not granted root access, which is why I mentioned standard tools such as xmllint.
    – d3pd
    Commented Dec 26, 2012 at 20:53
  • Do you need root access? You could just copy it into your user folder and run it from there.
    – asgoth
    Commented Dec 26, 2012 at 21:12

xpup

XML

A command-line XML parsing tool written in Go. For example:

$ curl -sL https://www.w3schools.com/xml/note.xml | xpup '/*/body'
Don't forget me this weekend!

or:

$ xpup '/note/from' < <(curl -sL https://www.w3schools.com/xml/note.xml)
Jani

HTML

Here is an example of parsing an HTML page:

$ xpup '/*/head/title' < <(curl -sL https://example.com/)
Example Domain

pup

For HTML parsing, try pup. For example:

$ pup 'title text{}' -f <(curl -sL https://example.com/)
Example Domain

See the related feature request for XPath.

Installation

Install by: go get github.com/ericchiang/pup


HTML-XML-utils

There are many command-line tools in the HTML-XML-utils package which can parse HTML files (e.g. hxselect to match a CSS selector).
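A sketch of the usual hxselect pipeline (example.com is a stand-in URL): hxselect wants well-formed input, so sloppy HTML is first passed through hxnormalize:

```shell
# hxnormalize -x rewrites loose HTML as well-formed XML, which hxselect
# requires; -c prints only the matched element's content, not its tags.
curl -s https://example.com/ | hxnormalize -x | hxselect -c 'title'
```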

There is also xpath, a command-line wrapper around Perl's XPath library (XML::XPath).
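Its flag syntax differs between versions of the wrapper; the -q/-e form below is the one shipped in Debian's libxml-xpath-perl package (an assumption to verify against your installed version):

```shell
# -q suppresses the "Found N nodes" banner; -e supplies the query.
# The note.xml sample is the same one used in the xpup answer above.
curl -sL https://www.w3schools.com/xml/note.xml > /tmp/note.xml
xpath -q -e '/note/from' /tmp/note.xml
```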

Related: Command line tool to query HTML elements at SU
