
I am trying to parse a fairly simple web page for information in a shell script. The web page I'm working with now is generated here. For example, I would like to pull the information on the internet service provider into a shell variable. It may make sense to use one of the programs xmllint, XMLStarlet, or xpath for this purpose. I am quite familiar with shell scripting, but I am new to XPath syntax and the utilities that implement it, so I would appreciate a few pointers in the right direction.

Here's the beginnings of the shell script:

HTMLISPInformation="$(curl --user-agent "Mozilla/5.0" http://aruljohn.com/details.php)"
# ISP="$(<XPath magic goes here.>)"

For your convenience, here is a utility for dynamically testing XPath syntax online:

http://www.bit-101.com/xpath/

  • Take a look at this.
    – rae1
    Commented Dec 26, 2012 at 20:04

5 Answers


Quick and dirty solution...

xmllint --html --xpath "//table/tbody/tr[6]/td[2]" page.html 

You can find the XPath of a node using Chrome's Developer Tools: inspect the node, right-click it in the Elements panel, and select Copy XPath.
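To get the value into a shell variable, as the question asks, you can pipe curl straight into xmllint. A minimal sketch, assuming the ISP really sits in the sixth row and second column; note that Chrome inserts a tbody element into the DOM even when the source HTML has none, so a copied XPath may need tbody removed before xmllint will match:

```shell
# Fetch the page and extract one table cell; "-" tells xmllint to read stdin.
# 2>/dev/null hides the HTML parser's warnings about malformed markup.
# The indices tr[6]/td[2] are an assumption carried over from the command above.
ISP="$(curl -s --user-agent "Mozilla/5.0" http://aruljohn.com/details.php \
    | xmllint --html --xpath '//table/tr[6]/td[2]/text()' - 2>/dev/null)"
printf '%s\n' "$ISP"
```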

I wouldn't rely on this too much, though: positional XPath expressions like this one break as soon as the page layout changes.

All the information on that page can be found elsewhere: run whois on your own IP, for instance...
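A sketch of that alternative, with two caveats: ifconfig.me is an assumed public-IP echo service, and the organisation field name varies by regional registry, so the grep pattern is best-effort:

```shell
# Look up your own public IP (ifconfig.me is an assumption; any echo
# service works), then pull the organisation line from the whois record.
# Registries disagree on field names (OrgName, org-name, netname, ...).
ip="$(curl -s ifconfig.me)"
whois "$ip" | grep -iE '^(OrgName|org-name|netname):'
```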


You could use my Xidel. Extracting values from HTML pages on the command line is its main purpose. Although it is not a standard tool, it is a single, dependency-free binary and can be installed and run without being root.

It can directly read the value from the webpage without involving other programs.

With XPath:

xidel http://aruljohn.com/details.php -e '//td[text()="Internet Provider"]/following-sibling::td' 

Or with pattern-matching:

xidel http://aruljohn.com/details.php -e '<td>Internet Provider</td><td>{.}</td>' --hide-variable-names 
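Either form slots straight into the question's script; a sketch of capturing the result into a variable (assumes xidel is on the PATH):

```shell
# --silent suppresses xidel's status output so only the extracted
# value lands in the variable; the query matches the cell that follows
# the "Internet Provider" label rather than relying on row position.
ISP="$(xidel --silent http://aruljohn.com/details.php \
    -e '//td[text()="Internet Provider"]/following-sibling::td')"
printf 'ISP: %s\n' "$ISP"
```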

Consider using PhantomJS. It is a headless WebKit browser that lets you execute JavaScript/CoffeeScript against a web page. I think it could help you solve your issue.

pjscrape is a useful web-scraping tool built on PhantomJS.
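The simplest use here would be to dump the fully rendered HTML and then filter it with the usual shell tools. A sketch (the temporary script path is arbitrary):

```shell
# Write a tiny PhantomJS script that loads the page and prints the
# rendered DOM, then run it and save the output for further parsing.
cat > /tmp/dump.js <<'EOF'
var page = require('webpage').create();
page.open('http://aruljohn.com/details.php', function (status) {
    if (status === 'success') { console.log(page.content); }
    phantom.exit(status === 'success' ? 0 : 1);
});
EOF
phantomjs /tmp/dump.js > /tmp/rendered.html
```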

  • Thank you. I will take a look at it for my personal use. However, the task I hope to accomplish is to be done on a server on which I am not granted root access, which is why I mentioned standard tools such as xmllint.
    – d3pd
    Commented Dec 26, 2012 at 20:53
  • Do you need root access? You could just copy it into your user folder and run it from there.
    – asgoth
    Commented Dec 26, 2012 at 21:12

xpup

XML

A command-line XML parsing tool written in Go. For example:

$ curl -sL https://www.w3schools.com/xml/note.xml | xpup '/*/body'
Don't forget me this weekend!

or:

$ xpup '/note/from' < <(curl -sL https://www.w3schools.com/xml/note.xml)
Jani

HTML

Here is an example of parsing an HTML page:

$ xpup '/*/head/title' < <(curl -sL https://example.com/)
Example Domain

pup

For HTML parsing, try pup. For example:

$ pup 'title text{}' -f <(curl -sL https://example.com/)
Example Domain

See the related feature request for XPath.

Installation

Install by: go get github.com/ericchiang/pup


HTML-XML-utils

There are many command-line tools in the HTML-XML-utils package which can parse HTML files (e.g. hxselect to match a CSS selector).
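A sketch of the usual hxselect pipeline (example.com is a stand-in URL): hxselect wants well-formed input, so sloppy HTML is first passed through hxnormalize:

```shell
# hxnormalize -x rewrites loose HTML as well-formed XML, which hxselect
# requires; -c prints only the matched element's content, not its tags.
curl -s https://example.com/ | hxnormalize -x | hxselect -c 'title'
```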

There is also xpath, a command-line wrapper around Perl's XPath library (XML::XPath).
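Its flag syntax differs between versions of the wrapper; the -q/-e form below is the one shipped in Debian's libxml-xpath-perl package (an assumption to verify against your installed version):

```shell
# -q suppresses the "Found N nodes" banner; -e supplies the query.
# The note.xml sample is the same one used in the xpup answer above.
curl -sL https://www.w3schools.com/xml/note.xml > /tmp/note.xml
xpath -q -e '/note/from' /tmp/note.xml
```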

Related: Command line tool to query HTML elements at SU
