Python Programming/Web
Making web requests and parsing the results is straightforward in Python, and several widely used modules help with this.
Urllib
Urllib is the built-in Python module for HTTP requests; the main article is Python Programming/Internet.
try:
    import urllib2  # Python 2
except (ModuleNotFoundError, ImportError):  # ModuleNotFoundError is 3.6+
    import urllib.request as urllib2  # Python 3: urlopen lives in urllib.request

url = 'https://www.google.com'
u = urllib2.urlopen(url)
content = u.read()  # content now has all of the html in google.com
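urllib also bundles URL-handling helpers in urllib.parse. As a brief sketch (the URL below is only an illustration), urlparse splits a URL into its components and parse_qs decodes the query string:

```python
from urllib.parse import urlparse, parse_qs

# Split an example URL into its components
parts = urlparse('https://www.google.com/search?q=test')
print(parts.netloc)           # www.google.com
print(parts.path)             # /search
print(parse_qs(parts.query))  # {'q': ['test']}
```

These helpers are useful on their own and are also what higher-level libraries build on.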
Requests
| Python HTTP for Humans | |
|---|---|
| PyPi Link | https://pypi.python.org/pypi/requests |
| Pip command | pip install requests |
The Python requests library simplifies HTTP requests. It has a function for each of the HTTP methods:
- GET (requests.get)
- POST (requests.post)
- HEAD (requests.head)
- PUT (requests.put)
- DELETE (requests.delete)
- OPTIONS (requests.options)
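Unlike GET, requests.post sends its data in the request body. As a sketch (example.com and the form fields are placeholders), we can build a POST with Request.prepare() and inspect what would go over the wire without sending anything:

```python
import requests

# Build a POST request without sending it
req = requests.Request('POST', 'http://example.com/login',
                       data={'user': 'alice'}).prepare()
print(req.method)                   # POST
print(req.body)                     # user=alice  (form-encoded body)
print(req.headers['Content-Type'])  # application/x-www-form-urlencoded
```

To actually send it, you would normally just call requests.post(url, data={...}).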
Basic request
import requests

url = 'https://www.google.com'
r = requests.get(url)
The response object
The response object returned by these functions exposes many attributes and methods.
>>> import requests
>>> r = requests.get('https://www.google.com')
>>> print(r)
<Response [200]>
>>> dir(r)  # dir() shows every attribute and method available on the object
['__attrs__', '__bool__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_content', '_content_consumed', '_next', 'apparent_encoding', 'close', 'connection', 'content', 'cookies', 'elapsed', 'encoding', 'headers', 'history', 'is_permanent_redirect', 'is_redirect', 'iter_content', 'iter_lines', 'json', 'links', 'next', 'ok', 'raise_for_status', 'raw', 'reason', 'request', 'status_code', 'text', 'url']
- r.content and r.text provide similar HTML content: r.content is the raw bytes, while r.text is the decoded string and is usually preferred.
- r.encoding is the encoding used to decode r.text.
- r.headers shows the headers returned by the server.
- r.is_redirect and r.is_permanent_redirect show whether the response was a redirect.
- r.iter_content iterates over the body in chunks of bytes. To convert the bytes to a string, decode them with the encoding in r.encoding.
- r.iter_lines is like r.iter_content, but iterates over each line of the body. It also yields bytes.
- r.json converts the body to a Python dict if the response is JSON.
- r.raw returns the underlying urllib3.response.HTTPResponse object.
- r.status_code returns the HTTP status code sent by the server. Code 200 means success, while 4xx and 5xx codes indicate errors.
- r.raise_for_status raises an exception if the status code indicates an error (4xx or 5xx).
- r.url returns the final URL of the request.
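To see a few of these attributes without a network call, we can populate a Response object by hand. This is only for illustration: normally requests fills all of this in for you, and the status code and body below are made up.

```python
import requests

# Construct a fake response the way requests would after a real request
resp = requests.models.Response()
resp.status_code = 404
resp._content = b'{"error": "not found"}'
resp.headers['Content-Type'] = 'application/json'

print(resp.ok)      # False, since 404 is an error code
print(resp.json())  # {'error': 'not found'}
try:
    resp.raise_for_status()
except requests.HTTPError:
    print('raise_for_status raised HTTPError')
```

In real code, calling r.raise_for_status() right after a request is a common way to fail fast on error responses.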
Authentication
Requests has built-in authentication. Here is an example with basic authentication.
import requests

r = requests.get('http://example.com', auth=requests.auth.HTTPBasicAuth('username', 'password'))
If it is Basic Authentication, you can just pass a tuple.
import requests

r = requests.get('http://example.com', auth=('username', 'password'))
The other authentication types are described in the requests documentation.
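For example, requests.auth also provides HTTPDigestAuth with the same interface, and an auth object can be attached to a Session so it applies to every request the session makes. A sketch, where the credentials and URLs are placeholders:

```python
import requests
from requests.auth import HTTPBasicAuth, HTTPDigestAuth

# Attach credentials to a session so every request it makes is authenticated
session = requests.Session()
session.auth = HTTPBasicAuth('username', 'password')
# r = session.get('http://example.com/protected')  # would send the Authorization header

# Digest authentication uses the same call pattern
digest = HTTPDigestAuth('username', 'password')
# r = requests.get('http://example.com/digest', auth=digest)
```

Using a Session also reuses the underlying connection, which is faster when making many requests to the same host.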
Queries
Query strings pass values through the URL. For example, when you make a Google search, the search URL has the form https://www.google.com/search?q=My+Search+Here&... . Everything after the ? is the query string, which takes the form url?name1=value1&name2=value2... . Requests can build these query strings automatically.
>>> import requests
>>> query = {'q': 'test'}
>>> r = requests.get('https://www.google.com/search', params=query)
>>> print(r.url)  # prints the final url
https://www.google.com/search?q=test
The real power shows with multiple parameters.
>>> import requests
>>> query = {'name': 'test', 'fakeparam': 'yes', 'anotherfakeparam': 'yes again'}
>>> r = requests.get('http://example.com', params=query)
>>> print(r.url)  # prints the final url
http://example.com/?name=test&fakeparam=yes&anotherfakeparam=yes+again
Not only does requests pass these values, it also percent-encodes special characters and whitespace so the final URL stays valid.
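The encoding can be checked offline too: requests.Request(...).prepare() builds the final URL without sending anything, so we can see how a space and a reserved character like & are escaped. (example.com and the parameter values are placeholders.)

```python
import requests

# Prepare a request to inspect the encoded URL without sending it
req = requests.Request('GET', 'http://example.com',
                       params={'name': 'test', 'q': 'a&b c'}).prepare()
print(req.url)  # http://example.com/?name=test&q=a%26b+c
```

The & inside the value becomes %26 so it is not confused with the separator between parameters, and the space becomes +.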
BeautifulSoup4
| Screen-scraping library | |
|---|---|
| PyPi Link | https://pypi.python.org/pypi/beautifulsoup4 |
| Pip command | pip install beautifulsoup4 |
| Import command | import bs4 |
BeautifulSoup4 is a powerful HTML parsing library. Let's try it on some example HTML.
>>> import bs4
>>> example_html = """<!DOCTYPE html>
... <html>
... <head>
... <title>Testing website</title>
... <style>.b{color: blue;}</style>
... </head>
... <body>
... <h1 class='b' id='hhh'>A Blue Header</h1>
... <p> I like blue text, I like blue text... </p>
... <p class='b'> This text is blue, yay yay yay!</p>
... <p class='b'>Check out the <a href='#hhh'>Blue Header</a></p>
... </body>
... </html>
... """
>>> bs = bs4.BeautifulSoup(example_html, 'html.parser')  # naming the parser avoids a warning
>>> print(bs)
<!DOCTYPE html>
<html>
<head>
<title>Testing website</title>
<style>.b{color: blue;}</style>
</head>
<body>
<h1 class="b" id="hhh">A Blue Header</h1>
<p> I like blue text, I like blue text... </p>
<p class="b"> This text is blue, yay yay yay!</p>
<p class="b">Check out the <a href="#hhh">Blue Header</a></p>
</body>
</html>
>>> print(bs.prettify())  # adds indentation and newlines
<!DOCTYPE html>
<html>
 <head>
  <title>
   Testing website
  </title>
  <style>
   .b{color: blue;}
  </style>
 </head>
 <body>
  <h1 class="b" id="hhh">
   A Blue Header
  </h1>
  <p>
   I like blue text, I like blue text...
  </p>
  <p class="b">
   This text is blue, yay yay yay!
  </p>
  <p class="b">
   Check out the
   <a href="#hhh">
    Blue Header
   </a>
  </p>
 </body>
</html>
Getting elements
There are two ways to access elements. The first is to type the tags manually, going down the tree in order, until you reach the tag you want.
>>> print(bs.html)
<html>
<head>
<title>Testing website</title>
<style>.b{color: blue;}</style>
</head>
<body>
<h1 class="b" id="hhh">A Blue Header</h1>
<p> I like blue text, I like blue text... </p>
<p class="b"> This text is blue, yay yay yay!</p>
<p class="b">Check out the <a href="#hhh">Blue Header</a></p>
</body>
</html>
>>> print(bs.html.body)
<body>
<h1 class="b" id="hhh">A Blue Header</h1>
<p> I like blue text, I like blue text... </p>
<p class="b"> This text is blue, yay yay yay!</p>
<p class="b">Check out the <a href="#hhh">Blue Header</a></p>
</body>
>>> print(bs.html.body.h1)
<h1 class="b" id="hhh">A Blue Header</h1>
However, this is inconvenient with large HTML documents. The find_all function finds all instances of a given element: it takes an HTML tag name, such as h1 or p, and returns every matching element.
>>> p = bs.find_all('p')
>>> p
[<p> I like blue text, I like blue text... </p>, <p class="b"> This text is blue, yay yay yay!</p>, <p class="b">Check out the <a href="#hhh">Blue Header</a></p>]
This is still inconvenient on a large website, where there can be thousands of matches. You can narrow the search by filtering on classes or ids.
>>> blue = bs.find_all('p', class_='b')
>>> blue
[<p class="b"> This text is blue, yay yay yay!</p>, <p class="b">Check out the <a href="#hhh">Blue Header</a></p>]

Because class is a reserved word in Python, BeautifulSoup uses the keyword argument class_ (note the trailing underscore); passing _class instead silently matches nothing, since it is treated as an attribute named _class. You can also write your own filter with a list comprehension:

>>> p = bs.find_all('p')
>>> blue = [e for e in p if 'b' in e.get('class', [])]
>>> blue
[<p class="b"> This text is blue, yay yay yay!</p>, <p class="b">Check out the <a href="#hhh">Blue Header</a></p>]

This looks up each element's class attribute (e.get('class', []) returns the list of classes, or an empty list if there are none) and keeps the elements whose classes include b. From the resulting list we can work with each element, such as retrieving the text inside.
>>> b = blue[0].text
>>> print(b)
 This text is blue, yay yay yay!
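A common scraping task is collecting every link on a page. Sticking with find_all, this sketch pulls the href attribute from each a tag of a small made-up document:

```python
import bs4

html = """<p>Visit <a href='https://example.com'>Example</a>
or jump to the <a href='#top'>top</a>.</p>"""

soup = bs4.BeautifulSoup(html, 'html.parser')
# Tag.get returns the attribute value, or None if the attribute is missing
links = [a.get('href') for a in soup.find_all('a')]
print(links)  # ['https://example.com', '#top']
```

Combined with requests, the same pattern works on live pages: fetch r.text, parse it with BeautifulSoup, then iterate over find_all('a').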