
Python Programming/Web

From Wikibooks, open books for an open world

Making web requests and parsing the results is straightforward in Python, and there are several must-have modules to help with this.

Urllib


Urllib is the built-in Python module for making HTTP requests; the main article is Python Programming/Internet.

try:
    import urllib2
except (ModuleNotFoundError, ImportError):  # ModuleNotFoundError is 3.6+
    import urllib.request as urllib2  # urlopen lives in urllib.request on Python 3

url = 'https://www.google.com'
u = urllib2.urlopen(url)
content = u.read()  # content now has all of the html in google.com

Requests

requests
Python HTTP for Humans
PyPI link: https://pypi.python.org/pypi/requests
Pip command: pip install requests

The Python requests library simplifies HTTP requests. It provides a function for each of the HTTP methods:

  • GET (requests.get)
  • POST (requests.post)
  • HEAD (requests.head)
  • PUT (requests.put)
  • DELETE (requests.delete)
  • OPTIONS (requests.options)
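Each of these helpers builds the same kind of request object under the hood. As a small sketch (the URL and form data below are placeholders), a request can be constructed and prepared without sending it, which shows what would actually go over the wire:

```python
import requests

# Build a POST request without sending it; prepare() produces the final
# method, URL, and urlencoded body that would be transmitted.
req = requests.Request('POST', 'http://example.com/form', data={'key': 'value'})
prepared = req.prepare()
print(prepared.method)  # POST
print(prepared.body)    # key=value
```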

Basic request

import requests

url = 'https://www.google.com'
r = requests.get(url)

The response object


The response object returned by these functions offers many attributes and methods for retrieving data.

>>> import requests
>>> r = requests.get('https://www.google.com')
>>> print(r)
<Response [200]>
>>> dir(r)  # dir shows all attributes and methods available on the object
['__attrs__', '__bool__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_content', '_content_consumed', '_next', 'apparent_encoding', 'close', 'connection', 'content', 'cookies', 'elapsed', 'encoding', 'headers', 'history', 'is_permanent_redirect', 'is_redirect', 'iter_content', 'iter_lines', 'json', 'links', 'next', 'ok', 'raise_for_status', 'raw', 'reason', 'request', 'status_code', 'text', 'url']
  • r.content holds the raw response body as bytes, while r.text holds it decoded to a string; r.text is usually preferred for HTML.
  • r.encoding is the text encoding used to decode r.text.
  • r.headers contains the headers returned by the server.
  • r.is_redirect and r.is_permanent_redirect show whether the original link was a redirect.
  • r.iter_content() iterates over the response body in chunks of bytes (one byte at a time by default). To convert the bytes to a string, decode them with the encoding in r.encoding.
  • r.iter_lines() is like r.iter_content(), but iterates over the body line by line. It also yields bytes.
  • r.json() parses the response body and returns a Python dict, if the response is JSON.
  • r.raw returns the underlying urllib3.response.HTTPResponse object.
  • r.status_code holds the HTTP status code sent by the server. Code 200 means success, while codes of 400 and above indicate errors. r.raise_for_status() raises an exception if the request failed.
  • r.url holds the final URL of the request, after any redirects.
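These attributes can be explored without touching the network by filling in a Response object by hand, a trick often used in test suites rather than normal usage (the JSON payload below is made up, and _content is a private field):

```python
import requests

# Construct a Response manually to see how its attributes behave;
# _content is the private field that holds the raw body bytes.
r = requests.models.Response()
r.status_code = 200
r.encoding = 'utf-8'
r._content = b'{"answer": 42}'

print(r.ok)      # True, because the status code is below 400
print(r.text)    # the body bytes decoded with r.encoding
print(r.json())  # {'answer': 42}
```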

Authentication


Requests has built-in authentication. Here is an example with basic authentication.

import requests

r = requests.get('http://example.com', auth=requests.auth.HTTPBasicAuth('username', 'password'))

For basic authentication, you can simply pass a tuple instead.

import requests

r = requests.get('http://example.com', auth=('username', 'password'))

All of the other authentication types are described in the requests documentation.
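Basic authentication works by adding an Authorization header that contains the base64-encoded username:password pair. This can be seen by preparing a request without sending it (the URL and credentials are placeholders):

```python
import requests
from requests.auth import HTTPBasicAuth

# Prepare (but do not send) an authenticated request to inspect the header.
req = requests.Request('GET', 'http://example.com',
                       auth=HTTPBasicAuth('username', 'password'))
prepared = req.prepare()
print(prepared.headers['Authorization'])  # Basic dXNlcm5hbWU6cGFzc3dvcmQ=
```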

Queries


Query strings pass values in a URL. For example, when you make a Google search, the search URL has the form https://www.google.com/search?q=My+Search+Here&.... Anything after the ? is the query string, which takes the form url?name1=value1&name2=value2.... Requests has a system for building these query strings automatically.

>>> import requests
>>> query = {'q': 'test'}
>>> r = requests.get('https://www.google.com/search', params=query)
>>> print(r.url)  # prints the final url
https://www.google.com/search?q=test

The convenience becomes clear with multiple entries.

>>> import requests
>>> query = {'name': 'test', 'fakeparam': 'yes', 'anotherfakeparam': 'yes again'}
>>> r = requests.get('http://example.com', params=query)
>>> print(r.url)  # prints the final url
http://example.com/?name=test&fakeparam=yes&anotherfakeparam=yes+again

Not only does requests pass these values, it also encodes special characters and whitespace into URL-safe equivalents.
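The same encoding is available in the standard library: urllib.parse.urlencode builds an identical query string, which makes the transformation easy to see without any network access:

```python
from urllib.parse import urlencode

# urlencode turns a dict into a name1=value1&name2=value2 string,
# replacing spaces and special characters with URL-safe equivalents.
query = {'name': 'test', 'fakeparam': 'yes', 'anotherfakeparam': 'yes again'}
print(urlencode(query))  # name=test&fakeparam=yes&anotherfakeparam=yes+again
```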

BeautifulSoup4

beautifulsoup4
Screen-scraping library
PyPI link: https://pypi.python.org/pypi/beautifulsoup4
Pip command: pip install beautifulsoup4
Import command: import bs4

BeautifulSoup4 is a powerful HTML parsing library. Let's try it on some example HTML.

>>> import bs4
>>> example_html = """<!DOCTYPE html>
... <html>
... <head>
... <title>Testing website</title>
... <style>.b {color: blue;}</style>
... </head>
... <body>
... <h1 class='b' id='hhh'>A Blue Header</h1>
... <p> I like blue text, I like blue text... </p>
... <p class='b'> This text is blue, yay yay yay!</p>
... <p class='b'>Check out the <a href='#hhh'>Blue Header</a></p>
... </body>
... </html>
... """
>>> bs = bs4.BeautifulSoup(example_html, 'html.parser')  # naming a parser avoids a warning
>>> print(bs)
<!DOCTYPE html>
<html>
<head>
<title>Testing website</title>
<style>.b {color: blue;}</style>
</head>
<body>
<h1 class="b" id="hhh">A Blue Header</h1>
<p> I like blue text, I like blue text... </p>
<p class="b"> This text is blue, yay yay yay!</p>
<p class="b">Check out the <a href="#hhh">Blue Header</a></p>
</body>
</html>
>>> print(bs.prettify())  # adds newlines and indentation
<!DOCTYPE html>
<html>
 <head>
  <title>
   Testing website
  </title>
  <style>
   .b {color: blue;}
  </style>
 </head>
 <body>
  <h1 class="b" id="hhh">
   A Blue Header
  </h1>
  <p>
   I like blue text, I like blue text...
  </p>
  <p class="b">
   This text is blue, yay yay yay!
  </p>
  <p class="b">
   Check out the
   <a href="#hhh">
    Blue Header
   </a>
  </p>
 </body>
</html>

Getting elements


There are two ways to access elements. The first is to type the tags manually, descending in order until you reach the tag you want.

>>> print(bs.html)
<html>
<head>
<title>Testing website</title>
<style>.b {color: blue;}</style>
</head>
<body>
<h1 class="b" id="hhh">A Blue Header</h1>
<p> I like blue text, I like blue text... </p>
<p class="b"> This text is blue, yay yay yay!</p>
<p class="b">Check out the <a href="#hhh">Blue Header</a></p>
</body>
</html>
>>> print(bs.html.body)
<body>
<h1 class="b" id="hhh">A Blue Header</h1>
<p> I like blue text, I like blue text... </p>
<p class="b"> This text is blue, yay yay yay!</p>
<p class="b">Check out the <a href="#hhh">Blue Header</a></p>
</body>
>>> print(bs.html.body.h1)
<h1 class="b" id="hhh">A Blue Header</h1>

However, this is inconvenient with a large HTML document. The find_all function finds all instances of a given element. It takes an HTML tag name, such as h1 or p, and returns every instance of that tag.

>>> p = bs.find_all('p')
>>> p
[<p> I like blue text, I like blue text... </p>, <p class="b"> This text is blue, yay yay yay!</p>, <p class="b">Check out the <a href="#hhh">Blue Header</a></p>]

This is still inconvenient on a large website, where there may be thousands of matches. You can narrow the search by filtering on classes or ids.

>>> blue = bs.find_all('p', class_='b')
>>> blue
[<p class="b"> This text is blue, yay yay yay!</p>, <p class="b">Check out the <a href="#hhh">Blue Header</a></p>]

Note the trailing underscore in class_: it is required because class is a reserved word in Python. The same filtering can also be written by hand, checking each element's attributes:

>>> p = bs.find_all('p')
>>> blue = [tag for tag in p if 'b' in tag.get('class', [])]
>>> blue
[<p class="b"> This text is blue, yay yay yay!</p>, <p class="b">Check out the <a href="#hhh">Blue Header</a></p>]

This uses tag.get('class', []), which returns the element's list of classes (or an empty list if it has none), and keeps only the elements whose classes include b. From the list, we can do something with each element, such as retrieve the text inside.

>>> b = blue[0].text
>>> print(b)
 This text is blue, yay yay yay!
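A common scraping task is collecting every link on a page. As a small sketch using a fragment of the example HTML from above (reproduced inline so the snippet is self-contained):

```python
import bs4

html = """<body>
<h1 class='b' id='hhh'>A Blue Header</h1>
<p class='b'>Check out the <a href='#hhh'>Blue Header</a></p>
</body>"""

soup = bs4.BeautifulSoup(html, 'html.parser')
# Pair each link's text with its href attribute.
links = [(a.text, a['href']) for a in soup.find_all('a')]
print(links)  # [('Blue Header', '#hhh')]
```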