Python Programming/Web
Making web requests and parsing the results is straightforward in Python, and several widely used modules help with this.
Urllib
Urllib is the built-in Python module for HTTP requests; the main article is Python Programming/Internet.
try:
    import urllib2  # Python 2
except (ModuleNotFoundError, ImportError):  # ModuleNotFoundError is 3.6+
    import urllib.request as urllib2  # Python 3: urlopen lives in urllib.request

url = 'https://www.google.com'
u = urllib2.urlopen(url)
content = u.read()  # content now has all of the html in google.com
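urllib also bundles URL-handling helpers in urllib.parse. As a brief sketch (the URL below is only an illustration), urlparse splits a URL into its components and parse_qs decodes the query string:

```python
from urllib.parse import urlparse, parse_qs

# Split an example URL into its components
parts = urlparse('https://www.google.com/search?q=test')
print(parts.netloc)           # www.google.com
print(parts.path)             # /search
print(parse_qs(parts.query))  # {'q': ['test']}
```

These helpers are useful on their own and are also what higher-level libraries build on.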
Requests
| Python HTTP for Humans | |
|---|---|
| PyPi Link | https://pypi.python.org/pypi/requests |
| Pip command | pip install requests |
The Python requests library simplifies HTTP requests. It has a function for each of the HTTP methods:
- GET (requests.get)
- POST (requests.post)
- HEAD (requests.head)
- PUT (requests.put)
- DELETE (requests.delete)
- OPTIONS (requests.options)
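Unlike GET, requests.post sends its data in the request body. As a sketch (example.com and the form fields are placeholders), we can build a POST with Request.prepare() and inspect what would go over the wire without sending anything:

```python
import requests

# Build a POST request without sending it
req = requests.Request('POST', 'http://example.com/login',
                       data={'user': 'alice'}).prepare()
print(req.method)                   # POST
print(req.body)                     # user=alice  (form-encoded body)
print(req.headers['Content-Type'])  # application/x-www-form-urlencoded
```

To actually send it, you would normally just call requests.post(url, data={...}).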
Basic request
import requests

url = 'https://www.google.com'
r = requests.get(url)
The response object
The response object returned by these functions exposes many attributes and methods.
>>> import requests
>>> r = requests.get('https://www.google.com')
>>> print(r)
<Response [200]>
>>> dir(r)  # dir() shows every attribute and method available on the object
['__attrs__', '__bool__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_content', '_content_consumed', '_next', 'apparent_encoding', 'close', 'connection', 'content', 'cookies', 'elapsed', 'encoding', 'headers', 'history', 'is_permanent_redirect', 'is_redirect', 'iter_content', 'iter_lines', 'json', 'links', 'next', 'ok', 'raise_for_status', 'raw', 'reason', 'request', 'status_code', 'text', 'url']
- r.content and r.text provide similar HTML content: r.content is the raw bytes, while r.text is the decoded string and is usually preferred.
- r.encoding is the encoding used to decode r.text.
- r.headers shows the headers returned by the server.
- r.is_redirect and r.is_permanent_redirect show whether the response was a redirect.
- r.iter_content iterates over the body in chunks of bytes. To convert the bytes to a string, decode them with the encoding in r.encoding.
- r.iter_lines is like r.iter_content, but iterates over each line of the body. It also yields bytes.
- r.json converts the body to a Python dict if the response is JSON.
- r.raw returns the underlying urllib3.response.HTTPResponse object.
- r.status_code returns the HTTP status code sent by the server. Code 200 means success, while 4xx and 5xx codes indicate errors.
- r.raise_for_status raises an exception if the status code indicates an error (4xx or 5xx).
- r.url returns the final URL of the request.
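To see a few of these attributes without a network call, we can populate a Response object by hand. This is only for illustration: normally requests fills all of this in for you, and the status code and body below are made up.

```python
import requests

# Construct a fake response the way requests would after a real request
resp = requests.models.Response()
resp.status_code = 404
resp._content = b'{"error": "not found"}'
resp.headers['Content-Type'] = 'application/json'

print(resp.ok)      # False, since 404 is an error code
print(resp.json())  # {'error': 'not found'}
try:
    resp.raise_for_status()
except requests.HTTPError:
    print('raise_for_status raised HTTPError')
```

In real code, calling r.raise_for_status() right after a request is a common way to fail fast on error responses.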
Authentication
Requests has built-in authentication. Here is an example with basic authentication.
import requests

r = requests.get('http://example.com', auth=requests.auth.HTTPBasicAuth('username', 'password'))
If it is Basic Authentication, you can just pass a tuple.
import requests

r = requests.get('http://example.com', auth=('username', 'password'))
The other authentication types are described in the requests documentation.
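For example, requests.auth also provides HTTPDigestAuth with the same interface, and an auth object can be attached to a Session so it applies to every request the session makes. A sketch, where the credentials and URLs are placeholders:

```python
import requests
from requests.auth import HTTPBasicAuth, HTTPDigestAuth

# Attach credentials to a session so every request it makes is authenticated
session = requests.Session()
session.auth = HTTPBasicAuth('username', 'password')
# r = session.get('http://example.com/protected')  # would send the Authorization header

# Digest authentication uses the same call pattern
digest = HTTPDigestAuth('username', 'password')
# r = requests.get('http://example.com/digest', auth=digest)
```

Using a Session also reuses the underlying connection, which is faster when making many requests to the same host.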
Queries
Query strings pass values through the URL. For example, when you make a Google search, the search URL has the form https://www.google.com/search?q=My+Search+Here&... . Everything after the ? is the query string, which takes the form url?name1=value1&name2=value2... . Requests can build these query strings automatically.
>>> import requests
>>> query = {'q': 'test'}
>>> r = requests.get('https://www.google.com/search', params=query)
>>> print(r.url)  # prints the final url
https://www.google.com/search?q=test
The real power shows with multiple parameters.
>>> import requests
>>> query = {'name': 'test', 'fakeparam': 'yes', 'anotherfakeparam': 'yes again'}
>>> r = requests.get('http://example.com', params=query)
>>> print(r.url)  # prints the final url
http://example.com/?name=test&fakeparam=yes&anotherfakeparam=yes+again
Not only does requests pass these values, it also percent-encodes special characters and whitespace so the final URL stays valid.
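The encoding can be checked offline too: requests.Request(...).prepare() builds the final URL without sending anything, so we can see how a space and a reserved character like & are escaped. (example.com and the parameter values are placeholders.)

```python
import requests

# Prepare a request to inspect the encoded URL without sending it
req = requests.Request('GET', 'http://example.com',
                       params={'name': 'test', 'q': 'a&b c'}).prepare()
print(req.url)  # http://example.com/?name=test&q=a%26b+c
```

The & inside the value becomes %26 so it is not confused with the separator between parameters, and the space becomes +.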
BeautifulSoup4
| Screen-scraping library | |
|---|---|
| PyPi Link | https://pypi.python.org/pypi/beautifulsoup4 |
| Pip command | pip install beautifulsoup4 |
| Import command | import bs4 |
BeautifulSoup4 is a powerful HTML parsing library. Let's try it on some example HTML.
>>> import bs4
>>> example_html = """<!DOCTYPE html>
... <html>
... <head>
... <title>Testing website</title>
... <style>.b{color: blue;}</style>
... </head>
... <body>
... <h1 class='b' id='hhh'>A Blue Header</h1>
... <p> I like blue text, I like blue text... </p>
... <p class='b'> This text is blue, yay yay yay!</p>
... <p class='b'>Check out the <a href='#hhh'>Blue Header</a></p>
... </body>
... </html>
... """
>>> bs = bs4.BeautifulSoup(example_html, 'html.parser')  # naming the parser avoids a warning
>>> print(bs)
<!DOCTYPE html>
<html>
<head>
<title>Testing website</title>
<style>.b{color: blue;}</style>
</head>
<body>
<h1 class="b" id="hhh">A Blue Header</h1>
<p> I like blue text, I like blue text... </p>
<p class="b"> This text is blue, yay yay yay!</p>
<p class="b">Check out the <a href="#hhh">Blue Header</a></p>
</body>
</html>
>>> print(bs.prettify())  # adds indentation and newlines
<!DOCTYPE html>
<html>
 <head>
  <title>
   Testing website
  </title>
  <style>
   .b{color: blue;}
  </style>
 </head>
 <body>
  <h1 class="b" id="hhh">
   A Blue Header
  </h1>
  <p>
   I like blue text, I like blue text...
  </p>
  <p class="b">
   This text is blue, yay yay yay!
  </p>
  <p class="b">
   Check out the
   <a href="#hhh">
    Blue Header
   </a>
  </p>
 </body>
</html>
Getting elements
There are two ways to access elements. The first is to type the tags manually, going down the tree in order, until you reach the tag you want.
>>> print(bs.html)
<html>
<head>
<title>Testing website</title>
<style>.b{color: blue;}</style>
</head>
<body>
<h1 class="b" id="hhh">A Blue Header</h1>
<p> I like blue text, I like blue text... </p>
<p class="b"> This text is blue, yay yay yay!</p>
<p class="b">Check out the <a href="#hhh">Blue Header</a></p>
</body>
</html>
>>> print(bs.html.body)
<body>
<h1 class="b" id="hhh">A Blue Header</h1>
<p> I like blue text, I like blue text... </p>
<p class="b"> This text is blue, yay yay yay!</p>
<p class="b">Check out the <a href="#hhh">Blue Header</a></p>
</body>
>>> print(bs.html.body.h1)
<h1 class="b" id="hhh">A Blue Header</h1>
However, this is inconvenient with large HTML documents. The find_all function finds all instances of a given element: it takes an HTML tag name, such as h1 or p, and returns every matching element.
>>> p = bs.find_all('p')
>>> p
[<p> I like blue text, I like blue text... </p>, <p class="b"> This text is blue, yay yay yay!</p>, <p class="b">Check out the <a href="#hhh">Blue Header</a></p>]
This is still inconvenient on a large website, where there can be thousands of matches. You can narrow the search by filtering on classes or ids.
>>> blue = bs.find_all('p', class_='b')
>>> blue
[<p class="b"> This text is blue, yay yay yay!</p>, <p class="b">Check out the <a href="#hhh">Blue Header</a></p>]

Because class is a reserved word in Python, BeautifulSoup uses the keyword argument class_ (note the trailing underscore); passing _class instead silently matches nothing, since it is treated as an attribute named _class. You can also write your own filter with a list comprehension:

>>> p = bs.find_all('p')
>>> blue = [e for e in p if 'b' in e.get('class', [])]
>>> blue
[<p class="b"> This text is blue, yay yay yay!</p>, <p class="b">Check out the <a href="#hhh">Blue Header</a></p>]

This looks up each element's class attribute (e.get('class', []) returns the list of classes, or an empty list if there are none) and keeps the elements whose classes include b. From the resulting list we can work with each element, such as retrieving the text inside.
>>> b = blue[0].text
>>> print(b)
 This text is blue, yay yay yay!
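A common scraping task is collecting every link on a page. Sticking with find_all, this sketch pulls the href attribute from each a tag of a small made-up document:

```python
import bs4

html = """<p>Visit <a href='https://example.com'>Example</a>
or jump to the <a href='#top'>top</a>.</p>"""

soup = bs4.BeautifulSoup(html, 'html.parser')
# Tag.get returns the attribute value, or None if the attribute is missing
links = [a.get('href') for a in soup.find_all('a')]
print(links)  # ['https://example.com', '#top']
```

Combined with requests, the same pattern works on live pages: fetch r.text, parse it with BeautifulSoup, then iterate over find_all('a').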