So, this is my first web scraper (or part of it at least), and I'm looking for things I may have done wrong, or things that could be improved, so I can learn from my mistakes.
I made a few short functions that take a user and search tpb for the maximum number of pages of content that user has, by requesting the URL, parsing the HTML for the page links, and then crawling through them to see whether each page is valid (since some linked pages actually have no content).
I know I haven't covered all cases, such as invalid URLs, non-existent users, etc. I'm just trying to find what I've done wrong so far in parsing existing content before I go further.
There are two main things I'm concerned about:
First, I have filter/map/lambda combos all over the place here. From what I remember, these are generally on the slow side, but they seemed like a fairly easy and concise way to get the filtering I needed (although not the prettiest). So, is this acceptable, and/or is there a better way with bs4 or another alternative?
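To make that concrete, this is the sort of combo I mean (it mirrors the row filtering in check_page_content below), next to the list comprehension I was considering instead; as far as I can tell they keep the same rows:

# filter/map/lambda version, as used in check_page_content below
rows = list(filter(None, map(lambda row: row if row.findChildren('div', {'class': 'detName'}) else None, rows)))

# list comprehension I was wondering about as a replacement
rows = [row for row in rows if row.findChildren('div', {'class': 'detName'})]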
Second, since BeautifulSoup already works recursively when finding nested tags, does calling my get_max_pages function recursively really matter here?
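To put it another way, get_max_pages ends with return get_max_pages(...), and I'm not sure whether that recursion buys me anything over a plain loop. This is a rough sketch of the loop shape I'd be comparing it against (it leans on get_url, request_html and check_page_content from the code below, and isn't meant to be an exact drop-in):

def get_max_pages_loop(user):
    # Same idea as get_max_pages below, but iterating instead of recursing
    links = []
    placeholder = 0
    valid = True
    while valid:
        url = get_url(user, placeholder, 3)
        td = BeautifulSoup(request_html(url), "html.parser").find('td', {'colspan': 9, 'style': 'text-align:center;'})
        pg_nums = td.findAll('a', {'href': re.compile(r"/user/%s/\d{,3}/\d{,2}" % user)})
        pages = [int(a.text) - 1 for a in pg_nums if a.text]
        pages = [p for p in pages if p > placeholder]
        if not pages:
            break
        for page in pages:
            if check_page_content(get_url(user, page, 3)):
                links.append(page)
            else:
                valid = False
                break
        placeholder = len(links)
    return links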
import requests, urllib.request, re
from bs4 import BeautifulSoup

BASE_USER = "https://thepiratebay.se/user/"


def request_html(url):
    # Fetch a page with a browser-like User-Agent so the request isn't rejected
    hdr = {'User-Agent': 'Mozilla/5.0'}
    request = urllib.request.Request(url, headers=hdr)
    html = urllib.request.urlopen(request)
    return html


def get_url(*args):
    for arg in args:
        url = BASE_USER + "%s/" % arg
    return url


def check_page_content(url):
    # A page counts as valid if it has at least one row containing a torrent name (div.detName)
    bs = BeautifulSoup(request_html(url), "html.parser")
    rows = bs.findAll("tr")
    rows = list(filter(None, map(lambda row: row if row.findChildren('div', {'class': 'detName'}) else None, rows)))
    return True if rows else False


def get_max_pages(user, url=None, placeholder=0, links=list(), valid=True):
    if url is None:
        url = get_url(user, str(placeholder), 3)
    if valid:
        # The pagination links live in a centre-aligned <td> spanning the results table
        td = BeautifulSoup(request_html(url), "html.parser").find('td', {'colspan': 9, 'style': 'text-align:center;'})
        pg_nums = td.findAll('a', {'href': re.compile(r"/user/%s/\d{,3}/\d{,2}" % user)})
        pages = list(filter(None, map(lambda a: int(a.text) - 1 if a.text else None, pg_nums)))
        if links:
            # Unnecessary to filter this at start as placeholder = 0
            pages = list(filter(None, map(lambda x: x if int(x) > placeholder else None, pages)))
        if pages:
            i = 0
            while valid and i < len(pages):
                element = pages[i]
                valid_page = check_page_content(get_url(user, element, 3))
                if valid_page:
                    links.append(element)
                else:
                    valid = False
                i += 1
            # Recurse from the last confirmed page to pick up any further pagination links
            return get_max_pages(user, get_url(user, len(links), 3), len(links), links, valid)
        else:
            return links
    else:
        return links
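For reference, I kick it off with just the username, something like this (the username here is only an example):

if __name__ == '__main__':
    # 'some_user' is a placeholder for whichever tpb user I'm scraping
    valid_pages = get_max_pages('some_user')
    print(valid_pages)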
Also, if anyone wouldn't mind: would it be viable to split this into threads for, say, the content validation? I don't see how that would really improve things all that much with Python's GIL, but I'm curious nonetheless.
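For what it's worth, the part I'd try to parallelise is just the check_page_content calls, since those are network-bound and (from what I've read) the GIL is released while waiting on I/O. This is a rough sketch of what I had in mind, using concurrent.futures; the helper name and max_workers value are just placeholders:

from concurrent.futures import ThreadPoolExecutor

def check_pages_concurrently(user, page_numbers, max_workers=8):
    # Validate candidate pages in parallel; each check is an HTTP request,
    # so the threads spend most of their time waiting on the network
    urls = [get_url(user, page, 3) for page in page_numbers]
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = list(executor.map(check_page_content, urls))
    # Map each page number to whether it actually had content
    return dict(zip(page_numbers, results))

One thing I'm unsure about is that this checks every candidate page instead of stopping at the first empty one like the current loop does, so it trades a few wasted requests for the overlap.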
Finding ~35 pages of valid content takes about 1.5 minutes.