So, this is my first web scraper (or part of it at least), and I'm looking for things I may have done wrong, or things that could be improved, so I can learn from my mistakes.
I made a few short functions that take a user and search tpb for the maximum number of pages of content that user has, by requesting the URL, parsing the HTML for the page links, and then crawling through them to see whether each page is valid (since some linked pages actually have no content).
I know I haven't covered all cases, such as invalid URLs, non-existent users, etc. I'm just trying to find what I've done wrong so far in parsing existing content before I go further.
There are two main things I'm concerned about:
First, I have filter/map/lambda combos all over the place here. From what I remember, these are generally on the slow side, but they seemed like a fairly easy and concise way to get the filtering I needed (although not the prettiest). So, is this acceptable, and/or is there a better way with bs4 or another alternative?
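To make that concrete, this is the sort of combo I mean (it mirrors the row filtering in check_page_content below), next to the list comprehension I was considering instead; as far as I can tell they keep the same rows:

# filter/map/lambda version, as used in check_page_content below
rows = list(filter(None, map(lambda row: row if row.findChildren('div', {'class': 'detName'}) else None, rows)))

# list comprehension I was wondering about as a replacement
rows = [row for row in rows if row.findChildren('div', {'class': 'detName'})]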
Second, since BeautifulSoup already works recursively when finding nested tags, does calling my get_max_pages function recursively really matter here?
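To put it another way, get_max_pages ends with return get_max_pages(...), and I'm not sure whether that recursion buys me anything over a plain loop. This is a rough sketch of the loop shape I'd be comparing it against (it leans on get_url, request_html and check_page_content from the code below, and isn't meant to be an exact drop-in):

def get_max_pages_loop(user):
    # Same idea as get_max_pages below, but iterating instead of recursing
    links = []
    placeholder = 0
    valid = True
    while valid:
        url = get_url(user, placeholder, 3)
        td = BeautifulSoup(request_html(url), "html.parser").find('td', {'colspan': 9, 'style': 'text-align:center;'})
        pg_nums = td.findAll('a', {'href': re.compile(r"/user/%s/\d{,3}/\d{,2}" % user)})
        pages = [int(a.text) - 1 for a in pg_nums if a.text]
        pages = [p for p in pages if p > placeholder]
        if not pages:
            break
        for page in pages:
            if check_page_content(get_url(user, page, 3)):
                links.append(page)
            else:
                valid = False
                break
        placeholder = len(links)
    return links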
import requests, urllib.request, re
from bs4 import BeautifulSoup

BASE_USER = "https://thepiratebay.se/user/"


def request_html(url):
    # Fetch a page with a browser-like User-Agent so the request isn't rejected
    hdr = {'User-Agent': 'Mozilla/5.0'}
    request = urllib.request.Request(url, headers=hdr)
    html = urllib.request.urlopen(request)
    return html


def get_url(*args):
    for arg in args:
        url = BASE_USER + "%s/" % arg
    return url


def check_page_content(url):
    # A page counts as valid if it has at least one row containing a torrent name (div.detName)
    bs = BeautifulSoup(request_html(url), "html.parser")
    rows = bs.findAll("tr")
    rows = list(filter(None, map(lambda row: row if row.findChildren('div', {'class': 'detName'}) else None, rows)))
    return True if rows else False


def get_max_pages(user, url=None, placeholder=0, links=list(), valid=True):
    if url is None:
        url = get_url(user, str(placeholder), 3)
    if valid:
        # The pagination links live in a centre-aligned <td> spanning the results table
        td = BeautifulSoup(request_html(url), "html.parser").find('td', {'colspan': 9, 'style': 'text-align:center;'})
        pg_nums = td.findAll('a', {'href': re.compile(r"/user/%s/\d{,3}/\d{,2}" % user)})
        pages = list(filter(None, map(lambda a: int(a.text) - 1 if a.text else None, pg_nums)))
        if links:
            # Unnecessary to filter this at start as placeholder = 0
            pages = list(filter(None, map(lambda x: x if int(x) > placeholder else None, pages)))
        if pages:
            i = 0
            while valid and i < len(pages):
                element = pages[i]
                valid_page = check_page_content(get_url(user, element, 3))
                if valid_page:
                    links.append(element)
                else:
                    valid = False
                i += 1
            # Recurse from the last confirmed page to pick up any further pagination links
            return get_max_pages(user, get_url(user, len(links), 3), len(links), links, valid)
        else:
            return links
    else:
        return links
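For reference, I kick it off with just the username, something like this (the username here is only an example):

if __name__ == '__main__':
    # 'some_user' is a placeholder for whichever tpb user I'm scraping
    valid_pages = get_max_pages('some_user')
    print(valid_pages)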
Also, if anyone wouldn't mind: would it be viable to split this into threads for, say, the content validation? I don't see how that would really improve things all that much with Python's GIL, but I'm curious nonetheless.
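For what it's worth, the part I'd try to parallelise is just the check_page_content calls, since those are network-bound and (from what I've read) the GIL is released while waiting on I/O. This is a rough sketch of what I had in mind, using concurrent.futures; the helper name and max_workers value are just placeholders:

from concurrent.futures import ThreadPoolExecutor

def check_pages_concurrently(user, page_numbers, max_workers=8):
    # Validate candidate pages in parallel; each check is an HTTP request,
    # so the threads spend most of their time waiting on the network
    urls = [get_url(user, page, 3) for page in page_numbers]
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = list(executor.map(check_page_content, urls))
    # Map each page number to whether it actually had content
    return dict(zip(page_numbers, results))

One thing I'm unsure about is that this checks every candidate page instead of stopping at the first empty one like the current loop does, so it trades a few wasted requests for the overlap.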
Finding ~35 pages of valid content takes about 1.5 minutes.