I'm currently working on my first web-scraping project and I need to scrape a lot of websites. With my current code it takes more than a day, but for my project I need to scan the same websites every 5 days or so, and one day of scanning is too long. From what I have read so far, my options are threading or an asynchronous requests library like grequests. Here is my code:

    import requests
    from bs4 import BeautifulSoup

    def sortiment(Supermarkt_link, Supermarkt_name):
        # Resume from the last page number saved in progress.txt.
        with open("progress.txt", "r") as t:
            lines = t.readlines()
        paused_loop_start = int(lines[0])

        for e in range(paused_loop_start + 1, get_max_product_number(Supermarkt_link)):
            json_list = []
            # Landing page e of this supermarket's product listing.
            page = requests.get(f"https://www.supermarktcheck.de{Supermarkt_link}sortiment/?page={e}")
            soup = BeautifulSoup(page.content, "html.parser")
            products_list = soup.find_all("a", class_="h3", href=True)
            for single_product in products_list:
                product_string = single_product["href"]
                # Fetch each individual product page.
                product_page = requests.get(f"https://www.supermarktcheck.de/{product_string}")

This code scrapes a landing page and then scans each individual page linked from it for information that is then saved into a database.

  • What does get_max_product_number do? Can you post the entire script? – C.Nivs, Jul 14, 2023 at 16:56
  • Please add all the imports required to run this script. – Jul 14, 2023 at 17:59

2 Answers


Using asynchronous requests is a go-to method.

But you can also check out multithreading. Why not run many requests concurrently to maximize performance?
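A minimal sketch using the standard library's concurrent.futures (the URL list below is a hypothetical stand-in for the product pages collected in the question's loop):

    from concurrent.futures import ThreadPoolExecutor

    import requests
    from bs4 import BeautifulSoup

    def fetch(url):
        """Download one page and return the parsed soup."""
        page = requests.get(url, timeout=10)
        page.raise_for_status()
        return BeautifulSoup(page.content, "html.parser")

    # Hypothetical product URLs; in the real script these come from
    # the landing pages.
    urls = [f"https://www.supermarktcheck.de/produkt/{i}" for i in range(100)]

    # Each thread spends most of its time waiting on the network,
    # so several downloads overlap instead of running one after another.
    with ThreadPoolExecutor(max_workers=8) as pool:
        soups = list(pool.map(fetch, urls))

Note that CPython threads do not parallelize CPU-bound work because of the GIL, but scraping is dominated by network waits, which threads overlap just fine.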


Even without changing the structure of this code, using a requests Session could boost performance, since it reuses the underlying connection across requests to the same host.
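A minimal sketch, assuming the same site as in the question (the supermarket path is a placeholder):

    import requests
    from bs4 import BeautifulSoup

    # A Session keeps the TCP/TLS connection to the host alive,
    # so each request skips a fresh handshake.
    session = requests.Session()

    page = session.get("https://www.supermarktcheck.de/some-supermarkt/sortiment/?page=1")
    soup = BeautifulSoup(page.content, "html.parser")
    for link in soup.find_all("a", class_="h3", href=True):
        product_page = session.get(f"https://www.supermarktcheck.de/{link['href']}")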

And indeed you need to add some kind of parallel processing. Python has several options for that (threads, processes, asyncio), and there are many different approaches; one is sketched below.
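For instance, the asynchronous route using aiohttp (a third-party library, shown here instead of the grequests mentioned in the question; the page range is made up):

    import asyncio

    import aiohttp

    async def fetch(session, url):
        # While this request waits on the network, the event loop
        # runs the other fetches.
        async with session.get(url) as response:
            return await response.text()

    async def main(urls):
        async with aiohttp.ClientSession() as session:
            # Schedule all requests concurrently and collect the results.
            return await asyncio.gather(*(fetch(session, url) for url in urls))

    urls = [f"https://www.supermarktcheck.de/some-supermarkt/sortiment/?page={i}"
            for i in range(1, 11)]
    pages = asyncio.run(main(urls))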

If you can spread concurrent requests among different websites, even better. You can normally crawl more than one page at a time on the same web server, but you don't want to overwhelm it, and quite likely there is rate limiting in place. Consider yourself lucky if you don't have to deal with Cloudflare captchas. One way to throttle is sketched below.
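A sketch of throttling with asyncio.Semaphore (the limit of 5 is an arbitrary assumption; use whatever the server tolerates):

    import asyncio

    import aiohttp

    async def polite_fetch(semaphore, session, url):
        # Wait for a free slot first, so at most N requests are
        # in flight against the same server at any moment.
        async with semaphore:
            async with session.get(url) as response:
                return await response.text()

    async def main(urls):
        semaphore = asyncio.Semaphore(5)  # assumed limit, not a recommendation
        async with aiohttp.ClientSession() as session:
            return await asyncio.gather(
                *(polite_fetch(semaphore, session, url) for url in urls)
            )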

It's not clear what the purpose of json_list or paused_loop_start is, or how get_max_product_number works. Maybe that function is slowing you down even further.

