I'm currently working on my first web-scraping project and I need to scrape a lot of websites. With my current code it takes more than a day, but for my project I need to scan the same websites every 5 days or so, and one day of scanning is too long. From what I have read so far, my options are threading or an asynchronous requests library like grequests. Here is my code:

    import requests
    from bs4 import BeautifulSoup

    def sortiment(Supermarkt_link, Supermarkt_name):
        # Resume from the last page number saved in progress.txt.
        with open("progress.txt", "r") as t:
            lines = t.readlines()
        paused_loop_start = int(lines[0])

        for e in range(paused_loop_start + 1, get_max_product_number(Supermarkt_link)):
            json_list = []
            # Landing page e of this supermarket's product listing.
            page = requests.get(f"https://www.supermarktcheck.de{Supermarkt_link}sortiment/?page={e}")
            soup = BeautifulSoup(page.content, "html.parser")
            products_list = soup.find_all("a", class_="h3", href=True)
            for single_product in products_list:
                product_string = single_product["href"]
                # Fetch each individual product page.
                product_page = requests.get(f"https://www.supermarktcheck.de/{product_string}")

This code scrapes a landing page and then scans each individual page linked from it for information that is then saved into a database.

  • What does get_max_product_number do? Can you post the entire script? – C.Nivs, Jul 14, 2023 at 16:56
  • Please add all the imports required to run this script. – Jul 14, 2023 at 17:59

2 Answers


Using asynchronous requests is a go-to method.

But you can also check out multithreading. Why not run many requests concurrently to maximize performance?
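A minimal sketch using the standard library's concurrent.futures (the URL list below is a hypothetical stand-in for the product pages collected in the question's loop):

    from concurrent.futures import ThreadPoolExecutor

    import requests
    from bs4 import BeautifulSoup

    def fetch(url):
        """Download one page and return the parsed soup."""
        page = requests.get(url, timeout=10)
        page.raise_for_status()
        return BeautifulSoup(page.content, "html.parser")

    # Hypothetical product URLs; in the real script these come from
    # the landing pages.
    urls = [f"https://www.supermarktcheck.de/produkt/{i}" for i in range(100)]

    # Each thread spends most of its time waiting on the network,
    # so several downloads overlap instead of running one after another.
    with ThreadPoolExecutor(max_workers=8) as pool:
        soups = list(pool.map(fetch, urls))

Note that CPython threads do not parallelize CPU-bound work because of the GIL, but scraping is dominated by network waits, which threads overlap just fine.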


Even without changing the structure of this code, using a requests Session could boost performance, since it reuses the underlying connection across requests to the same host.
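A minimal sketch, assuming the same site as in the question (the supermarket path is a placeholder):

    import requests
    from bs4 import BeautifulSoup

    # A Session keeps the TCP/TLS connection to the host alive,
    # so each request skips a fresh handshake.
    session = requests.Session()

    page = session.get("https://www.supermarktcheck.de/some-supermarkt/sortiment/?page=1")
    soup = BeautifulSoup(page.content, "html.parser")
    for link in soup.find_all("a", class_="h3", href=True):
        product_page = session.get(f"https://www.supermarktcheck.de/{link['href']}")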

And indeed you need to add some kind of parallel processing. Python has several options for that (threads, processes, asyncio), and there are many different approaches; one is sketched below.
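For instance, the asynchronous route using aiohttp (a third-party library, shown here instead of the grequests mentioned in the question; the page range is made up):

    import asyncio

    import aiohttp

    async def fetch(session, url):
        # While this request waits on the network, the event loop
        # runs the other fetches.
        async with session.get(url) as response:
            return await response.text()

    async def main(urls):
        async with aiohttp.ClientSession() as session:
            # Schedule all requests concurrently and collect the results.
            return await asyncio.gather(*(fetch(session, url) for url in urls))

    urls = [f"https://www.supermarktcheck.de/some-supermarkt/sortiment/?page={i}"
            for i in range(1, 11)]
    pages = asyncio.run(main(urls))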

If you can spread concurrent requests among different websites, even better. You can normally crawl more than one page at a time on the same web server, but you don't want to overwhelm it, and quite likely there is rate limiting in place. Consider yourself lucky if you don't have to deal with Cloudflare captchas. One way to throttle is sketched below.
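A sketch of throttling with asyncio.Semaphore (the limit of 5 is an arbitrary assumption; use whatever the server tolerates):

    import asyncio

    import aiohttp

    async def polite_fetch(semaphore, session, url):
        # Wait for a free slot first, so at most N requests are
        # in flight against the same server at any moment.
        async with semaphore:
            async with session.get(url) as response:
                return await response.text()

    async def main(urls):
        semaphore = asyncio.Semaphore(5)  # assumed limit, not a recommendation
        async with aiohttp.ClientSession() as session:
            return await asyncio.gather(
                *(polite_fetch(semaphore, session, url) for url in urls)
            )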

It's not clear what the purpose of json_list or paused_loop_start is, or how get_max_product_number works. Maybe that function is slowing you down even further.

