I'm currently working on my first web scraping project and I need to scrape a lot of websites. With my current code a full run takes more than a day, but I need to rescan the same websites roughly every 5 days, so a full day of scanning is too long. From what I've read so far, my options are threading or an asynchronous requests library such as grequests. Here is my code:
import requests
from bs4 import BeautifulSoup


def sortiment(Supermarkt_link, Supermarkt_name):
    # Resume from the page number stored in progress.txt.
    with open("progress.txt", "r") as t:
        lines = t.readlines()
    paused_loop_start = int(lines[0])

    for e in range(paused_loop_start + 1, get_max_product_number(Supermarkt_link)):
        json_list = []
        # Fetch one listing page of the supermarket's product range.
        page = requests.get(f"https://www.supermarktcheck.de{Supermarkt_link}sortiment/?page={e}")
        soup = BeautifulSoup(page.content, "html.parser")
        products_list = soup.find_all("a", class_="h3", href=True)
        # Visit every product linked on the listing page.
        for single_product in products_list:
            product_string = single_product["href"]
            product_page = requests.get(f"https://www.supermarktcheck.de/{product_string}")
            # ... (the rest of the loop parses product_page and saves the data to the database)
This code scrapes a landing page and then scans each individual product page linked from it for information, which is then saved into a database.
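For reference, this is roughly what I think the threading option would look like, based on the concurrent.futures documentation. It is only a sketch I have not tried against my real code yet; fetch_page, scrape_pages, the URL list and the max_workers value are placeholder names I made up, not part of my script:

# Sketch of the threading approach I am considering (not my actual code).
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests
from bs4 import BeautifulSoup


def fetch_page(url):
    # Download one page and return its parsed soup.
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return BeautifulSoup(response.content, "html.parser")


def scrape_pages(urls, max_workers=10):
    # Download many pages concurrently instead of one after another.
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_url = {executor.submit(fetch_page, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except requests.RequestException as exc:
                print(f"{url} failed: {exc}")
    return results

My understanding is that the work here is almost entirely waiting on network I/O, so a thread pool should be able to overlap the requests.get calls even with the GIL. Would something like this be the right direction, or is grequests/asyncio the better fit?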
Comment: What does `get_max_product_number` do? Can you post the entire script?