
I have written a web scraping script using Selenium to crawl blog content from multiple URLs. The script processes URLs in batches of 1000 and uses multithreading with the ThreadPoolExecutor to improve performance. It also handles graceful termination with signal handling to save progress in case of interruptions.

Key Features of the Code:

  • Headless Chrome Driver: Configured for faster performance.
  • Blocking Media Files: Prevents loading unnecessary resources like images and videos.
  • Multithreading: Processes multiple URLs simultaneously to reduce execution time.
  • Progress Saving: Saves intermediate results to a CSV file during execution and before exiting.
  • Error Handling and Logging: Captures errors and logs details for debugging.

Issue:

Despite these optimizations, the execution time is still slower than expected when processing a large number of URLs. Each URL takes several seconds to fetch content, which adds up significantly for thousands of URLs.

Questions:

  1. How can I further reduce execution time for this multi-page crawling script?
  2. Are there any specific optimizations I can apply to improve Selenium's performance, especially when handling iframes and dynamic content?
```
import multiprocessing
from concurrent.futures import ThreadPoolExecutor, as_completed
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import pandas as pd
import logging
import random
import signal
import sys

# log
logging.basicConfig(filename='crawler.log', level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')

# chrome driver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")

def start_driver():
    driver = webdriver.Chrome(options=chrome_options)
    driver.execute_cdp_cmd('Network.enable', {})
    try:
        driver.execute_cdp_cmd('Network.setBlockedURLs', {
            "urls": ["*.png", "*.jpg", "*.jpeg", "*.gif", "*.webp",
                     "*.mp4", "*.avi", "*.mkv", "*.mov"]
        })
    except Exception as e:
        logging.error(f"Error setting blocked URLs: {e}")
    return driver

# scraping
def scrap_blog_content(url):
    driver = start_driver()
    try:
        driver.get(url)
        WebDriverWait(driver, 5).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "iframe"))
        )
        iframe = driver.find_element(By.CSS_SELECTOR, "iframe")
        driver.switch_to.frame(iframe)
        WebDriverWait(driver, 5).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "div.se-main-container"))
        )
        content = driver.find_element(By.CSS_SELECTOR, "div.se-main-container").text
        time.sleep(random.uniform(1, 5))
        return content
    except Exception as e:
        logging.error(f"Error while fetching content from {url}: {e}")
        return None
    finally:
        driver.quit()

# thread
def process_urls(urls):
    results = []
    with ThreadPoolExecutor(max_workers=8) as executor:
        future_to_url = {executor.submit(crawl_blog_content, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                content = future.result()
                if content:
                    results.append((url, content))
                    logging.info(f"Successfully crawled: {url}")
            except Exception as exc:
                logging.error(f"Error fetching {url}: {exc}")
    return results

# result
global_results = []
output_file = 'contents_202101.csv'

# temp save
def save_progress():
    if global_results:
        temp_df = pd.DataFrame(global_results, columns=['URL', 'Content'])
        temp_df.to_csv(output_file, index=False)
        logging.info(f"Progress saved with {len(global_results)} entries.")

# exit
def signal_handler(sig, frame):
    logging.info("Termination signal received. Saving progress...")
    save_progress()
    sys.exit(0)

signal.signal(signal.SIGINT, signal_handler)
signal.signal(signal.SIGTERM, signal_handler)

if __name__ == "__main__":
    input_file = 'url_202101.csv'
    urls_df = pd.read_csv(input_file)
    urls = urls_df['URL'].tolist()

    batch_size = 1000

    # batch
    url_chunks = [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]

    for idx, chunk in enumerate(url_chunks):
        logging.info(f"Processing batch {idx + 1}/{len(url_chunks)}")
        results = process_urls(chunk)
        global_results.extend(results)
        save_progress()
        logging.info(f"Batch {idx + 1} saved with {len(global_results)} entries.")

    save_progress()
    logging.info(f"Final results saved to {output_file} with {len(global_results)} entries.")
```

    1 Answer


    Performance

    I will skip over the usual suggestions (adding docstrings to the module and functions, using type hinting, and adding comments where they would be useful to the reader) and proceed directly to your issue with performance.

    You are doing a couple of things that are hurting performance:

    1. You are taking the URLs you want to process and breaking them up into batches of 1000 that you then submit to process_urls. Each submission creates (or re-creates) the multithreading pool. Creating threads is less expensive than creating processes, so I can't promise that restructuring your code to reuse a single pool would by itself make a huge impact on performance. But for a suggestion I will be making below, having a single, reusable pool is required for best performance. Is there a particular reason why you are submitting the URLs in batches at all? Even if there is, I see no reason why you still cannot use a single pool.
    2. Each URL you submit creates a new Chrome driver. Your submit call references a function crawl_blog_content, which is undefined. I suspect this is supposed to be scrap_blog_content, which is defined (by the way, scrape_blog_content would be a better name, since the verb scrap means to get rid of, and that is certainly not what you want to do). Every time you create a new Chrome driver, a new process is spawned, which is expensive, and the driver has to execute initialization code before it can take your requests. It would be ideal if we could reuse drivers. So if you are creating a pool of 8 threads, you only need 8 reusable drivers, i.e. one per thread.

    The way to achieve a single, reusable driver per thread is to use a pool initializer function that is invoked once for each thread in the pool before it starts processing submitted tasks. This function creates a Chrome driver and stores it in thread-local storage, which is unique to each thread. The only complication is that when all submitted tasks have completed and the pool is terminated, we would like to call quit on these drivers so that they are shut down instead of lying around even after your script terminates. The way to do that is to enclose each driver in a wrapper class that defines a __del__ method that "quits" the driver when the wrapper is garbage collected, which occurs when the thread-local storage is garbage collected, which in turn occurs when the thread pool is terminated.

    Here is the basic code:

```
import threading
...

class DriverWrapper:
    def __init__(self):
        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_argument("--headless")
        chrome_options.add_argument("--no-sandbox")
        chrome_options.add_argument("--disable-dev-shm-usage")
        self.driver = webdriver.Chrome(options=chrome_options)
        self.driver.execute_cdp_cmd('Network.enable', {})
        try:
            self.driver.execute_cdp_cmd('Network.setBlockedURLs', {
                "urls": ["*.png", "*.jpg", "*.jpeg", "*.gif", "*.webp",
                         "*.mp4", "*.avi", "*.mkv", "*.mov"]
            })
        except Exception as e:
            logging.error(f"Error setting blocked URLs: {e}")

    def __del__(self):
        self.driver.quit()


thread_local = threading.local()

def init_pool():
    thread_local.driver_wrapper = DriverWrapper()

def get_driver():
    return thread_local.driver_wrapper.driver
```

    So process_urls becomes:

```
def process_urls(urls):
    results = []
    # Specify a pool initializer:
    with ThreadPoolExecutor(max_workers=8, initializer=init_pool) as executor:
        # Submit scrap_blog_content (the original crawl_blog_content was undefined):
        future_to_url = {executor.submit(scrap_blog_content, url): url for url in urls}
        ...
```

    scrap_blog_content now becomes:

```
def scrap_blog_content(url):
    # The start_driver function is no longer used; the code has been
    # moved to init_pool:
    driver = get_driver()  # Get the driver from thread-local storage
    try:
        driver.get(url)
        ...
    except Exception as e:
        logging.error(f"Error while fetching content from {url}: {e}")
        return None
    # The finally block that quits the driver has been removed
```

    Note that there is no longer a call to driver.quit() in the above function.

    Finally, if you can, do not batch the URLs; we would like to invoke process_urls only once so that we do not have to re-create the thread pool and thus the Chrome drivers. But if you must create batches, create the pool once in your if __name__ == "__main__": block and pass it to process_urls, as sketched below.
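
    Here is a rough sketch of that restructuring, assuming process_urls is changed to accept the executor as its first parameter (a name I chose; the remaining names come from your script):

```
def process_urls(executor, urls):
    results = []
    future_to_url = {executor.submit(scrap_blog_content, url): url for url in urls}
    for future in as_completed(future_to_url):
        url = future_to_url[future]
        try:
            content = future.result()
            if content:
                results.append((url, content))
                logging.info(f"Successfully crawled: {url}")
        except Exception as exc:
            logging.error(f"Error fetching {url}: {exc}")
    return results

if __name__ == "__main__":
    ...  # read the URLs and build url_chunks as before
    # Create the pool (and therefore the per-thread drivers) exactly once:
    with ThreadPoolExecutor(max_workers=8, initializer=init_pool) as executor:
        for idx, chunk in enumerate(url_chunks):
            logging.info(f"Processing batch {idx + 1}/{len(url_chunks)}")
            global_results.extend(process_urls(executor, chunk))
            save_progress()
    save_progress()
    logging.info(f"Final results saved to {output_file} with {len(global_results)} entries.")
```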

    Question

    Why do you have a call to time.sleep(random.uniform(1, 5)) in scrap_blog_content?

    • time.sleep could be to comply with robots.txt/scraping rules and be generally more polite
      – C.Nivs, Dec 27, 2024 at 13:56
    • @C.Nivs If a robots.txt file were present and had a Crawl-delay value, why not just use that? Sleeping unnecessarily is certainly not going to help performance. Also, this sleeping is done for each invocation of scrap_blog_content after the URL is fetched. What if the URL is either the only one or the last one for a given website? This after-the-fact sleeping is totally unnecessary.
      – Booboo, Dec 27, 2024 at 15:07
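
    As a side note on the Crawl-delay suggestion above, here is a minimal sketch using the standard library's urllib.robotparser; the robots.txt URL is a hypothetical placeholder:

```
import time
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://blog.example.com/robots.txt")  # hypothetical host
rp.read()
# crawl_delay() returns the Crawl-delay value for the given user agent,
# or None if robots.txt does not specify one.
delay = rp.crawl_delay("*")
if delay is not None:
    time.sleep(delay)
```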
