
This is a small web scraping project I made in 2 hours that targets the website remote.co. I am looking for ways to improve my code. I know about the inconsistency between the WebDriverWait and time.sleep() waits, but when I used WebDriverWait to wait until the load_more button was clickable, Selenium crashed my webdriver window and continuously spammed my terminal with 20-30 lines of seemingly useless text.

    import scrapy
    from selenium import webdriver
    from selenium.common.exceptions import ElementNotInteractableException
    from selenium.common.exceptions import NoSuchElementException
    from selenium.common.exceptions import ElementClickInterceptedException
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from time import sleep


    class ScrapeRemote(scrapy.Spider):
        name = 'jobs'
        job_title = input('Enter your desired position: ').replace(' ', '+')
        start_urls = [f'https://remote.co/remote-jobs/search/?search_keywords={job_title}']

        def __init__(self):
            self.driver = webdriver.Chrome(r'C:\Users\leagu\chromedriver.exe')

        def parse(self, response):
            self.driver.get(response.url)

            # Wait for the "load more" button to become visible before clicking through the listings
            try:
                load_more = WebDriverWait(self.driver, 10).until(
                    EC.visibility_of_element_located(
                        (By.XPATH, '/html/body/main/div[2]/div/div[1]/div[3]/div/div/a')
                    )
                )
            except TimeoutException:
                self.log("Timeout - Couldn't load the page!")

            # Keep clicking "load more" until it can no longer be clicked,
            # dismissing the opt-in pop-up if it gets in the way
            while True:
                try:
                    sleep(1.5)
                    load_more = self.driver.find_element_by_css_selector('a.load_more_jobs')
                    load_more.click()
                except (ElementNotInteractableException, ElementClickInterceptedException):
                    try:
                        close_button = WebDriverWait(self.driver, 6).until(
                            EC.element_to_be_clickable(
                                (By.CSS_SELECTOR, '#om-oqulaezshgjig4mgnmcn-optin > div > button')
                            )
                        )
                        close_button.click()
                    except TimeoutException:
                        self.log('Reached Bottom Of The Page!')
                        break

            # Parse the fully loaded page with scrapy selectors
            selector = scrapy.selector.Selector(text=self.driver.page_source)
            listings = selector.css('li.job_listing').getall()

            for listing in listings:
                selector = scrapy.selector.Selector(text=listing)

                position = selector.css('div.position h3::text').get()
                company = selector.css('div.company strong::text').get()
                more_information = selector.css('a::attr(href)').get()

                yield {
                    'position': position,
                    'company': company,
                    'more_information': more_information
                }

            self.driver.close()

    1 Answer


    Combined imports

        from selenium.common.exceptions import ElementNotInteractableException
        from selenium.common.exceptions import NoSuchElementException
        from selenium.common.exceptions import ElementClickInterceptedException
        from selenium.common.exceptions import TimeoutException

    should be

        from selenium.common.exceptions import (
            ElementNotInteractableException,
            NoSuchElementException,
            ElementClickInterceptedException,
            TimeoutException,
        )

    Input-in-static

    This:

    job_title = input('Enter your desired position: ').replace(' ', '+') 

    creeps me out. I don't know a lot about scrapy, but see if you can initialize job_title in the constructor instead of as a static. What if this class were to be imported once and used twice, each with a different job title?
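
    One way this could look is with scrapy's spider arguments, which are passed to the constructor when you run scrapy crawl with -a. Treat this as a sketch rather than the definitive fix; the exact wiring is up to you:

        class ScrapeRemote(scrapy.Spider):
            name = 'jobs'

            def __init__(self, job_title='', *args, **kwargs):
                super().__init__(*args, **kwargs)
                # supplied per run, e.g.  scrapy crawl jobs -a job_title="python developer"
                self.job_title = job_title.replace(' ', '+')
                self.start_urls = [
                    f'https://remote.co/remote-jobs/search/?search_keywords={self.job_title}'
                ]

    That way two instances with two different job titles never step on each other.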

    Hard-coded paths

    This:

    'C:\Users\leagu\chromedriver.exe' 

    should be pulled out into a constant, or better yet, an environment variable, command-line argument or configuration file parameter. Surely a user of yours who downloads your script will not be named leagu.
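
    A minimal sketch, assuming an environment variable; the name CHROMEDRIVER_PATH is mine, pick whatever fits your setup:

        import os

        # falls back to whatever "chromedriver" resolves to on the PATH
        CHROMEDRIVER_PATH = os.environ.get('CHROMEDRIVER_PATH', 'chromedriver')

        # later, in __init__
        self.driver = webdriver.Chrome(CHROMEDRIVER_PATH)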

    XPath

    /html/body/main/div[2]/div/div[1]/div[3]/div/div/a 

    is extremely fragile and opaque. I loaded the remote.co search results, and a better selector - mind you, this is CSS and not XPath - is

    div.card > div.card-body > div.card > div.card-body > a.card 

    You should not start from the root element, and you should attempt to use classes and IDs where possible. This markup is kind of a mess and so meaningful paths are hard to form.
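
    Plugged into your existing wait it would look something like this (the selector reflects the markup as I saw it and may need adjusting):

        load_more = WebDriverWait(self.driver, 10).until(
            EC.visibility_of_element_located((
                By.CSS_SELECTOR,
                'div.card > div.card-body > div.card > div.card-body > a.card'
            ))
        )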

    Swallowing exceptions

    You do this:

        except TimeoutException:
            self.log("Timeout - Couldn't load the page!")

    but then continue with the rest of the method? Would you not want to re-throw, or at least return?
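
    For example, re-raising keeps the traceback and stops parse() from scraping a page that never loaded; returning would be the quieter option. I've reused your a.load_more_jobs selector here:

        try:
            load_more = WebDriverWait(self.driver, 10).until(
                EC.visibility_of_element_located((By.CSS_SELECTOR, 'a.load_more_jobs'))
            )
        except TimeoutException:
            self.log("Timeout - Couldn't load the page!")
            raise  # or `return`, if a timeout simply means "nothing to scrape"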

    Non-guaranteed close

    This:

    self.driver.close() 

    will be skipped if there is any uncaught exception. First of all, I don't think the driver should be closed in parse, or else the class effectively can only support one invocation of parse. Implement __enter__ and __exit__, and call driver.close() in __exit__. Have the instantiator of ScrapeRemote use it in a with-block.
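
    A rough outline of that shape, reusing the CHROMEDRIVER_PATH constant from above; how cleanly it fits scrapy's own spider lifecycle is something you would need to check:

        class ScrapeRemote(scrapy.Spider):
            name = 'jobs'

            def __enter__(self):
                self.driver = webdriver.Chrome(CHROMEDRIVER_PATH)
                return self

            def __exit__(self, exc_type, exc_value, traceback):
                # runs even if parse() raised, so the browser is always shut down
                self.driver.close()

    and whoever drives the spider writes:

        with ScrapeRemote() as spider:
            pass  # drive the crawl here; driver.close() happens on the way out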

    • Thank you for dedicating your time to my code. Firstly, I don't understand the term 're-throw an exception'. I redid my code and now it throws an error on the first timeout, but I guess this isn't what you referred to. Secondly, I don't know if it is possible to get the whole object inside a with statement when scrapy is calling the spider from the command line and doing a lot of work 'under the hood'. Here is a link to my refined code: link – Commented May 29, 2020 at 16:12
    • I don't understand the term 're-throw an exception' - If you write a bare raise within an except block, it re-raises the original exception. – Commented May 29, 2020 at 16:53
