
This is a small web scraping project I made in 2 hours that targets the website remote.co. I am looking for ways to improve my code. I know about the inconsistency between the WebDriverWait and time.sleep() waits, but when I used WebDriverWait to wait until the load_more button was clickable, Selenium crashed my webdriver window and continuously spammed my terminal with 20-30 lines of seemingly useless text.

    import scrapy
    from selenium import webdriver
    from selenium.common.exceptions import ElementNotInteractableException
    from selenium.common.exceptions import NoSuchElementException
    from selenium.common.exceptions import ElementClickInterceptedException
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from time import sleep


    class ScrapeRemote(scrapy.Spider):
        name = 'jobs'
        job_title = input('Enter your desired position: ').replace(' ', '+')
        start_urls = [f'https://remote.co/remote-jobs/search/?search_keywords={job_title}']

        def __init__(self):
            self.driver = webdriver.Chrome(r'C:\Users\leagu\chromedriver.exe')

        def parse(self, response):
            self.driver.get(response.url)

            # Wait for the "load more" button to become visible before clicking through the listings
            try:
                load_more = WebDriverWait(self.driver, 10).until(
                    EC.visibility_of_element_located(
                        (By.XPATH, '/html/body/main/div[2]/div/div[1]/div[3]/div/div/a')
                    )
                )
            except TimeoutException:
                self.log("Timeout - Couldn't load the page!")

            # Keep clicking "load more" until it can no longer be clicked,
            # dismissing the opt-in pop-up if it gets in the way
            while True:
                try:
                    sleep(1.5)
                    load_more = self.driver.find_element_by_css_selector('a.load_more_jobs')
                    load_more.click()
                except (ElementNotInteractableException, ElementClickInterceptedException):
                    try:
                        close_button = WebDriverWait(self.driver, 6).until(
                            EC.element_to_be_clickable(
                                (By.CSS_SELECTOR, '#om-oqulaezshgjig4mgnmcn-optin > div > button')
                            )
                        )
                        close_button.click()
                    except TimeoutException:
                        self.log('Reached Bottom Of The Page!')
                        break

            # Parse the fully loaded page with scrapy selectors
            selector = scrapy.selector.Selector(text=self.driver.page_source)
            listings = selector.css('li.job_listing').getall()

            for listing in listings:
                selector = scrapy.selector.Selector(text=listing)

                position = selector.css('div.position h3::text').get()
                company = selector.css('div.company strong::text').get()
                more_information = selector.css('a::attr(href)').get()

                yield {
                    'position': position,
                    'company': company,
                    'more_information': more_information
                }

            self.driver.close()

    1 Answer


    Combined imports

        from selenium.common.exceptions import ElementNotInteractableException
        from selenium.common.exceptions import NoSuchElementException
        from selenium.common.exceptions import ElementClickInterceptedException
        from selenium.common.exceptions import TimeoutException

    should be

        from selenium.common.exceptions import (
            ElementNotInteractableException,
            NoSuchElementException,
            ElementClickInterceptedException,
            TimeoutException,
        )

    Input-in-static

    This:

    job_title = input('Enter your desired position: ').replace(' ', '+') 

    creeps me out. I don't know a lot about scrapy, but see if you can initialize job_title in the constructor instead of as a static. What if this class were to be imported once and used twice, each with a different job title?
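
    One way this could look is with scrapy's spider arguments, which are passed to the constructor when you run scrapy crawl with -a. Treat this as a sketch rather than the definitive fix; the exact wiring is up to you:

        class ScrapeRemote(scrapy.Spider):
            name = 'jobs'

            def __init__(self, job_title='', *args, **kwargs):
                super().__init__(*args, **kwargs)
                # supplied per run, e.g.  scrapy crawl jobs -a job_title="python developer"
                self.job_title = job_title.replace(' ', '+')
                self.start_urls = [
                    f'https://remote.co/remote-jobs/search/?search_keywords={self.job_title}'
                ]

    That way two instances with two different job titles never step on each other.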

    Hard-coded paths

    This:

    'C:\Users\leagu\chromedriver.exe' 

    should be pulled out into a constant, or better yet, an environment variable, command-line argument or configuration file parameter. Surely a user of yours who downloads your script will not be named leagu.
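
    A minimal sketch, assuming an environment variable; the name CHROMEDRIVER_PATH is mine, pick whatever fits your setup:

        import os

        # falls back to whatever "chromedriver" resolves to on the PATH
        CHROMEDRIVER_PATH = os.environ.get('CHROMEDRIVER_PATH', 'chromedriver')

        # later, in __init__
        self.driver = webdriver.Chrome(CHROMEDRIVER_PATH)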

    XPath

    /html/body/main/div[2]/div/div[1]/div[3]/div/div/a 

    is extremely fragile and opaque. I loaded the remote.co search results, and a better selector - mind you, this is CSS and not XPath - is

    div.card > div.card-body > div.card > div.card-body > a.card 

    You should not start from the root element, and you should attempt to use classes and IDs where possible. This markup is kind of a mess and so meaningful paths are hard to form.
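
    Plugged into your existing wait it would look something like this (the selector reflects the markup as I saw it and may need adjusting):

        load_more = WebDriverWait(self.driver, 10).until(
            EC.visibility_of_element_located((
                By.CSS_SELECTOR,
                'div.card > div.card-body > div.card > div.card-body > a.card'
            ))
        )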

    Swallowing exceptions

    You do this:

        except TimeoutException:
            self.log("Timeout - Couldn't load the page!")

    but then continue with the rest of the method? Would you not want to re-throw, or at least return?
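
    For example, re-raising keeps the traceback and stops parse() from scraping a page that never loaded; returning would be the quieter option. I've reused your a.load_more_jobs selector here:

        try:
            load_more = WebDriverWait(self.driver, 10).until(
                EC.visibility_of_element_located((By.CSS_SELECTOR, 'a.load_more_jobs'))
            )
        except TimeoutException:
            self.log("Timeout - Couldn't load the page!")
            raise  # or `return`, if a timeout simply means "nothing to scrape"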

    Non-guaranteed close

    This:

    self.driver.close() 

    will be skipped if there is any uncaught exception. First of all, I don't think the driver should be closed in parse, or else the class effectively can only support one invocation of parse. Implement __enter__ and __exit__, and call driver.close() in __exit__. Have the instantiator of ScrapeRemote use it in a with-block.
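
    A rough outline of that shape, reusing the CHROMEDRIVER_PATH constant from above; how cleanly it fits scrapy's own spider lifecycle is something you would need to check:

        class ScrapeRemote(scrapy.Spider):
            name = 'jobs'

            def __enter__(self):
                self.driver = webdriver.Chrome(CHROMEDRIVER_PATH)
                return self

            def __exit__(self, exc_type, exc_value, traceback):
                # runs even if parse() raised, so the browser is always shut down
                self.driver.close()

    and whoever drives the spider writes:

        with ScrapeRemote() as spider:
            pass  # drive the crawl here; driver.close() happens on the way out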

    • Thank you for dedicating your time to my code. Firstly, I don't understand the term 're-throw an exception'. I redid my code and now it throws an error on the first timeout, but I guess this isn't what you referred to. Secondly, I don't know if it is possible to get the whole object inside a with statement when scrapy is calling the spider from the command line and doing a lot of work 'under the hood'. Here is a link to my refined code: link – Commented May 29, 2020 at 16:12
    • I don't understand the term 're-throw an exception' - If you write a bare raise within an except block, it re-raises the original exception. – Commented May 29, 2020 at 16:53
