I'm attempting to scrape data.cdc.gov for their COVID-19 information on cases and deaths.
The problem is that the code is very inefficient: it takes an extremely long time to run. The CDC's XML file doesn't work for me at all, and the API seems incomplete. I need all of the COVID-19 data starting from January 22, 2020 up to the present, but the API doesn't appear to return rows for all of those days. Can someone help me make this code more efficient so that I can extract the information I need more seamlessly?
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time

options = Options()
options.add_argument('--no-sandbox')

url = 'https://data.cdc.gov/Case-Surveillance/United-States-COVID-19-Cases-and-Deaths-by-State-o/9mfq-cb36/data'
driver = webdriver.Chrome(executable_path=r"C:\Program Files (x86)\chromedriver.exe", options=options)
driver.implicitly_wait(10)
driver.get(url)

# Accumulate rows across all pages (keeping this inside the loop would
# discard everything scraped from previous pages).
covid_fin = []
while True:
    tables = driver.find_elements_by_xpath("//div[contains(@class, 'socrata-table frozen-columns')]")
    for table in tables:
        headers = []
        for head in table.find_elements_by_xpath('//*[@id="renderTypeContainer"]/div[4]/div[2]/div/div[4]/div[1]/div/table/thead/tr/th'):
            headers.append(head.text)
        for row in table.find_elements_by_xpath('//*[@id="renderTypeContainer"]/div[4]/div[2]/div/div[4]/div[1]/div/table/tbody/tr'):
            covid = [col.text for col in row.find_elements_by_xpath("./*[name()='td']")]
            if covid:
                covid_fin.append(dict(zip(headers, covid)))
    # Advance to the next page; stop when the "next" button is no longer clickable.
    try:
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CLASS_NAME, 'pager-button-next'))).click()
        time.sleep(5)
    except Exception:
        break
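For what it's worth, data.cdc.gov is a Socrata portal, so the dataset behind this page is normally also exposed through the SODA API (typically at `https://data.cdc.gov/resource/9mfq-cb36.json`). SODA endpoints return only a limited number of rows per request by default, which can make the API *look* incomplete; paging with the `$limit`/`$offset` parameters usually retrieves the full dataset far faster than driving the browser. Below is a minimal sketch of that approach using only the standard library; the endpoint URL and the `submission_date` sort column are assumptions about this particular dataset, not something verified here:

```python
import json
import urllib.parse
import urllib.request

# Assumed SODA endpoint for the dataset shown on the scraped page.
API_URL = "https://data.cdc.gov/resource/9mfq-cb36.json"

def fetch_all(url=API_URL, page_size=1000, fetch=None):
    """Page through a SODA endpoint until an empty page comes back.

    `fetch` can be overridden (e.g. for testing); by default it performs
    an HTTP GET with the given query parameters and parses the JSON body.
    """
    if fetch is None:
        def fetch(params):
            query = urllib.parse.urlencode(params)
            with urllib.request.urlopen(f"{url}?{query}", timeout=30) as resp:
                return json.load(resp)

    rows, offset = [], 0
    while True:
        page = fetch({
            "$limit": page_size,
            "$offset": offset,
            "$order": "submission_date",  # assumed column name; keeps paging stable
        })
        if not page:  # an empty page means we've read everything
            break
        rows.extend(page)
        offset += page_size
    return rows
```

A full-dataset CSV export (the portal's "Export" button) may also be an option if you only need a one-off download rather than programmatic access.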