
I'm attempting to scrape data.cdc.gov for their COVID-19 information on cases and deaths.

The problem I'm having is that the code is very inefficient: it takes an extremely long time to run. For some reason the CDC's XML export doesn't work at all for me, and the API appears incomplete. I need all of the COVID-19 data starting from January 22, 2020, up until now, but the API doesn't seem to contain records for all of those days. Could someone help me make this code more efficient so that I can extract the information I need?

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time

options = Options()
options.add_argument('--no-sandbox')
url = 'https://data.cdc.gov/Case-Surveillance/United-States-COVID-19-Cases-and-Deaths-by-State-o/9mfq-cb36/data'
driver = webdriver.Chrome(executable_path=r"C:\Program Files (x86)\chromedriver.exe", options=options)
driver.implicitly_wait(10)
driver.get(url)

covid_fin = []  # accumulate across pages (initialising this inside the loop would discard earlier pages)
while True:
    rows = driver.find_elements_by_xpath("//div[contains(@class, 'socrata-table frozen-columns')]")
    for table in rows:
        # read the column headers, then build one dict per table row
        headers = []
        for head in table.find_elements_by_xpath('//*[@id="renderTypeContainer"]/div[4]/div[2]/div/div[4]/div[1]/div/table/thead/tr/th'):
            headers.append(head.text)
        for row in table.find_elements_by_xpath('//*[@id="renderTypeContainer"]/div[4]/div[2]/div/div[4]/div[1]/div/table/tbody/tr'):
            covid = []
            for col in row.find_elements_by_xpath("./*[name()='td']"):
                covid.append(col.text)
            if covid:
                covid_dict = {headers[i]: covid[i] for i in range(len(headers))}
                covid_fin.append(covid_dict)
    try:
        # advance to the next page; stop once the Next button is no longer clickable
        WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.CLASS_NAME, 'pager-button-next'))).click()
        time.sleep(5)
    except Exception:
        break
  • @Mast okay, so I just uninstalled version 3.9.1, downloaded 3.10.1, and restarted my computer. Lastly I just reopened Jupyter and reran my code. And it's working the same way. Do you have any other tips? Thanks again for reminding me to update in any event.
    – Nini
    Commented Dec 9, 2021 at 17:00
  • You're running this in a Jupyter notebook? All in the same code-block?
    – Mast
    Commented Dec 9, 2021 at 17:01
  • @Mast yes I am.
    – Nini
    Commented Dec 9, 2021 at 17:06
  • @Mast I just separated the loop out of the first part of the code into a different cell and ran it. It's still running slowly.
    – Nini
    Commented Dec 9, 2021 at 17:12
  • You might be interested in the process, with Python code included, described at Federal COVID Data in a Single Stream, as there is some fairly complex nuance to working with the CDC data and matching it up to other federal datasets, like COVID testing and hospitalizations.
    Commented Dec 10, 2021 at 3:09

2 Answers


In my opinion, Selenium isn't the right tool for web scraping much (probably most) of the time. It turns out that even when a website renders its content with JavaScript, you can usually figure out what that JS is fetching by using your browser's network inspector.

If you open the inspector (Ctrl+Shift+I in Chrome) and then load the initial URL, you'll see all the requests, with a preview pane to the right. One trick is to just click through the requests, looking at each preview, until you see something that looks like the data you want. The first "data" request turns out not to contain any data.

[screenshot: network inspector request list, with the first "data" request's preview showing no data]

If you go down a little ways, you'll find the data.

[screenshot: network inspector with a request whose preview shows the JSON data]

Once you find the data, go back to the Headers tab of the inspector, where you can get the URL behind that request.

[screenshot: the Headers tab showing the request URL]

Let's copy and paste that into a script:

dataurl="https://data.cdc.gov/api/id/9mfq-cb36.json?$query=select%20*%2C%20%3Aid%20limit%20100" 

Now, on the site, let's click Next and see what happens (I had already done this before taking the screenshots, so those requests are visible above). If you collect the URLs from those requests, you'll start to see a pattern...

dataurl= "https://data.cdc.gov/api/id/9mfq-cb36.json?$query=select%20*%2C%20%3Aid%20limit%20100" dataurl2="https://data.cdc.gov/api/id/9mfq-cb36.json?$query=select%20*%2C%20%3Aid%20offset%20100%20limit%20100" dataurl3="https://data.cdc.gov/api/id/9mfq-cb36.json?$query=select%20*%2C%20%3Aid%20offset%20200%20limit%20100" 

In the first one, there is a select with some URL-encoded gibberish followed by a limit of 100 (the decoded query above makes this easier to see). In the next ones, that select gibberish and the limit of 100 stay the same, but now there's an offset that grows by 100 per page. Now we can just do...

import pandas as pd
import requests

pages = []
i = 0
while True:
    # the first page has no offset; after that, the offset grows by 100 per page
    offset = "" if i == 0 else f"%20offset%20{i}00"
    url = f"https://data.cdc.gov/api/id/9mfq-cb36.json?$query=select%20*%2C%20%3Aid{offset}%20limit%20100"
    temp = pd.read_json(requests.get(url).text)
    if temp.shape[0] > 0:
        pages.append(temp)  # reuse the response rather than fetching the same page twice
        i += 1
    else:
        break               # an empty page means we've read everything

df = pd.concat(pages)

On my computer, this ran in about 4 minutes.
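Most of that runtime is the 100-row page size, which means thousands of round trips. If the endpoint accepts a larger page (Socrata's documented SODA API generally serves the same dataset id at a /resource/ endpoint with $limit and $offset parameters, usually up to 50,000 rows per request; that's an assumption about this dataset, not something verified above), the same loop needs only a handful of requests. A sketch under that assumption:

import pandas as pd
import requests

# Hypothetical speedup: page through the SODA /resource/ endpoint
# for the same dataset id, 50,000 rows at a time instead of 100.
base = "https://data.cdc.gov/resource/9mfq-cb36.json"
page_size = 50_000
frames = []
offset = 0
while True:
    resp = requests.get(base, params={"$limit": page_size, "$offset": offset})
    resp.raise_for_status()
    rows = resp.json()
    if not rows:            # an empty page means we've read everything
        break
    frames.append(pd.DataFrame(rows))
    offset += page_size

df = pd.concat(frames, ignore_index=True)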

  • This worked. And I'm not sure how I'm supposed to feel about this. I've been working so hard on this. Your code is so much shorter. Thank you.
    – Nini
    Commented Dec 9, 2021 at 20:20
  • Interesting. Do you think the site could be vulnerable to SQL injection? o:)
    – Kate
    Commented Dec 9, 2021 at 20:42
  • @Dean MacGregor, if you don't mind, could you please share a link to some documentation containing information about your process?
    – Nini
    Commented Dec 9, 2021 at 22:01
  • "Guess, check, google, repeat." I have nothing more substantive than that.
    Commented Dec 9, 2021 at 22:05
  • Okay. Thanks for your help.
    – Nini
    Commented Dec 10, 2021 at 1:24

Don't scrape. Delete all of your code. Go to that page and download one of the export types. XML is richer and has more fields, but CSV is more compact.
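If you do want the download to live in code (say, for a notebook or a pipeline), the CSV export can be read straight into pandas. The URL below follows Socrata's usual export pattern; treat it as an assumption and copy the real link from the page's Export button if it differs:

import pandas as pd

# Assumed export URL following Socrata's standard rows.csv pattern;
# verify it against the Export button on the dataset page.
csv_url = "https://data.cdc.gov/api/views/9mfq-cb36/rows.csv?accessType=DOWNLOAD"
df = pd.read_csv(csv_url)
print(df.shape)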

  • Ha. Until I saw your answer I assumed the export button didn't work for some reason.
    Commented Dec 9, 2021 at 19:36
  • @Reinderien I'm scraping to show my capability to do so for my portfolio. And the XML is corrupted.
    – Nini
    Commented Dec 9, 2021 at 19:40
  • In my opinion, scraping something that is a poor use case for scraping is not a great portfolio entry. There are plenty of other sites that require scraping that would be a better fit.
    Commented Dec 9, 2021 at 19:47
