
I'm attempting to scrape data.cdc.gov for their COVID-19 information on cases and deaths.

The problem I'm having is that the code is very inefficient: it takes an extremely long time to run. For some reason the CDC's XML export doesn't work at all for me, and the API appears incomplete. I need all of the COVID-19 data starting from January 22, 2020, up until now, but the API doesn't seem to contain records for all of those days. Could someone help me make this code more efficient so that I can extract the information I need?

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time

options = Options()
options.add_argument('--no-sandbox')
url = 'https://data.cdc.gov/Case-Surveillance/United-States-COVID-19-Cases-and-Deaths-by-State-o/9mfq-cb36/data'
driver = webdriver.Chrome(executable_path=r"C:\Program Files (x86)\chromedriver.exe", options=options)
driver.implicitly_wait(10)
driver.get(url)

covid_fin = []  # accumulate across pages (initialising this inside the loop would discard earlier pages)
while True:
    rows = driver.find_elements_by_xpath("//div[contains(@class, 'socrata-table frozen-columns')]")
    for table in rows:
        # read the column headers, then build one dict per table row
        headers = []
        for head in table.find_elements_by_xpath('//*[@id="renderTypeContainer"]/div[4]/div[2]/div/div[4]/div[1]/div/table/thead/tr/th'):
            headers.append(head.text)
        for row in table.find_elements_by_xpath('//*[@id="renderTypeContainer"]/div[4]/div[2]/div/div[4]/div[1]/div/table/tbody/tr'):
            covid = []
            for col in row.find_elements_by_xpath("./*[name()='td']"):
                covid.append(col.text)
            if covid:
                covid_dict = {headers[i]: covid[i] for i in range(len(headers))}
                covid_fin.append(covid_dict)
    try:
        # advance to the next page; stop once the Next button is no longer clickable
        WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.CLASS_NAME, 'pager-button-next'))).click()
        time.sleep(5)
    except Exception:
        break
  • @Mast okay, so I just uninstalled version 3.9.1, downloaded 3.10.1, and restarted my computer. Lastly I just reopened Jupyter and reran my code. And it's working the same way. Do you have any other tips? Thanks again for reminding me to update in any event.
    – Nini
    Commented Dec 9, 2021 at 17:00
  • You're running this in a Jupyter notebook? All in the same code-block?
    – Mast
    Commented Dec 9, 2021 at 17:01
  • @Mast yes I am.
    – Nini
    Commented Dec 9, 2021 at 17:06
  • @Mast I just separated the loop out of the first part of the code into a different cell and ran it. It's still running slowly.
    – Nini
    Commented Dec 9, 2021 at 17:12
  • You might be interested in the process, with Python code included, described at Federal COVID Data in a Single Stream, as there is some fairly complex nuance to working with the CDC data and matching it up to other federal datasets, like COVID testing and hospitalizations.
    Commented Dec 10, 2021 at 3:09

2 Answers


In my opinion, Selenium isn't the right tool for web scraping much (probably most) of the time. It turns out that even when a website renders its content with JavaScript, you can usually figure out what that JS is fetching by using your browser's network inspector.

If you open the inspector (Ctrl+Shift+I in Chrome) and then load the initial URL, you'll see all the requests, with a preview pane to the right. One trick is to just click through the requests, looking at each preview, until you see something that looks like the data you want. The first "data" request turns out not to contain any data.

[screenshot: network inspector request list, with the first "data" request's preview showing no data]

If you go down a little ways, you'll find the data.

[screenshot: network inspector with a request whose preview shows the JSON data]

Once you find the data, go back to the Headers tab of the inspector, where you can get the URL behind that request.

[screenshot: the Headers tab showing the request URL]

Let's copy and paste that into a script:

dataurl="https://data.cdc.gov/api/id/9mfq-cb36.json?$query=select%20*%2C%20%3Aid%20limit%20100" 

Now, on the site, let's click Next and see what happens (I had already done this before taking the screenshots, so those requests are visible above). If you collect the URLs from those requests, you'll start to see a pattern...

dataurl= "https://data.cdc.gov/api/id/9mfq-cb36.json?$query=select%20*%2C%20%3Aid%20limit%20100" dataurl2="https://data.cdc.gov/api/id/9mfq-cb36.json?$query=select%20*%2C%20%3Aid%20offset%20100%20limit%20100" dataurl3="https://data.cdc.gov/api/id/9mfq-cb36.json?$query=select%20*%2C%20%3Aid%20offset%20200%20limit%20100" 

In the first one, there is a select with some URL-encoded gibberish followed by a limit of 100 (the decoded query above makes this easier to see). In the next ones, that select gibberish and the limit of 100 stay the same, but now there's an offset that grows by 100 per page. Now we can just do...

import pandas as pd
import requests

pages = []
i = 0
while True:
    # the first page has no offset; after that, the offset grows by 100 per page
    offset = "" if i == 0 else f"%20offset%20{i}00"
    url = f"https://data.cdc.gov/api/id/9mfq-cb36.json?$query=select%20*%2C%20%3Aid{offset}%20limit%20100"
    temp = pd.read_json(requests.get(url).text)
    if temp.shape[0] > 0:
        pages.append(temp)  # reuse the response rather than fetching the same page twice
        i += 1
    else:
        break               # an empty page means we've read everything

df = pd.concat(pages)

On my computer, this ran in about 4 minutes.
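Most of that runtime is the 100-row page size, which means thousands of round trips. If the endpoint accepts a larger page (Socrata's documented SODA API generally serves the same dataset id at a /resource/ endpoint with $limit and $offset parameters, usually up to 50,000 rows per request; that's an assumption about this dataset, not something verified above), the same loop needs only a handful of requests. A sketch under that assumption:

import pandas as pd
import requests

# Hypothetical speedup: page through the SODA /resource/ endpoint
# for the same dataset id, 50,000 rows at a time instead of 100.
base = "https://data.cdc.gov/resource/9mfq-cb36.json"
page_size = 50_000
frames = []
offset = 0
while True:
    resp = requests.get(base, params={"$limit": page_size, "$offset": offset})
    resp.raise_for_status()
    rows = resp.json()
    if not rows:            # an empty page means we've read everything
        break
    frames.append(pd.DataFrame(rows))
    offset += page_size

df = pd.concat(frames, ignore_index=True)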

  • This worked. And I'm not sure how I'm supposed to feel about this. I've been working so hard on this. Your code is so much shorter. Thank you.
    – Nini
    Commented Dec 9, 2021 at 20:20
  • Interesting. Do you think the site could be vulnerable to SQL injection? o:)
    – Kate
    Commented Dec 9, 2021 at 20:42
  • @Dean MacGregor, if you don't mind, could you please share a link to some documentation containing information about your process?
    – Nini
    Commented Dec 9, 2021 at 22:01
  • "Guess, check, google, repeat." I have nothing more substantive than that.
    Commented Dec 9, 2021 at 22:05
  • Okay. Thanks for your help.
    – Nini
    Commented Dec 10, 2021 at 1:24

Don't scrape. Delete all of your code. Go to that page and download one of the export types. XML is richer and has more fields, but CSV is more compact.
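If you do want the download to live in code (say, for a notebook or a pipeline), the CSV export can be read straight into pandas. The URL below follows Socrata's usual export pattern; treat it as an assumption and copy the real link from the page's Export button if it differs:

import pandas as pd

# Assumed export URL following Socrata's standard rows.csv pattern;
# verify it against the Export button on the dataset page.
csv_url = "https://data.cdc.gov/api/views/9mfq-cb36/rows.csv?accessType=DOWNLOAD"
df = pd.read_csv(csv_url)
print(df.shape)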

  • Ha. Until I saw your answer I assumed the export button didn't work for some reason.
    Commented Dec 9, 2021 at 19:36
  • @Reinderien I'm scraping to show my capability to do so for my portfolio. And the XML is corrupted.
    – Nini
    Commented Dec 9, 2021 at 19:40
  • In my opinion, scraping something that is a poor use case for scraping is not a great portfolio entry. There are plenty of other sites that require scraping that would be a better fit.
    Commented Dec 9, 2021 at 19:47
