
Summary: The code scrapes the website and collects the data into CSV files. It also downloads selected records that are available in PDF format. The details and the entire code are given below. The code has been tested several times and works fine; now I want to develop it further.

About me: I am not a professional programmer. I am learning Python out of interest, so I am at an amateur level. I am a stakeholder in the Indian criminal justice system and want to study the processes involved in it. The present code is written with this background, discussed more in the next section.

About the project: The code presented here is the first step in a larger project to analyse First Information Reports (FIRs) registered at different police stations; a later stage will examine the details of the downloaded FIRs. The script here is designed to collect the data. The targeted website is dynamic and sometimes behaves abnormally, so some additional steps are taken to make the code robust - for instance, a fresh page is opened every time the inputs change. Such measures increase the execution time but help ensure that the script does not fail.
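
To give a concrete picture of what such "additional steps" can look like, here is a minimal sketch (not taken from the project) of retrying a flaky step with an explicit wait before giving up; the table id is the one used by the module pasted further below.

```python
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as ec
from selenium.webdriver.support.ui import WebDriverWait


def wait_for_results(driver, attempts=3, timeout=10):
    """Return True once the results table appears, refreshing between attempts."""
    for _ in range(attempts):
        try:
            WebDriverWait(driver, timeout).until(
                ec.presence_of_element_located((By.ID, "ContentPlaceHolder1_gdvDeadBody")))
            return True
        except TimeoutException:
            # the page sometimes misbehaves; reload and try once more
            driver.refresh()
    return False
```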

The code: The following process is followed in the code -

  1. A main script that contains the flow and some constants.
  2. A module that contains functions as part of a class.
  3. Two loops - outer and inner.
  4. The outer "while" loop is based on dates.
  5. It sets the start and end dates using datetime.timedelta. It uses global variables such as the list of districts (used for the inner-loop iteration) and the website URL.
  6. The inner "for" loop iterates over the districts. It mostly uses the module dataCollection and its class EachDistrict. For each district it: opens the page, enters the dates and the district name, sets the view to 50 records per page, clicks search, reads the total number of records, identifies the table with the information, writes the data to CSV, looks for particular cases, downloads the PDFs for those selected cases, and goes to the next page if one is available; otherwise it changes the dates and starts the whole process again. (A skeleton of the two loops is sketched below.)
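
As a compact illustration of points 4-6, the two loops boil down to roughly this skeleton (simplified from the full script pasted further below; the district list is shortened here for illustration only):

```python
import datetime

all_districts = ['AHMEDNAGAR', 'AKOLA']      # shortened list, for illustration only
start = datetime.date(2023, 1, 1)
end = datetime.date(2023, 6, 30)

while start < end:                           # outer loop: 3-day date windows
    window_end = start + datetime.timedelta(2)
    from_date = start.strftime("%d%m%Y")
    to_date = window_end.strftime("%d%m%Y")
    for name in all_districts:               # inner loop: one district at a time
        # open the page, enter dates and district, set 50 records per page,
        # search, store the table as CSV, download selected FIR PDFs,
        # then turn pages until they run out
        pass
    start += datetime.timedelta(3)           # move on to the next window
```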

Further development: I would like to develop the code to improve performance and readability. I am working on the following aspects and would welcome suggestions for anything else that could be done beyond this list -

  1. refactor - for readability (also performance)
  2. use proxies
  3. use threads (for performance)
  4. consider using "if" instead of "try/except" (a sketch of this idea follows the list).
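
For point 4, one common way to avoid try/except around element lookups is find_elements (plural), which returns an empty list instead of raising NoSuchElementException. A minimal sketch, reusing the serial-number label id from the module below:

```python
from selenium.webdriver.common.by import By


def results_loaded(driver):
    """Check for the first serial-number label without catching exceptions."""
    labels = driver.find_elements(By.ID, "ContentPlaceHolder1_gdvDeadBody_lblSrNo_0")
    if labels:                   # an empty list means the element is not there (yet)
        return labels[0].text == "1"
    return False
```

Note that this only replaces the existence check; waiting for the page to finish loading would still need WebDriverWait or a polling loop.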

Actual code: The entire code, including the main script, the module, and subsequent updates, is available on GitHub; the main script and the module are also pasted here. In the main script there is a commented-out block for using proxies: when I enable these proxy settings, the site refuses access, and I am looking for a way to handle this error. I am working on the code continuously, so the version on GitHub is the current one.

The link for GitHub is here.
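
On the proxy problem mentioned above: a refused connection usually means the proxy itself is unreachable rather than anything Firefox-specific, so one option is to test the proxy before launching the browser and fall back to a direct connection if it fails. A minimal sketch of that idea, using the requests library (which is not part of this project):

```python
import requests


def proxy_works(ip, port, timeout=10):
    """Return True if a test request through the proxy succeeds."""
    proxies = {
        "http": f"http://{ip}:{port}",
        "https": f"http://{ip}:{port}",
    }
    try:
        response = requests.get("https://example.com", proxies=proxies, timeout=timeout)
        return response.ok
    except requests.RequestException:
        return False


# Only set the network.proxy.* preferences when proxy_works(proxy_ip, proxy_port) is True.
```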

""" Visit the website. enter details. take the data. """ import datetime import logging from pathlib import Path from selenium.webdriver.firefox import webdriver from selenium.webdriver.firefox.options import Options from selenium import webdriver from modules import dataCollection # constatnts main_url = r'https://citizen.mahapolice.gov.in/Citizen/MH/PublishedFIRs.aspx' all_districts = ['AHMEDNAGAR', 'AKOLA', 'AMRAVATI CITY', 'AMRAVATI RURAL', 'BEED', 'BHANDARA', 'BULDHANA', 'CHANDRAPUR', 'CHHATRAPATI SAMBHAJINAGAR CITY', 'CHHATRAPATI SAMBHAJINAGAR (RURAL)', 'DHARASHIV', 'DHULE', 'GADCHIROLI', 'GONDIA', 'HINGOLI', 'JALGAON', 'JALNA', 'KOLHAPUR', 'LATUR', 'Mira-Bhayandar, Vasai-Virar Police Commissioner', 'NAGPUR CITY', 'NAGPUR RURAL', 'NANDED', 'NANDURBAR', 'NASHIK CITY', 'NASHIK RURAL', 'NAVI MUMBAI', 'PALGHAR', 'PARBHANI', 'PIMPRI-CHINCHWAD', 'PUNE CITY', 'PUNE RURAL', 'RAIGAD', 'RAILWAY MUMBAI', 'RAILWAY NAGPUR', 'RAILWAY PUNE', 'RATNAGIRI', 'SANGLI', 'SATARA', 'SINDHUDURG', 'SOLAPUR CITY', 'SOLAPUR RURAL', 'THANE CITY', 'THANE RURAL', 'WARDHA', 'WASHIM', 'YAVATMAL'] def main(): # logging logging_file = "info.log" logging_dir = Path(f'/home/sangharsh/Documents/PoA/data/FIR/Year23/logging') logging_dir.mkdir(parents=True, exist_ok=True) logger = logging.getLogger(__name__) logging.basicConfig(filename=logging_dir / logging_file, format='%(name)s:: %(levelname)s:: %(asctime)s - %(message)s', level=logging.INFO) # create console handler and set level to info ch = logging.StreamHandler() ch.setLevel(logging.WARNING) logger.addHandler(ch) start = datetime.date(2023, 1, 1) end = datetime.date(2023, 6, 30) download_dir = Path(f'/home/sangharsh/Documents/PoA/' f'data/FIR/FIR_copies/{start}_{end}') download_dir.mkdir(parents=True, exist_ok=True) options = Options() options.set_preference("browser.download.panel.shown", False) options.set_preference("browser.download.manager.showWhenStarting", False) # profile.set_preference("browser.helperApps.neverAsk.openFile","application/pdf") options.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/pdf") options.set_preference("browser.download.folderList", 2) options.set_preference("browser.download.dir", str(download_dir)) # to go undetected options.set_preference("general.useragent.override", "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:82.0) " "Gecko/20100101 Firefox/82.0") options.set_preference("dom.webdriver.enabled", False) options.set_preference('useAutomationExtension', False) options.set_preference("pdfjs.disabled", True) # service = Service('C:\\BrowserDrivers\\geckodriver.exe') options.headless = True """options for proxy, but this is not working as if I set the proxy the connections is refused.proxy_ip = '46.4.96.137' proxy_port = '1080' options.set_preference("network.proxy.type", 1) options.set_preference( "network.proxy.http", str(proxy_ip)) options.set_preference("network.proxy.http_port", int(proxy_port)) options.set_preference("network.proxy.ssl", str(proxy_ip)) options.set_preference("network.proxy.ssl_port", str(proxy_port)) options.set_preference("network.proxy.ftp", str(proxy_ip)) options.set_preference( "network.proxy.ftp_port", int(proxy_port)) options.set_preference("network.proxy.socks", str(proxy_ip)) options.set_preference("network.proxy.socks_port", int(proxy_port)) options.set_preference( "network.http.use-cache", False)""" driver = webdriver.Firefox(options=options) # while loop for defined period using start and end dates while start < end: # create variable for to_date using time delta d2 = start + 
datetime.timedelta(2) # covert to string from_date = start.strftime("%d%m%Y") to_date = d2.strftime("%d%m%Y") logger.info(f'\n\n{from_date} to {to_date}\n\n') # iterate over each district for name in all_districts: each_district = dataCollection.EachDistrict(driver=driver, from_date=from_date, to_date=to_date, name=name) # open the page each_district.open_page(main_url=main_url) # enter the date each_district.enter_date() # enter name of the district each_district.district_selection() # set view records to view 50 records per page each_district.view_record() logger.info(f'\n\nName of the District: {name}\n') # click search and see if page is loaded # if not, put the district in remaining district csv # and start with new district if each_district.search(): pass else: logger.info(f"Search button didn't work with {name}." f" Going to next district\n") each_district.remaining_district() continue # check the data on each page and store if each_district.each_page(): pass else: continue start += datetime.timedelta(3) logger.info("all districts in given time frame finished finished. ") if __name__ == "__main__": main() 
""" Background: module of main code. """ # imports import logging import time from pathlib import Path import pandas as pd from selenium.common import NoSuchElementException, StaleElementReferenceException, TimeoutException, \ InvalidSessionIdException from selenium.webdriver.common.action_chains import ActionChains from selenium.webdriver.common.by import By from selenium.webdriver.support import expected_conditions as ec, expected_conditions from selenium.webdriver.support.ui import Select from selenium.webdriver.support.ui import WebDriverWait # A. logging logger = logging.getLogger(__name__) class EachDistrict: # class instantiation def __init__(self, driver, from_date, to_date, name): self.driver = driver self.from_date = from_date self.to_date = to_date self.name = name def open_page(self, main_url): # open the page and refresh. Without refresh, it won't work. self.driver.get(main_url) self.driver.refresh() def enter_date(self): WebDriverWait(self.driver, 30).until( ec.presence_of_element_located((By.CSS_SELECTOR, '#ContentPlaceHolder1_txtDateOfRegistrationFrom'))) from_date_field = self.driver.find_element(By.ID, "ContentPlaceHolder1_txtDateOfRegistrationFrom") to_date_field = self.driver.find_element(By.ID, "ContentPlaceHolder1_txtDateOfRegistrationTo") ActionChains(self.driver).click(from_date_field).send_keys( self.from_date).move_to_element(to_date_field).click().send_keys( self.to_date).perform() def district_selection(self): dist_list = Select(self.driver.find_element(By.CSS_SELECTOR, "#ContentPlaceHolder1_ddlDistrict")) dist_list.select_by_visible_text(self.name) def view_record(self): view = Select(self.driver.find_element(By.ID, 'ContentPlaceHolder1_ucRecordView_ddlPageSize')) view.select_by_value("50") # 6. function for click on search def search(self): # Apart from clicking on search # this function also check if the record is above 0, and page is loaded. # if the record is below 0 it adds the district remaining district list. self.driver.find_element(By.CSS_SELECTOR, '#ContentPlaceHolder1_btnSearch').click() # check if page has loaded after clicking the search buttion. Wait for 10 sec # if page not loaded, throw error and proceed to next district. try: (WebDriverWait(self.driver, 10).until( expected_conditions.text_to_be_present_in_element(( By.ID, 'ContentPlaceHolder1_gdvDeadBody_lblSrNo_0'), '1'))) logger.info("search clicked. records found") return True except (TimeoutError, NoSuchElementException, TimeoutException, StaleElementReferenceException): # add to remaining district - code to be added later. logger.info("page did not load after search", exc_info=True) return False # 7 check number of records def number_of_records(self): total_number = self.driver.find_element(By.CSS_SELECTOR, '#ContentPlaceHolder1_lbltotalrecord').text logger.info(f'\nTotal number of Cases: {total_number}') return total_number # 8 check for particular act def check_and_download(self): # check for PoA in table. # if found, click and download FIR. table = self.driver.find_element(By.ID, "ContentPlaceHolder1_gdvDeadBody") rows = table.find_elements(By.TAG_NAME, "tr") # iterate over each row for row in rows: cells = row.find_elements(By.TAG_NAME, "td") # iterate over each cell for cell in cells: cell_text = cell.text # if the act is found, count it. and take details. 
if "अनुसूचीत जाती आणि अनुसूचीत" in cell_text: download_link = row.find_element(By.TAG_NAME, "input") download_link.click() time.sleep(5) # logging logger.info("checking finished\n", exc_info=True) # writing data to file: # creates two sepearte files. def df_to_file(self): # get the table for cases from page # there is no need of wait actually as the page has already loaded and checked data = WebDriverWait(self.driver, 10).until(ec.presence_of_element_located(( By.CSS_SELECTOR, "#ContentPlaceHolder1_gdvDeadBody"))).get_attribute("outerHTML") all_df = pd.read_html(data) # 1. select 1st table as our intended dataframe # 2. drop last two rows as they are unnecessary # 3. drop column download as it has dyanamic link and not readable data. # 4. take df as output for next function. df_with_last_rows = all_df[0].drop(columns="Download") df = df_with_last_rows.drop(df_with_last_rows.tail(2).index) file_name = f'{self.name}_{self.from_date}_{self.to_date}.csv' dir_name = Path( f'/home/sangharsh/Documents/PoA/data/FIR/Year23/all_cases/' f'{self.from_date}_{self.to_date}') dir_name.mkdir(parents=True, exist_ok=True) df.to_csv(dir_name / file_name, index=False, mode='a', header=False) # file with cases with particular act poa_df = df[df['Sections'].str.contains("अनुसूचीत जाती आणि अनुसूचीत", na=False)] if len(poa_df.index) > 0: # while for call cases district wise file is maintained # for selected cases date wise file of all districts is maintained. poa_file = f'poa_{self.from_date}_{self.to_date}.csv' poa_dir_name = Path(f'/home/sangharsh/Documents/PoA/data/FIR/Year23/' f'poa_cases') poa_dir_name.mkdir(parents=True, exist_ok=True) poa_df.to_csv(poa_dir_name / poa_file, index=False, mode='a', header=False) else: pass def remaining_district(self): # creating a file to store district with dates where pages didn't load # it also tries to store the number of record of cases each district had dictionary = {'District': [self.name], 'from_date': str(self.from_date), 'to_date': str(self.to_date), 'number_of_record': [self.number_of_records()]} file_name = f'remaining_district_{self.from_date}_{self.to_date}.csv' dir_name = Path(f'/home/sangharsh/Documents/PoA/data/FIR/Year23/remaining_districts') dir_name.mkdir(parents=True, exist_ok=True) df = pd.DataFrame.from_dict(dictionary) df.to_csv(dir_name / file_name, mode='a', index=False, header=False) # 9 turn pages in loop and does further processing def each_page(self): # before calling next page # this function stores data on 1st page and then iterate over all pages total_number_of_records = self.number_of_records() logger.info(f"total number of records is : {total_number_of_records}" f"\np1 started") try: self.df_to_file() self.check_and_download() except (NoSuchElementException, TimeoutError, StaleElementReferenceException): self.remaining_district() logger.info("problem at p1") return False next_page_text = (f'//*[@id="ContentPlaceHolder1_gdvDeadBody"]' f'/tbody/tr[52]/td/table/tbody/tr/') i = 2 fifty = 51 while True: next_page_link = f'{next_page_text}td[{i}]/a' try: self.driver.find_element(By.XPATH, next_page_link).click() logger.info(f"p{i} clicked") except (TimeoutError, TimeoutException, InvalidSessionIdException, NoSuchElementException, StaleElementReferenceException): logger.info(f"pages finished. 
last page was p{i-1} ") return True try: WebDriverWait(self.driver, 10).until(expected_conditions.text_to_be_present_in_element( (By.ID, 'ContentPlaceHolder1_gdvDeadBody_lblSrNo_0'), f'{str(fifty)}')) logger.info(f"p{i} loaded") # check the act and download the copy logger.info("checking and downloading copies") self.df_to_file() self.check_and_download() except (TimeoutError, TimeoutException, InvalidSessionIdException, NoSuchElementException, StaleElementReferenceException): # close the driver. last_page = i-1 logger.warning(f" problem @ p{last_page}", exc_info=True) self.remaining_district() return False # for going to next page and checking if next page is loaded: i += 1 fifty += 50 

    1 Answer


    Be judicious with the use of comments:

    # create console handler and set level to info
    ch = logging.StreamHandler()
    ch.setLevel(logging.WARNING)

    This one claims that the level is set to "info", but I see "logging.WARNING" instead? The comment is not just redundant, it is misleading.
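
    One possible fix is simply to make the comment match the code, or drop it altogether; for instance:

    # console handler: warnings and above also go to the terminal
    ch = logging.StreamHandler()
    ch.setLevel(logging.WARNING)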


    # while loop for defined period using start and end dates
    while start < end:
        # create variable for to_date using time delta
        d2 = start + datetime.timedelta(2)
        # covert to string
        from_date = start.strftime("%d%m%Y")
        to_date = d2.strftime("%d%m%Y")
        logger.info(f'\n\n{from_date} to {to_date}\n\n')
        # iterate over each district
        for name in all_districts:

    Yes, one can see from while that it is a while loop. Why do you find it necessary to comment it?

    The rest of the comments add nothing useful either, and should be elided.


    # open the page
    each_district.open_page(main_url=main_url)
    # enter the date
    each_district.enter_date()
    # enter name of the district
    each_district.district_selection()

    Here you are just repeating what the names of the functions already say explicitly. open_page() does not need a description.


    # set view records to view 50 records per page
    each_district.view_record()

    I did not understand this comment. Is 50 perhaps the default behavior? Or is the comment older than the code?


    # A. logging
    logger = logging.getLogger(__name__)

    Um..? There are three instances of log in the line following the comment. This wasn't necessary either.

    # class instantiation
    def __init__(self, driver, from_date, to_date, name):

    This is not needed either.


    # open the page and refresh. Without refresh, it won't work.
    self.driver.get(main_url)
    self.driver.refresh()

    The second part of the comment is better. It actually says something meaningful. But it doesn't state why. Why won't it work without refreshing?


    # Apart from clicking on search
    # this function also check if the record is above 0, and page is loaded.
    # if the record is below 0 it adds the district remaining district list.

    This would serve better as a docstring.
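
    For example, a sketch of the same information as a docstring (keeping the existing body):

    def search(self):
        """Click Search and check that at least one record has loaded.

        Returns True when the first result row appears within 10 seconds;
        otherwise logs the failure and returns False so the caller can add
        the district to the remaining-districts list.
        """
        ...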


    # iterate over each row
    for row in rows:
        cells = row.find_elements(By.TAG_NAME, "td")
        # iterate over each cell
        for cell in cells:

    Again, you don't need to explain that loops iterate.

    In short, the program suffers from over-commenting. Most of the comments are either superfluous, misleading, or ambiguous, and should be elided.

    Also consider using isort to sort the imports automatically.

    • "Edited the code and updated the git file. Many thanks." – sangharsh, Apr 10, 2024 at 14:42
