Question
I recently wrote a web scraping program that gets data from Google Trends through its RSS feed. My question is as follows:
How can I improve my code so that it is more concise, efficient, and pleasing to a programmer's eye? I am most concerned about whether I have misused any functions, methods, or syntax, and whether the code could be 'cleaner'.
Context
- I'm not experienced at all. I'm doing this to learn.
- Security isn't an issue since the program is running locally.
The code
"""Uses webscraping to retrieve html information from Google Trends for parsing"""

# Imports
import time
import ssl

from requests import get
from bs4 import BeautifulSoup

# Url for request
url = "https://trends.google.com/trends/trendingsearches/daily/rss?geo=US"

# Used to create an unverified certificate.
ssl._create_default_https_context = ssl._create_unverified_context


class connection(object):
    def __init__(self, f):
        self.f = f

    def __call__(self):
        """Calculates the time taken to complete the request without any errors"""
        try:
            # Times the amount of time the function takes to complete
            global html
            start = time.time()
            html = get(url)
            self.f()
            end = time.time()
            time_taken = end - start
            result = print("the function {} took {} to connect".format(self.f.__name__, time_taken))
            return result
        except:
            print("function {} failed to connect".format(self.f.__name__))


@connection
def html_parser():
    """Parses the html into storable text"""
    html_unparsed = html.text
    soup = BeautifulSoup(html_unparsed, "html.parser")
    titles_unformatted = soup.find_all("title")
    traffic_unformatted = soup.find_all("ht:approx_traffic")
    # Indexes through list to make data readable
    titles = [x.text for x in titles_unformatted]
    traffic = []
    for x in traffic_unformatted:
        x = x.text
        x = x.replace("+","")
        x = x.replace(",", "")
        traffic.append(x)
    print(traffic, titles)


html_parser()
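To show what I understand the `@connection` line to be doing, here is a minimal offline sketch of the same class-as-decorator mechanism. `Timed` and `do_nothing` are hypothetical names that are not part of my program, there is no network request, and I use `time.perf_counter` here instead of `time.time`, so treat it as an illustration of the mechanism rather than a replacement for my code.

```python
import time


class Timed:
    """Hypothetical stand-in for the connection class, without the request."""

    def __init__(self, f):
        # The decorated function is stored on the instance.
        self.f = f

    def __call__(self):
        # Calling the decorated name runs this method, which runs the original function.
        start = time.perf_counter()
        self.f()
        return time.perf_counter() - start  # elapsed seconds


@Timed
def do_nothing():
    """Placeholder for html_parser."""


# After decoration, do_nothing is a Timed instance, not a plain function.
elapsed = do_nothing()
print(type(do_nothing).__name__, elapsed)
```

If I read this correctly, it also means the name `html_parser` in my program refers to a `connection` instance after decoration, and the original function is only reachable through `self.f`.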
Output
['100000', '2000000', '2000000', '1000000', '1000000', '1000000', '1000000', '500000', '500000', '500000', '200000', '200000', '200000', '200000', '200000', '200000', '200000', '200000', '200000', '200000'] ['Daily Search Trends', '23 and Me', 'NFL scores', 'Best Buy', 'Walmart Supercenter', 'NFL', 'Saints', 'GameStop', 'JCPenney', 'Lion King 2019', 'Starbucks', 'Dollar General', 'Amazon Black Friday', 'Mike Posner', 'NFL games On Thanksgiving', 'McDonalds', 'Bath And Body Works Black Friday', 'Old Navy Black Friday 2018', 'Kroger', 'NFL standings', 'Safeway']
the function html_parser took 1.0186748504638672 to connect
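The traffic strings in the output come from the two `replace` calls in the loop. As a minimal sketch of that clean-up step in isolation, assuming the feed values look like `100,000+` (the sample strings below are made up for illustration, and `clean_traffic` is a hypothetical helper name that does not appear in my program):

```python
def clean_traffic(raw):
    """Remove the '+' suffix and the thousands separators from one traffic string."""
    return raw.replace("+", "").replace(",", "")


# Assumed raw values, written by hand to resemble the feed; not real feed data.
sample = ["100,000+", "2,000,000+"]

cleaned = [clean_traffic(value) for value in sample]
print(cleaned)  # ['100000', '2000000']
```

The results are still strings, as in my output above; wrapping the call in `int()` would be needed to get numbers.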
Concerns
- The code might make programmers cringe.
As someone relatively new to Python and programming in general, my worst fear is that this code gives someone else a headache or a laugh to look at. At the end of the day I just want to improve, so to reiterate: how can I make this code look better? Have I misused any functions?