
Question

Recently I have tried to create a web scraping program to get data from Google Trends. It uses an RSS feed to do so. My question is as follows:

How can I improve my code so that it is more concise, efficient, and pleasing to a programmer's eye? In particular, I'd like to know whether I've misused any functions, methods, or syntax, and whether the code could be 'cleaner'.

Context

  • I'm not experienced at all. I'm doing this to learn.
  • Security isn't an issue since the program is running locally.

The code

"""Uses webscraping to retrieve html information from Google Trends for parsing""" # Imports import time import ssl from requests import get from bs4 import BeautifulSoup # Url for request url = "https://trends.google.com/trends/trendingsearches/daily/rss?geo=US" # Used to create an unverified certificate. ssl._create_default_https_context = ssl._create_unverified_context class connection(object): def __init__(self, f): self.f = f def __call__(self): """Calculates the time taken to complete the request without any errors""" try: # Times the amount of time the function takes to complete global html start = time.time() html = get(url) self.f() end = time.time() time_taken = end - start result = print("the function {} took {} to connect".format(self.f.__name__, time_taken)) return result except: print("function {} failed to connect".format(self.f.__name__)) @connection def html_parser(): """Parses the html into storable text""" html_unparsed = html.text soup = BeautifulSoup(html_unparsed, "html.parser") titles_unformatted = soup.find_all("title") traffic_unformatted = soup.find_all("ht:approx_traffic") # Indexes through list to make data readable titles = [x.text for x in titles_unformatted] traffic = [] for x in traffic_unformatted: x = x.text x = x.replace("+","") x = x.replace(",", "") traffic.append(x) print(traffic, titles) html_parser() 

Output

```
['100000', '2000000', '2000000', '1000000', '1000000', '1000000', '1000000', '500000', '500000', '500000', '200000', '200000', '200000', '200000', '200000', '200000', '200000', '200000', '200000', '200000'] ['Daily Search Trends', '23 and Me', 'NFL scores', 'Best Buy', 'Walmart Supercenter', 'NFL', 'Saints', 'GameStop', 'JCPenney', 'Lion King 2019', 'Starbucks', 'Dollar General', 'Amazon Black Friday', 'Mike Posner', 'NFL games On Thanksgiving', 'McDonalds', 'Bath And Body Works Black Friday', 'Old Navy Black Friday 2018', 'Kroger', 'NFL standings', 'Safeway']
the function html_parser took 1.0186748504638672 to connect
```

Concerns

  • Makes programmers cringe.

As someone relatively new to Python and programming in general, my worst fear is that this code gives someone else a headache, or a laugh, to look at. At the end of the day I just want to improve, so to reiterate: how can I make this code look better? Have I misused any functions?

  • Security is an issue if you're reading data from the Internet and processing it. Flaws in interpreting untrustworthy data have resulted in attacks on Web browser image parsing, for example. – Commented Mar 22, 2022 at 16:09

1 Answer


The idea behind your script is pretty cool and there are a couple of ways to polish it up! While it certainly wouldn't cause a headache, I think that creating an entire class just to measure how long the request takes to complete is a little overkill.

Within the class itself, let's look at two things that aren't best practice:

  • Use of a global variable
  • Use of the try/except block

Global variables

There's a brilliant StackOverflow question with really good answers as to why using global variables should be discouraged. While in the scope of your script it is not harmful, imagine you had many functions and methods in your script. Using the global variable could decrease readability and potentially alter the behaviour of your code. Mind you, global variables aren't the same as global constants (also discussed in the StackOverflow question).
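As a minimal sketch of the alternative: rather than publishing the response through `global html`, the function that produces the data can simply return it, and the function that needs it can take it as a parameter. (The `fetch` stand-in below is hypothetical; it stands in for the real network request.)

```python
def fetch():
    # Stand-in for the real request; returns the "response" directly
    # instead of assigning it to a global.
    return "<rss>...</rss>"

def parse(document):
    # Receives its input explicitly, so the data flow is visible
    # from the signature alone.
    return document.upper()

html = fetch()        # no `global html` needed anywhere
result = parse(html)  # the dependency on `html` is explicit
```

With this shape, every function can be read, tested, and reused on its own, because nothing reaches outside its own scope.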

Exception handling

Using a bare except statement will handle all exceptions, even the ones you might not anticipate. Catch errors explicitly, otherwise you might cause unexpected behaviour in your code, should it become more complex. In this scope, your bare except statement should become more like this:

```python
try:
    ...
except requests.exceptions.ConnectionError:
    print("failed to connect")
```

This way, you are handling the case where your request actually fails and you let other errors happen in their intended way. Otherwise, a different exception could have been raised and you won't know what caused it.
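A hypothetical illustration of the failure mode, using only built-in exceptions: the deliberate typo below raises a `NameError` that has nothing to do with the network, yet the bare handler reports it as a connection failure, so the real bug stays hidden.

```python
def fetch_page(get_page):
    try:
        return get_pag()  # typo: NameError, not a network problem
    except:               # bare except swallows it anyway
        return "failed to connect"

def fetch_page_strict(get_page):
    try:
        return get_pag()        # same typo...
    except ConnectionError:     # ...but only network errors are caught,
        return "failed to connect"  # so the NameError surfaces immediately

print(fetch_page(lambda: "<rss/>"))  # misleadingly reports a connection failure
```

The strict version crashes loudly on the typo, which is exactly what you want while the script is still evolving.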

Additionally, on this class:

  • Following naming conventions, class names should use CapWords.
  • In Python 3, classes no longer need to inherit explicitly from object.

Instead of the class, you could just define a simple function which retrieves the XML document from the RSS feed, declares the timing functionality within it, and returns the XML document. Seeing as the URL query takes what's seemingly an ISO 3166-2 country code as one of its parameters, you could pass the country code into this function and gain the ability to fetch search trends for many different countries! Please note that in the following snippet I am using f-strings, which were introduced in Python 3.6.

```python
def fetch_xml(country_code):
    url = f"https://trends.google.com/trends/trendingsearches/daily/rss?geo={country_code}"
    start = time.time()
    response = requests.get(url)
    response_time = time.time() - start
    print(f"The request took {response_time}s to complete.")
    return response.content
```
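As a small variation on the timing above: the standard library recommends `time.perf_counter()` for measuring elapsed intervals, since `time.time()` can jump if the system clock is adjusted mid-request. A sketch, with a hypothetical `timed` helper standing in for the request:

```python
import time

def timed(label, func):
    # Hypothetical helper: run `func`, report its elapsed wall time,
    # and pass its result through unchanged.
    start = time.perf_counter()
    result = func()
    elapsed = time.perf_counter() - start
    print(f"{label} took {elapsed:.3f}s to complete.")
    return result

content = timed("the request", lambda: "<rss>...</rss>")
```

`perf_counter()` is monotonic, so the difference between two readings is always a valid duration.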

Parser

Looking at the docstring of your parser, one might argue it's a little misleading. You're mentioning "storable text" while you're only printing your output! To come closer to the intention of the docstring, it would be better to return the parsed data and decide what to do with it later. I think a dictionary would be a very fitting data structure in this case. Dictionaries store key-value pairs, so let's use the trend title as the key and the traffic approximation as the value.

Using re.sub() will allow you to remove different characters from a string in a single line, rather than replacing each character separately! This piece of code here:

```python
for x in traffic_unformatted:
    x = x.text
    x = x.replace("+", "")
    x = x.replace(",", "")
```

Can now become:

```python
for x in traffic_unformatted:
    x = re.sub("[+,]", "", x.text)
```

Zip the two lists of titles and traffic approximations together and iterate over them to create the dictionary. I am going to use a dictionary comprehension in the final script, however it works identically to the following.

```python
titles = soup.find_all("title")
approximate_traffic = soup.find_all("ht:approx_traffic")

trends = {}
for title, traffic in zip(titles[1:], approximate_traffic):
    trends[title.text] = re.sub("[+,]", "", traffic.text)
return trends
```

Please pay attention to the titles list having its first element sliced off - this is because BeautifulSoup picked up the title of the entire XML document together with the titles of the trends.
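You can see the extra title with the standard library alone. The toy feed below is a made-up miniature of the real RSS document: the channel carries its own `<title>` alongside the per-item trend titles, which is exactly what the `[1:]` slice discards.

```python
import xml.etree.ElementTree as ET

# Toy RSS document: the <channel> has a <title> of its own,
# in addition to one <title> per trend item.
feed = """
<rss><channel>
  <title>Daily Search Trends</title>
  <item><title>NFL scores</title></item>
  <item><title>Best Buy</title></item>
</channel></rss>
"""

titles = [el.text for el in ET.fromstring(feed).iter("title")]
print(titles)      # first entry is the channel title, not a trend
print(titles[1:])  # slicing it off leaves only the trend titles
```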

Finally, I am using an if __name__ == "__main__" guard to separate functions from the block of code where I am calling the functions from.

Entire script:

```python
import time
import re

import requests
from bs4 import BeautifulSoup


def fetch_xml(country_code):
    url = f"https://trends.google.com/trends/trendingsearches/daily/rss?geo={country_code}"
    start = time.time()
    response = requests.get(url)
    response_time = time.time() - start
    print(f"The request took {response_time}s to complete.")
    return response.content


def trends_retriever(country_code):
    """Parses the Google Trends RSS feed using BeautifulSoup.

    Returns:
        dict: Trend title for key, trend approximate traffic for value.
    """
    xml_document = fetch_xml(country_code)
    soup = BeautifulSoup(xml_document, "lxml")
    titles = soup.find_all("title")
    approximate_traffic = soup.find_all("ht:approx_traffic")
    return {title.text: re.sub("[+,]", "", traffic.text)
            for title, traffic in zip(titles[1:], approximate_traffic)}


if __name__ == '__main__':
    trends = trends_retriever("US")
    print(trends)
```

Output:

```
The request took 0.2512788772583008s to complete.
{'Tiger vs Phil': '2000000', 'McKenzie Milton': '200000', 'Costco Black Friday': '200000', 'Nebraska Football': '200000', 'Patagonia': '100000', 'Michael Kors': '100000', 'Jcrew': '100000', 'finish line': '100000', 'WVU football': '100000', 'Texas football': '100000', 'Shoe Carnival': '100000', 'J Crew': '100000', 'Llbean': '100000', 'Cards Against Humanity': '100000', 'Bleacher Report': '100000', 'Steph Curry': '50000', 'Apple Cup': '50000', 'Bob McNair': '50000', 'Virginia Tech football': '50000', 'Glossier': '50000'}
```
