\$\begingroup\$

I'm relatively new to Python, and for an assignment I had to write a program that fetches a webpage, parses it with BeautifulSoup, extracts all paragraphs from it, finds all words ending in "ing", and finally saves them to a file in the format word + tab + word count + newline.

This is my code so far. Is there a more Pythonic way to handle this? Or, more generally, are there ways to improve the code?

    from bs4 import BeautifulSoup
    import requests
    import re


    def main():
        site = "https://en.wikipedia.org/wiki/Data_science"
        r = requests.get(site).content
        soup = BeautifulSoup(r)
        ps = soup.findAll('p')

        fulltext = ''
        for p in ps:
            fulltext += p.get_text()

        words = match_words(fulltext)
        formated_words = sort_and_format(words)

        with open(r"Q1_Part1.txt","w") as file:
            file.write(formated_words)


    def match_words(string):
        pattern = re.compile(r'\b(\w*ing)\b')
        words = re.findall(pattern, string.lower())
        matching_words = {}
        for word in words:
            if word in matching_words:
                matching_words[word] += 1
            else:
                matching_words[word] = 1
        return matching_words


    def sort_and_format(dict):
        ordered_keys = sorted(dict, key=dict.get, reverse=True)
        output_string = ''
        for r in ordered_keys:
            output_string += f"{r}\t{dict[r]}\n"
        return output_string


    main()
\$\endgroup\$

    1 Answer

    \$\begingroup\$
        if word in matching_words:
            matching_words[word] += 1
        else:
            matching_words[word] = 1

    If you're checking if a dictionary has a key before adding to it, a defaultdict may be a better option:

        from collections import defaultdict

        matching_words = defaultdict(int)

        matching_words[word] += 1

    int returns a 0 when called without arguments, and that 0 is used as a default value for the dictionary when the key doesn't exist.
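As a rough sketch of how the whole of match_words might look with a defaultdict (same regex and parameter name as in the question; the sample sentence at the bottom is made up for illustration):

```python
import re
from collections import defaultdict


def match_words(string):
    pattern = re.compile(r'\b(\w*ing)\b')
    # defaultdict(int) supplies 0 for a missing key,
    # so no membership check is needed before incrementing.
    matching_words = defaultdict(int)
    for word in pattern.findall(string.lower()):
        matching_words[word] += 1
    return matching_words


# Made-up sample input, just to show the counting.
counts = match_words("Singing and running; running is tiring.")
```

One thing to keep in mind: reading a missing key from a defaultdict inserts it, so use `in` or `.get` if you only want to check whether a word was seen.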


        fulltext = ''
        for p in ps:
            fulltext += p.get_text()

    This isn't very efficient. Performance of += on strings has gotten better in later versions of Python, but it's still generally slower than building the string in one go. The typical alternative is using join:

        pieces = [p.get_text() for p in ps]
        fulltext = "".join(pieces)

        # Or just
        fulltext = "".join([p.get_text() for p in ps])

    Then similarly in sort_and_format:

        output_string = "".join([f"{r}\t{dict[r]}\n" for r in ordered_keys])

    In sort_and_format, you've named the parameter dict. This is suboptimal for a couple reasons:

    • dict is a generic name that doesn't properly describe the data.
    • dict is the name of a built-in class, and shadowing it makes your code more confusing, and prevents you from using the built-in.

    Indicating the type can be helpful though, so I might introduce type hints here:

        from typing import Dict

        def sort_and_format(words: Dict[str, int]) -> str:
            ...

    This says that the function accepts a dictionary mapping strings to ints, and returns a string.


    Also for sort_and_format, I've found that when you start sticking "and" into names, that can suggest that the function is doing too much. You may find that the code makes more sense if the sorting and formatting happen separately. The formatting function can then handle purely formatting, and can be handed a sequence to work with instead. If that sequence is sorted, great; if not, also great. For the purposes of formatting, the sort order doesn't matter.
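One way that split could look, as a sketch only: the names sort_by_count and format_counts are placeholders I made up, and the sample dictionary is invented for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def sort_by_count(words: Dict[str, int]) -> List[Tuple[str, int]]:
    """Return (word, count) pairs, most frequent first."""
    return sorted(words.items(), key=lambda pair: pair[1], reverse=True)


def format_counts(pairs: Iterable[Tuple[str, int]]) -> str:
    """Format each pair as word + tab + count + newline, in the order given."""
    return "".join(f"{word}\t{count}\n" for word, count in pairs)


# Invented sample data.
words = {"being": 1, "learning": 3, "mining": 2}
output = format_counts(sort_by_count(words))
```

format_counts no longer cares whether its input is sorted, so it can be reused or tested on its own with any sequence of pairs.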

    \$\endgroup\$
    • Thank you for the insights. Everything was outstandingly helpful. List comprehensions are so useful in Python data wrangling. Also, do you think it is good practice in Python to specify the inputs and outputs? I come from Java and Scala, and this is one thing which I feel is off and which came to bite me in the ass lots of times.
      – elauser
      Commented Sep 21, 2020 at 0:03
    • @elauser No problem. Glad to help. What do you mean, though, by "do you think it is good practice in Python to specify the inputs and outputs"? The types of the input/output?
      Commented Sep 21, 2020 at 0:10
    • Parameters and return types in functions is what I meant.
      – elauser
      Commented Sep 21, 2020 at 1:30
    • @elauser I recommend using type hints, and also using an IDE like PyCharm that gives type warnings. I find it helps a lot, but it means you need to take the time to consider what types are involved (although that isn't necessarily a bad thing). If you check the Python code that I've posted here for review, you'll see that I use them everywhere.
      Commented Sep 21, 2020 at 1:34
