
Below is an exercise project I was doing on a certain educational site. It is supposed to fetch a given URL (a static address in this example), parse the HTML, find the articles of a given type, and then save their text content to your drive.

It is working fine, but since there are no code reviews on that site, I'm somewhat worried that I'm doing something wrong without understanding it. Specifically, what I'm struggling to understand is in which cases, or to what aim, I should use classes in Python.

What I want to ask is for someone to tell me whether the code below is a valid application of OOP (and whether an OOP approach should even be used for such a task). And if not, why, and what I should be looking to improve.

```python
import string
import os
import requests
from dataclasses import dataclass, field
from typing import ClassVar
from bs4 import BeautifulSoup as beauty


@dataclass
class WebScraper:
    # Shared across all instances: maps 'Page_N' -> {file_name: article_text}.
    CACHE: ClassVar[dict] = dict()
    BASE_URL: ClassVar[str] = 'https://www.nature.com'
    PAGE_URL: ClassVar[str] = 'https://www.nature.com/nature/articles?sort=PubDate&year=2020'

    page_number: int
    art_type: str
    raw_text: str = field(init=False)
    art_urls: dict = field(init=False)
    page_content: dict = field(init=False)

    @classmethod
    def save_content(cls):
        # Write everything accumulated in CACHE: one directory per page,
        # one .txt file per article.
        for page, articles in cls.CACHE.items():
            os.mkdir(page)
            for name, text in articles.items():
                path = os.path.join(page, f'{name}.txt')
                with open(path, 'wt', encoding='utf-8') as file:
                    file.write(text)

    def __post_init__(self):
        self.page_content = dict()
        url = self.PAGE_URL + f"&page={self.page_number}"
        self.raw_text = requests.get(url, headers={'Accept-Language': 'en-US,en;q=0.5'}).text
        self.art_urls = dict()
        self.parse_page()

    def parse_page(self):
        soup = beauty(self.raw_text, 'html.parser')
        articles = soup.find_all('article')
        for art in articles:
            span = art.find('span', {'class': 'c-meta__type'})
            if span.text == self.art_type:
                link = art.find('a')
                url = self.BASE_URL + link['href']
                self.art_urls[link.text] = url

    def parse_articles(self):
        for name, url in self.art_urls.items():
            r = requests.get(url, headers={'Accept-Language': 'en-US,en;q=0.5'})
            soup = beauty(r.text, 'html.parser')
            article = soup.find('div', {'class': 'c-article-body'})
            file_name = ''.join(ch if ch not in string.punctuation else '' for ch in name)
            file_name = file_name.replace(' ', '_')
            self.page_content[file_name] = article.get_text()

    def cache_content(self):
        self.CACHE[f'Page_{self.page_number}'] = self.page_content


n_of_pages = int(input())
art_type = input()

for page_num in range(1, n_of_pages + 1):
    scraper = WebScraper(page_num, art_type)
    scraper.parse_articles()
    scraper.cache_content()

WebScraper.save_content()
```
  • It's Scraper, not Scrapper. (Feb 11, 2022 at 22:17)
  • Okay, amended. :D (Feb 12, 2022 at 11:09)

1 Answer


beauty is not a conventional alias for BeautifulSoup; you're better off using the original.

Your cache mechanism is a little strange, and should probably be deleted. It's not as if you've shown an access pattern that warrants this.

The OOP design hasn't particularly captured the right things. The "object" you care about most is an article, and there is no class for that in your implementation. Conversely, WebScraper isn't very useful as a class, since it doesn't need to carry around any state; it is probably better as a small collection of free functions. These functions can handle depagination through a generator (see scrape in the suggested code below).

Note that save_content doesn't save the content: it only saves the journal abstracts.

t is the default mode for open and can be omitted.
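A standalone sketch of the shorter call:

```python
# 'w' already means text mode; 'wt' is redundant.
with open('example.txt', 'w', encoding='utf-8') as file:
    file.write('text mode is the default')
```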

You jump through a few hoops to derive a filename from the title that will (probably) be compatible with your filesystem. This ends up with the worst of all worlds: a title that doesn't look as good as the original, and still no guarantee of filesystem compatibility. Instead, there's a perfectly good article ID as the last part of the URL that you can use. If you still wanted to title-mangle for a filename, consider a regex sub instead, as sketched below.
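For instance, a minimal sketch of that regex approach (the pattern and function name here are my own, not from the original code):

```python
import re

def title_to_filename(title: str) -> str:
    # Collapse every run of non-word characters into one underscore,
    # then trim any underscores left at the ends.
    return re.sub(r'\W+', '_', title).strip('_')

print(title_to_filename('Nociceptive nerves regulate haematopoietic stem cell mobilization!'))
# Nociceptive_nerves_regulate_haematopoietic_stem_cell_mobilization
```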

I don't find it useful to carry the concept of a page, which only has meaning in the web search interface, to your filesystem. You could just save these files flat.

The broader question is: why? When you pull an abstract and discard its DOM structure, the resulting text has lost all of its formatting, including some fairly important superscript citations, etc. Why not just save the HTML? You could save an index csv with the additional fields that I've demonstrated you can parse, and then one .html per article.
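As a rough sketch of that layout (this assumes the Article class from the suggested code further down; the save_all helper is a name of my own invention):

```python
import csv

def save_all(articles_with_html) -> None:
    """Given (Article, html) pairs, write index.csv plus one .html per article."""
    with open('index.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(('id', 'title', 'article_type', 'date', 'url'))
        for article, html in articles_with_html:
            writer.writerow((
                article.id, article.title, article.article_type,
                article.date.isoformat(), article.url,
            ))
            # Save the raw HTML so the DOM structure survives.
            with open(f'{article.id}.html', 'w', encoding='utf-8') as out:
                out.write(html)
```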

When you're able, use a soup strainer to constrain the parse breadth.
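Applied to your parse_page, that might look like this (a self-contained sketch; it strains on the article tags you already search for):

```python
from bs4 import BeautifulSoup, SoupStrainer

raw_text = '<html><body><article>one</article><p>ignored</p></body></html>'

# Only build DOM nodes for <article> elements; the rest of the markup is
# skipped during parsing, which saves time and memory on big pages.
soup = BeautifulSoup(raw_text, 'html.parser', parse_only=SoupStrainer('article'))
print(soup.find_all('article'))
```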

The Accept-Language header that you pass is less important (and is ignored by the server); the more important header is Accept, which should be passed to say that you only understand text/html.
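In your requests.get calls that would be, for example:

```python
import requests

# Ask for HTML only; Accept-Language can be dropped since the server ignores it.
r = requests.get('https://www.nature.com/nature/articles', headers={'Accept': 'text/html'})
```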

Don't call input() without passing a prompt; give the user a clue as to why your program has mysteriously hung.
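For example (the prompt wording here is illustrative only):

```python
n_of_pages = int(input('Number of pages to scrape: '))
art_type = input('Article type to collect: ')
```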

Suggested

This does not include file saving.

```python
from datetime import date
from textwrap import wrap
from typing import NamedTuple, Iterator, Optional
from urllib.parse import urljoin

from bs4 import BeautifulSoup
from bs4.element import SoupStrainer, Tag
from requests import Session

BASE_URL = 'https://www.nature.com'
# Strainers restrict parsing to just the parts of the page we need.
SEARCH_STRAINER = SoupStrainer('ul', class_='app-article-list-row')
ARTICLE_STRAINER = SoupStrainer('div', id='Abs1-content')


class Article(NamedTuple):
    title: str
    authors: tuple[str, ...]
    article_type: str
    date: date
    url: str
    description: Optional[str]

    @classmethod
    def from_tag(cls, tag: Tag) -> 'Article':
        anchor = tag.find('a', class_='c-card__link')
        authors = tuple(
            span.text
            for span in tag.find('span', itemprop='name')
        )
        article_type = tag.find('span', class_='c-meta__type').text
        article_date = date.fromisoformat(
            tag.find('time')['datetime']
        )
        description = tag.find('div', itemprop='description')

        return cls(
            title=anchor.text,
            url=urljoin(BASE_URL, anchor['href']),
            authors=authors,
            article_type=article_type,
            date=article_date,
            description=description and description.text.strip(),
        )

    @property
    def id(self) -> str:
        return self.url.rsplit('/', 1)[1]


def get_abstract(session: Session, url: str) -> str:
    with session.get(
        url=url,
        headers={'Accept': 'text/html'},
    ) as resp:
        resp.raise_for_status()
        html = resp.text

    doc = BeautifulSoup(markup=html, features='html.parser', parse_only=ARTICLE_STRAINER)
    return doc.text


def articles_from_html(html: str) -> Iterator['Article']:
    doc = BeautifulSoup(markup=html, features='html.parser', parse_only=SEARCH_STRAINER)
    for tag in doc.find_all('article'):
        yield Article.from_tag(tag)


def scrape_one(
    session: Session,
    article_type: str,
    year: int,
    page: int,
    sort: str = 'PubDate',
) -> Iterator[Article]:
    params = {
        'searchType': 'journalSearch',
        'sort': sort,
        'type': article_type,
        'year': year,
        'page': page,
    }
    with session.get(
        url=urljoin(BASE_URL, '/nature/articles'),
        headers={'Accept': 'text/html'},
        params=params,
    ) as resp:
        resp.raise_for_status()
        html = resp.text

    yield from articles_from_html(html)


def scrape(
    session: Session,
    article_type: str,
    year: int,
    pages: int,
    sort: str = 'PubDate',
) -> Iterator[Article]:
    for page in range(1, pages+1):
        yield from scrape_one(session, article_type, year, page, sort)


def test() -> None:
    with Session() as session:
        for article in scrape(
            session=session,
            article_type='article',
            year=2020,
            pages=2,
        ):
            print(article)
            print('\n'.join(wrap(
                get_abstract(session, article.url)
            )))
            print()


if __name__ == '__main__':
    test()
```

Output

```
Article(title='Nociceptive nerves regulate haematopoietic stem cell mobilization', authors=('Xin Gao',), article_type='Article', date=datetime.date(2020, 12, 23), url='https://www.nature.com/articles/s41586-020-03057-y', description='Stimulation of pain-sensing neurons, which can be achieved in mice by the ingestion of capsaicin, promotes the migration of haematopoietic stem cells from the bone marrow into the blood.')

Haematopoietic stem cells (HSCs) reside in specialized microenvironments in the bone marrow—often referred to as ‘niches’—that represent complex regulatory milieux influenced by multiple cellular constituents, including nerves1,2. Although sympathetic nerves are known to regulate the HSC niche3,4,5,6, the contribution of nociceptive neurons in the bone marrow remains unclear. Here we show that nociceptive nerves are required for enforced HSC mobilization and that they collaborate with sympathetic nerves to maintain HSCs in the bone marrow. Nociceptor neurons drive granulocyte colony-stimulating factor (G-CSF)-induced HSC mobilization via the secretion of calcitonin gene-related peptide (CGRP). Unlike sympathetic nerves, which regulate HSCs indirectly via the niche3,4,6, CGRP acts directly on HSCs via receptor activity modifying protein 1 (RAMP1) and the calcitonin receptor-like receptor (CALCRL) to promote egress by activating the Gαs/adenylyl cyclase/cAMP pathway. The ingestion of food containing capsaicin—a natural component of chili peppers that can trigger the activation of nociceptive neurons—significantly enhanced HSC mobilization in mice. Targeting the nociceptive nervous system could therefore represent a strategy to improve the yield of HSCs for stem cell-based therapeutic agents.

...
```
  • Thank you very much for such a detailed response. Although some elements of your solution went straight over my head (I've never used Session or yield before, and I don't understand how the from_tag method's return type can be the string(?) 'Article'), I'm looking forward to deciphering it all. :) (Feb 16, 2022 at 16:38)
  • This comments section isn't the right place to explain those topics, but I am happy to help in chat.stackexchange.com/rooms/134182/… (Feb 16, 2022 at 16:41)
