beauty
is not a conventional alias for BeautifulSoup
; you're better off using the original.
Your cache mechanism is a little strange, and should probably be deleted. It's not as if you've shown an access pattern that warrants this.
The OOP design hasn't particularly captured the right things. The "object" you care about most is an article and there is no class for that in your implementation. Conversely, WebScraper
isn't very useful as a class since it doesn't need to carry around any state, and is probably better as a small collection of free functions. These functions can handle depagination as expressed through a generator.
Note that save_content
doesn't save the content: it only saves the journal abstracts.
t
is the default mode for open
and can be omitted.
You jump through a few hoops to derive a filename from the title that will be (probably) compatible with your filesystem. This ends up with the worst of all worlds: a title that doesn't look as good as the original, which still has no guarantee of filesystem compatibility. Instead, there's a perfectly good article ID as the last part of the URL that you can use. If you still did want to title-mangle for a filename, consider instead a regex sub
.
I don't find it useful to carry the concept of a page
, which only has meaning in the web search interface, to your filesystem. You could just save these files flat.
The broader question is: why? When you pull an abstract and discard its DOM structure, the resulting text has lost all of its formatting, including some fairly important superscript citations, etc. Why not just save the HTML? You could save an index csv
with the additional fields that I've demonstrated you can parse, and then one .html
per article.
When you're able, use a soup strainer to constrain the parse breadth.
The Accept-Language
that you pass is less important (and is ignored by the server); the more important Accept
should be passed which says that you only understand text/html
.
Don't input()
without passing a prompt to give the user a clue as to why your program mysteriously hung.
Suggested
This does not include file saving.
from datetime import date from textwrap import wrap from typing import NamedTuple, Iterator, Optional from urllib.parse import urljoin from bs4 import BeautifulSoup from bs4.element import SoupStrainer, Tag from requests import Session BASE_URL = 'https://www.nature.com' SEARCH_STRAINER = SoupStrainer('ul', class_='app-article-list-row') ARTICLE_STRAINER = SoupStrainer('div', id='Abs1-content') class Article(NamedTuple): title: str authors: tuple[str, ...] article_type: str date: date url: str description: Optional[str] @classmethod def from_tag(cls, tag: Tag) -> 'Article': anchor = tag.find('a', class_='c-card__link') authors = tuple( span.text for span in tag.find('span', itemprop='name') ) article_type = tag.find('span', class_='c-meta__type').text article_date = date.fromisoformat( tag.find('time')['datetime'] ) description = tag.find('div', itemprop='description') return cls( title=anchor.text, url=urljoin(BASE_URL, anchor['href']), authors=authors, article_type=article_type, date=article_date, description=description and description.text.strip(), ) @property def id(self) -> str: return self.url.rsplit('/', 1)[1] def get_abstract(session: Session, url: str) -> str: with session.get( url=url, headers={'Accept': 'text/html'}, ) as resp: resp.raise_for_status() html = resp.text doc = BeautifulSoup(markup=html, features='html.parser', parse_only=ARTICLE_STRAINER) return doc.text def articles_from_html(html: str) -> Iterator['Article']: doc = BeautifulSoup(markup=html, features='html.parser', parse_only=SEARCH_STRAINER) for tag in doc.find_all('article'): yield Article.from_tag(tag) def scrape_one( session: Session, article_type: str, year: int, page: int, sort: str = 'PubDate', ) -> Iterator[Article]: params = { 'searchType': 'journalSearch', 'sort': sort, 'type': article_type, 'year': year, 'page': page, } with session.get( url=urljoin(BASE_URL, '/nature/articles'), headers={'Accept': 'text/html'}, params=params, ) as resp: resp.raise_for_status() html = resp.text yield from articles_from_html(html) def scrape( session: Session, article_type: str, year: int, pages: int, sort: str = 'PubDate', ) -> Iterator[Article]: for page in range(1, pages+1): yield from scrape_one(session, article_type, year, page, sort) def test() -> None: with Session() as session: for article in scrape( session=session, article_type='article', year=2020, pages=2, ): print(article) print('\n'.join(wrap( get_abstract(session, article.url) ))) print() if __name__ == '__main__': test()
Output
Article(title='Nociceptive nerves regulate haematopoietic stem cell mobilization', authors=('Xin Gao',), article_type='Article', date=datetime.date(2020, 12, 23), url='https://www.nature.com/articles/s41586-020-03057-y', description='Stimulation of pain-sensing neurons, which can be achieved in mice by the ingestion of capsaicin, promotes the migration of haematopoietic stem cells from the bone marrow into the blood.') Haematopoietic stem cells (HSCs) reside in specialized microenvironments in the bone marrow—often referred to as ‘niches’—that represent complex regulatory milieux influenced by multiple cellular constituents, including nerves1,2. Although sympathetic nerves are known to regulate the HSC niche3,4,5,6, the contribution of nociceptive neurons in the bone marrow remains unclear. Here we show that nociceptive nerves are required for enforced HSC mobilization and that they collaborate with sympathetic nerves to maintain HSCs in the bone marrow. Nociceptor neurons drive granulocyte colony-stimulating factor (G-CSF)-induced HSC mobilization via the secretion of calcitonin gene-related peptide (CGRP). Unlike sympathetic nerves, which regulate HSCs indirectly via the niche3,4,6, CGRP acts directly on HSCs via receptor activity modifying protein 1 (RAMP1) and the calcitonin receptor-like receptor (CALCRL) to promote egress by activating the Gαs/adenylyl cyclase/cAMP pathway. The ingestion of food containing capsaicin—a natural component of chili peppers that can trigger the activation of nociceptive neurons—significantly enhanced HSC mobilization in mice. Targeting the nociceptive nervous system could therefore represent a strategy to improve the yield of HSCs for stem cell-based therapeutic agents. ...
Scraper
, notScrapper
\$\endgroup\$