
I am 12 days into Python and web scraping, and I've managed to write my first ever automation script. Please review my code and point out any blunders.

What do I want to achieve?

I want to scrape all chapters of each novel in each category and post them to a WordPress blog as a test. Please point out anything I missed that is mandatory for running this script against a WordPress blog; a sketch of how I intend to do the posting follows.
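For the posting step, this is roughly what I have in mind: a minimal sketch using the standard WordPress REST API (POST to /wp-json/wp/v2/posts with Basic auth). The blog URL, username, and application password below are placeholders, not working credentials:

import requests

# Placeholders -- replace with your blog's URL and an application
# password (WordPress admin: Users -> Profile -> Application Passwords).
WP_URL = "https://example.com/wp-json/wp/v2/posts"
WP_USER = "admin"
WP_APP_PASSWORD = "xxxx xxxx xxxx xxxx"

def post_chapter(title, content):
    """Create one WordPress post for a scraped chapter."""
    payload = {"title": title, "content": content, "status": "publish"}
    resp = requests.post(WP_URL, json=payload, auth=(WP_USER, WP_APP_PASSWORD))
    resp.raise_for_status()  # fail loudly on 4xx/5xx
    return resp.json()["id"]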

from requests import get
from bs4 import BeautifulSoup
import re

site = "https://readlightnovel.org/"

r = get(site, headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)"})
soup = BeautifulSoup(r.text, "lxml")

category = soup.findAll(class_="search-by-genre")

# Getting all categories
categories = []
for link in soup.findAll(href=re.compile(r'/category/\w+$')):
    print("Category:", link.text)
    category_link = link['href']
    categories.append(category_link)

# Getting all Novel Headers
for category in categories:
    r = get(category_link, headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)"})
    soup = BeautifulSoup(r.text, "lxml")
    Novels_header = soup.findAll(class_="top-novel-header")

    # Getting Novels' Title and Link
    for Novel_names in Novels_header:
        print("Novel:", Novel_names.text.strip())
        Novel_link = Novel_names.find('a')['href']

        # Getting Novel's Info
        r = get(Novel_link, headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)"})
        soup = BeautifulSoup(r.text, "lxml")
        Novel_divs = soup.findAll(class_="chapter-chs")

        # Novel Chapters
        for articles in Novel_divs:
            article_ch = articles.findAll("a")
            for chapters in article_ch:
                ch = chapters["href"]

                # Getting article
                r = get(ch, headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)"})
                soup = BeautifulSoup(r.content, "lxml")
                title = soup.find(class_="block-title")
                print(title.text.strip())
                full_article = soup.find("div", {"class": "desc"})

                # remove ads inside the text:
                for ads in full_article.select('center, small, a'):
                    ads.extract()
                print(full_article.get_text(strip=True, separator='\n'))

    2 Answers


    Naming

    Variable names should be snake_case and should describe what they contain. I would also use req instead of r; the extra two characters aren't going to cause any heartache.
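    For example, renaming one of the variables from your script (behaviour unchanged):

    Novels_header = soup.findAll(class_="top-novel-header")  # original
    novels_header = soup.findAll(class_="top-novel-header")  # snake_case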

    Constants

    You have the same headers dict in four different places. I would instead define it once at the top of the file in UPPER_CASE, then just use that wherever you need headers. I would do the same for site.
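    Something like this at the top of the file (SITE is the URL your script targets):

    HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)"}
    SITE = "https://readlightnovel.org/"

    Then every request shrinks to req = get(SITE, headers=HEADERS) or req = get(ch, headers=HEADERS).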

    List Comprehension

    I would go about collecting categories in this way:

    categories = [link['href'] for link in soup.findAll(href=re.compile(r'/category/\w+$'))] 

    It's shorter and uses one of Python's most idiomatic features, the list comprehension. Of course, if you want to print out each one, then add this just after:

    for category in categories:
        print(category)

    Also, note that your category loop requests category_link, which is simply the last href left over from the previous loop, on every iteration; you most likely meant to request category instead. To keep the behaviour identical, the code below assigns category_link = categories[-1] just after the list comprehension.

    Save your assignments

    Instead of assigning the result of soup.find to a variable, then using it in a loop, simply put that soup.find in the loop. Take a look:

    for articles in soup.findAll(class_="chapter-chs"):
        for chapters in articles.findAll("a"):
            ....


    As a result of the above changes, your code would look something like this:

    from requests import get
    from bs4 import BeautifulSoup
    import re

    HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)"}
    SITE = "https://readlightnovel.org/"

    req = get(SITE, headers=HEADERS)
    soup = BeautifulSoup(req.text, "lxml")

    category = soup.findAll(class_="search-by-genre")
    categories = [link['href'] for link in soup.findAll(href=re.compile(r'/category/\w+$'))]
    category_link = categories[-1]

    # Getting all Novel Headers
    for category in categories:
        req = get(category_link, headers=HEADERS)
        soup = BeautifulSoup(req.text, "lxml")
        novels_header = soup.findAll(class_="top-novel-header")

        # Getting Novels' Title and Link
        for novel_names in novels_header:
            print("Novel:", novel_names.text.strip())
            novel_link = novel_names.find('a')['href']

            # Getting Novel's Info
            req = get(novel_link, headers=HEADERS)
            soup = BeautifulSoup(req.text, "lxml")

            # Novel Chapters
            for articles in soup.findAll(class_="chapter-chs"):
                for chapters in articles.findAll("a"):
                    ch = chapters["href"]

                    # Getting article
                    req = get(ch, headers=HEADERS)
                    soup = BeautifulSoup(req.content, "lxml")
                    title = soup.find(class_="block-title")
                    print(title.text.strip())
                    full_article = soup.find("div", {"class": "desc"})

                    # remove ads inside the text:
                    for ads in full_article.select('center, small, a'):
                        ads.extract()
                    print(full_article.get_text(strip=True, separator='\n'))
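    One further refinement, not applied above: a requests.Session lets you set the headers once and reuses the underlying connection across all of these requests to the same host. A sketch:

    from requests import Session

    session = Session()
    session.headers.update(HEADERS)  # sent with every request from now on

    req = session.get(SITE)  # no headers= argument needed any more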

      I think you can even get rid of the regular expressions. I prefer to use the BS4 functions.

      Instead of:

      categories = [link['href'] for link in soup.findAll(href=re.compile(r'/category/\w+$'))] 

      This statement is nearly equivalent using a CSS selector (note that your regex also anchored the match at the end of the URL with \w+$, while *= matches the substring anywhere):

      categories = [link['href'] for link in soup.select('a[href*="/category/"]')]

      That means: fetch all a tags whose href attribute contains the text /category/ (quoting the value inside the selector avoids having to escape the slashes).
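      If you want to stay within BS4's own filter arguments instead of CSS, findAll also accepts a plain function for href; a sketch of the same substring match (the h and ... guard handles tags with no href):

      categories = [link['href']
                    for link in soup.findAll("a", href=lambda h: h and "/category/" in h)]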

