I've written a Python script to grab the different quote categories from a web page. I used "grequests" in my scraper so it could issue asynchronous HTTP requests and finish quickly. The scraper runs flawlessly and collects the data it should. However, when it comes to performance, I'm not sure it is optimal. Any suggestions to make it better would be highly appreciated.
import grequests
from lxml import html

main_link = "http://quotes.toscrape.com/"

def toscrape_scraper(item_link):
    storage = [item_link]  # Depositing link as a list
    response = (grequests.get(req) for req in storage)
    for req in grequests.map(response):  # Sending requests
        tree = html.fromstring(req.text)
        for titles in tree.cssselect("span.tag-item a.tag"):
            grabbing_docs(main_link + titles.attrib['href'])

def grabbing_docs(base_link):
    vault = [base_link]  # Storing links as a list
    res = (grequests.get(req) for req in vault)
    for hreq in grequests.map(res):  # Sending requests
        root = html.fromstring(hreq.text)
        for soups in root.cssselect("div.quote"):
            quote = soups.cssselect("span.text")[0].text
            author = soups.cssselect("small.author")[0].text
            print(quote, author)
        next_page = root.cssselect("li.next a")[0].attrib['href'] if root.cssselect("li.next a") else ""
        if next_page:
            page_link = main_link + next_page
            grabbing_docs(page_link)  # Reusing the newly collected paginated links

toscrape_scraper(main_link)
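One thing worth noting about performance: both functions wrap a single URL in a one-element list before calling `grequests.map`, so only one request is ever in flight at a time and the "asynchronous" machinery buys nothing. To actually benefit, collect every URL first and map the whole batch in one call. A minimal sketch of that restructuring is below; `build_tag_urls` and `fetch_all` are hypothetical helper names I've introduced for illustration, not part of the original script:

```python
from urllib.parse import urljoin

MAIN_LINK = "http://quotes.toscrape.com/"

def build_tag_urls(hrefs, base=MAIN_LINK):
    """Turn a batch of hrefs into absolute URLs.

    urljoin handles both relative hrefs ("/tag/love/") and already-absolute
    ones correctly, unlike the plain string concatenation in the original
    (main_link + href can produce a double slash).
    """
    return [urljoin(base, h) for h in hrefs]

def fetch_all(urls):
    """Fetch a whole batch of URLs concurrently.

    A single grequests.map call over the full batch is what actually gives
    concurrency; mapping one-element lists fetches serially.
    """
    import grequests  # imported lazily so the pure helper above stays testable
    return grequests.map(grequests.get(u) for u in urls)
```

With this shape, `toscrape_scraper` would first gather every tag href from the front page, then hand the whole list of absolute URLs to `fetch_all` at once instead of calling `grabbing_docs` per link.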
Comment: Could `grabbing_docs(link)` be called more than once with the same link in the current implementation? If that were to happen, you could enter a cycle.
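The cycle concern raised here is usually handled with a visited set: record every URL before fetching it and skip any URL seen before. A minimal, network-free sketch of the idea; `get_next_urls` is a hypothetical callable standing in for the real fetch-and-parse step (grequests plus lxml in the scraper):

```python
from collections import deque

def crawl(start_url, get_next_urls):
    """Breadth-first traversal that never visits the same URL twice.

    get_next_urls(url) is assumed to return the list of links found on
    that page; injecting it keeps the traversal logic testable offline.
    """
    visited = set()
    order = []                      # URLs in the order they were fetched
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in visited:
            continue                # already fetched: this breaks any cycle
        visited.add(url)
        order.append(url)
        queue.extend(get_next_urls(url))
    return order
```

Even if two pages link to each other, each URL is fetched exactly once, so the recursion-induced cycle the comment warns about cannot occur.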