
I'm trying to understand how to work with aiohttp and asyncio. The code below retrieves all websites in urls and prints out the "size" of each response.

  • Is the error handling within the fetch method correct?
  • Is it possible to remove the result of a specific url from results in case of an exception - making return (url, '') unnecessary?
  • Is there a better way than ssl=False to deal with a potential ssl.SSLCertVerificationError? (One possible approach is sketched after the code below.)
  • Any additional advice on how I can improve my code quality is highly appreciated.
    import asyncio
    import aiohttp


    async def fetch(session, url):
        try:
            async with session.get(url, ssl=False) as response:
                return url, await response.text()
        except aiohttp.client_exceptions.ClientConnectorError as e:
            print(e)
            return (url, '')


    async def main():
        tasks = []
        urls = [
            'http://www.python.org',
            'http://www.jython.org',
            'http://www.pypy.org'
        ]
        async with aiohttp.ClientSession() as session:
            while urls:
                tasks.append(fetch(session, urls.pop()))
            results = await asyncio.gather(*tasks)
            [print(f'{url}: {len(result)}') for url, result in results]


    if __name__ == '__main__':
        loop = asyncio.get_event_loop()
        loop.run_until_complete(main())
        loop.close()
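On the ssl=False question: rather than disabling verification entirely, one option is to pass an explicit SSLContext built from a trusted CA bundle. A minimal sketch, assuming the third-party certifi package is installed (any CA bundle path would work in its place):

    import ssl

    import aiohttp
    import certifi

    # Build the context once and reuse it; certifi ships Mozilla's CA
    # bundle, so verification can succeed without being turned off.
    SSL_CONTEXT = ssl.create_default_context(cafile=certifi.where())


    async def fetch(session, url):
        async with session.get(url, ssl=SSL_CONTEXT) as response:
            return url, await response.text()

This keeps certificate checking enabled while working around environments whose system CA store is missing or outdated.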

    Update

    • Is there a way to add tasks to the list from within the "loop", e.g. to add new URLs while scraping a website and finding new subdomains to scrape? (A possible approach is sketched below.)
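One common pattern for this is to replace the fixed task list with an asyncio.Queue that a small pool of workers consumes, so newly discovered URLs can be enqueued mid-crawl. A minimal sketch; the extract_links helper is hypothetical and only marks where discovered URLs would be enqueued:

    import asyncio
    import aiohttp


    async def worker(session, queue, results):
        while True:
            url = await queue.get()
            try:
                async with session.get(url, ssl=False) as response:
                    page = await response.text()
                results[url] = len(page)
                # Hypothetical helper: enqueue URLs found on this page.
                # for link in extract_links(page):
                #     queue.put_nowait(link)
            except aiohttp.ClientError as e:
                print(e)
            finally:
                queue.task_done()


    async def main():
        queue = asyncio.Queue()
        for url in ['http://www.python.org', 'http://www.jython.org']:
            queue.put_nowait(url)
        results = {}
        async with aiohttp.ClientSession() as session:
            workers = [asyncio.ensure_future(worker(session, queue, results))
                       for _ in range(3)]
            await queue.join()   # done once every enqueued URL is processed
            for w in workers:    # workers loop forever, so stop them here
                w.cancel()
            await asyncio.gather(*workers, return_exceptions=True)
        for url, size in results.items():
            print(f'{url}: {size}')

queue.join() only returns once task_done() has been called for every enqueued item, so "the crawl is finished" stays well defined even while workers keep adding URLs. It runs with the same loop.run_until_complete(main()) entry point as above.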

      1 Answer

      tasks = []
      while urls:
          tasks.append(fetch(session, urls.pop()))

      can be largely simplified to

      tasks = [fetch(session, url) for url in urls] 
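      This version also leaves urls untouched, whereas the while/urls.pop() loop empties the list and creates the tasks in reverse order.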

      Is it possible to remove the result of a specific url from results in case of an exception - making return (url, '') unnecessary?

      Yes, somewhat. asyncio.gather accepts a return_exceptions parameter. Set it to True so that a single exception does not fail the whole gather call; the exceptions come back as part of the results, and you filter them out afterwards:

      import asyncio
      import aiohttp


      async def fetch(session, url):
          async with session.get(url, ssl=False) as response:
              return await response.text()


      async def main():
          urls = [
              'http://www.python.org',
              'http://www.jython.org',
              'http://www.pypy.org'
          ]
          async with aiohttp.ClientSession() as session:
              tasks = [fetch(session, url) for url in urls]
              # return_exceptions=True delivers failures as exception
              # objects in the results instead of raising here
              results = await asyncio.gather(*tasks, return_exceptions=True)
              for url, result in zip(urls, results):
                  if not isinstance(result, Exception):
                      print(f'{url}: {len(result)}')
                  else:
                      print(f'{url} FAILED')


      if __name__ == '__main__':
          loop = asyncio.get_event_loop()
          loop.run_until_complete(main())
          loop.close()
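      As a side note on code quality: since Python 3.7, the manual loop handling at the bottom can be replaced by asyncio.run, which creates, runs, and closes the event loop for you:

      if __name__ == '__main__':
          asyncio.run(main())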
      • Not sure if your code example is meant to catch exceptions or if you just wanted to point out how to deal with exceptions in the results list. I only manage to get working code if I put try... except within the fetch method. If I return e in the except block, your code works. – Commented Jul 24, 2018 at 19:46
