Data scraping from a webpage with JavaScript using Python

Question

I'm trying to scrape the title off of a webpage. Initially, I tried using BeautifulSoup but found out that the page itself wouldn't load without JavaScript. So I'm using some code that I found via Google that uses the requests-html library:

from requests_html import HTMLSession
from bs4 import BeautifulSoup
session = HTMLSession()
resp = session.get("https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3210001601")
resp.html.render()
soup = BeautifulSoup(resp.html.html, "lxml")
soup.find_all('h1')

But I always get an error along the lines of:

D:\Python\TitleSraping\venv\Scripts\python.exe "D:/Python/TitleSraping/venv/Text Scraping.py"
Traceback (most recent call last):
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 106, in evaluateHandle
    'userGesture': True,
pyppeteer.errors.NetworkError: Protocol error (Runtime.callFunctionOn): Cannot find context with specified id

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:/Python/TitleSraping/venv/Text Scraping.py", line 5, in <module>
    resp.html.render()
  File "D:\Python\TitleSraping\venv\lib\site-packages\requests_html.py", line 598, in render
    content, result, page = self.session.loop.run_until_complete(self._async_render(url=self.url, script=script, sleep=sleep, wait=wait, content=self.html, reload=reload, scrolldown=scrolldown, timeout=timeout, keep_page=keep_page))
  File "D:\Program Files (x86)\Python\lib\asyncio\base_events.py", line 584, in run_until_complete
    return future.result()
  File "D:\Python\TitleSraping\venv\lib\site-packages\requests_html.py", line 531, in _async_render
    content = await page.content()
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\page.py", line 780, in content
    return await frame.content()
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\frame_manager.py", line 379, in content
    '''.strip())
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\frame_manager.py", line 295, in evaluate
    pageFunction, *args, force_expr=force_expr)
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 55, in evaluate
    pageFunction, *args, force_expr=force_expr)
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 109, in evaluateHandle
    _rewriteError(e)
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 238, in _rewriteError
    raise type(error)(msg)
pyppeteer.errors.NetworkError: Execution context was destroyed, most likely because of a navigation.

Process finished with exit code 1

Does anyone know what this means? I'm quite new to this, so I apologize if I'm using any terminology improperly.

NBlack · Accepted Answer · 2019-06-24 23:47:57Z

As Ivan said, passing sleep=1 and keep_page=True to render() does the trick. Here is the full code:

from requests_html import HTMLSession
from bs4 import BeautifulSoup
session = HTMLSession()
resp = session.get("https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3210001601")
resp.html.render(sleep=1, keep_page=True)
soup = BeautifulSoup(resp.html.html, "lxml")
print(soup.find_all('title'))

Response:

[<title> Milled wheat and wheat flour produced</title>]

hmm, I wish this was what I was getting, but I still seem to get the same error – facsasd, Commented Jun 24, 2019 at 23:55
Did you try my code? I ran it in my console (Python 3.7) and it works like a charm. Please paste your current code so we can fix it :) – NBlack, Commented Jun 25, 2019 at 15:30
So... I did try your code... sometimes it works, sometimes it doesn't, and I honestly don't know why anymore – facsasd, Commented Jun 25, 2019 at 16:15
I ran it 10 times in a row and it works. Try sleep=2 (2 seconds), or up to 5 seconds if your internet is slow. From the docs: sleep – Integer, if provided, of how many long to sleep after initial render. – NBlack, Commented Jun 26, 2019 at 8:56
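Since the render seems to fail intermittently and the suggestion above is to raise sleep, one option is to retry with progressively longer sleeps. The following is only a sketch: render_with_retries, its sleeps, exceptions and pause parameters are hypothetical names for illustration and are not part of requests-html. It assumes the callable you pass in accepts sleep and keep_page keyword arguments, as resp.html.render does in the code above.

```python
import time


def render_with_retries(render, sleeps=(1, 2, 5), exceptions=(Exception,), pause=0.5):
    """Call render(sleep=..., keep_page=True), retrying with longer sleeps.

    `render` is assumed to behave like `resp.html.render` (hypothetical helper,
    not part of requests-html).
    """
    if not sleeps:
        raise ValueError("sleeps must contain at least one value")
    last_error = None
    for sleep in sleeps:
        try:
            # Same workaround as in the answer, with a growing sleep value.
            return render(sleep=sleep, keep_page=True)
        except exceptions as error:
            last_error = error
            time.sleep(pause)  # short pause before the next attempt
    # Every attempt failed: re-raise the most recent error.
    raise last_error
```

With the session code above, this would be called as render_with_retries(resp.html.render) before reading resp.html.html. To retry only on the error from the traceback, pyppeteer.errors.NetworkError can be passed as exceptions=(NetworkError,).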

Ivan Sveshnikov · Accepted Answer · 2019-06-24 23:39:20Z


This looks like a bug in the underlying pyppeteer library, triggered by some of the page's JavaScript. Here's a workaround from https://github.com/kennethreitz/requests-html/issues/251 that may help:

resp.html.render(sleep=1, keep_page=True)


I tried it out, I still seem to be getting a similar error
– facsasd
Commented Jun 24, 2019 at 23:57
You might try increasing the sleep parameter. If your page is heavy and your machine is slow, it can help.
– Ivan Sveshnikov
Commented Jun 25, 2019 at 19:12
Note for my future self, or other people: I tried specifying only keep_page=True, and it was enough to do the trick.
– Nuclear241
Commented Jan 6, 2022 at 13:03


Andrés Aviña · Accepted Answer · 2019-06-24 23:45:55Z


The page's HTML is only populated once its JavaScript runs, so you need a tool that executes the JavaScript. You can use Selenium for that.


hmm, I'm trying to follow along with this tutorial theautomatic.net/2019/01/19/… not sure how it works there
– facsasd
Commented Jun 25, 2019 at 0:00
The problem is specifically with the page you want to scrape, because it has security against scrapers.
– Andrés Aviña
Commented Jun 25, 2019 at 21:17


lowtex · Accepted Answer · 2019-06-24 23:45:55Z

Try Selenium.

Selenium is a library that lets programs interact with web pages by taking control of a browser.

Here is an example in an answer to someone else's question.
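As a rough sketch of what the Selenium approach could look like for reading a page title: this assumes Selenium 4 and a local Chrome installation, neither of which is mentioned in the thread. The function names title_is_loaded and get_title_with_selenium and the timeout parameter are made up for illustration.

```python
def title_is_loaded(driver):
    """Return True once the browser reports a non-empty page title."""
    return bool(driver.title and driver.title.strip())


def get_title_with_selenium(url, timeout=10):
    """Open url in headless Chrome and return its title after JavaScript has run.

    Assumes Selenium 4 and Chrome are installed (hypothetical helper).
    """
    # Imported here so title_is_loaded stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")  # run Chrome without a visible window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Poll until the title is non-empty; raises TimeoutException otherwise.
        WebDriverWait(driver, timeout).until(title_is_loaded)
        return driver.title.strip()
    finally:
        driver.quit()  # always close the browser, even on errors
```

It would be called with the URL from the question, for example print(get_title_with_selenium("https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3210001601")). Because a real browser loads the page, the navigation that breaks render() in requests-html is handled by the browser itself, at the cost of a heavier dependency.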

