0

I'm trying to scrape the title from a webpage. Initially I tried BeautifulSoup on its own, but found that the page doesn't load its content without JavaScript. So I'm using some code I found via Google that uses the requests-html library:

from requests_html import HTMLSession
from bs4 import BeautifulSoup

session = HTMLSession()
resp = session.get("https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3210001601")
resp.html.render()
soup = BeautifulSoup(resp.html.html, "lxml")
soup.find_all('h1')

But there's always an error along the line of:

D:\Python\TitleSraping\venv\Scripts\python.exe "D:/Python/TitleSraping/venv/Text Scraping.py"
Traceback (most recent call last):
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 106, in evaluateHandle
    'userGesture': True,
pyppeteer.errors.NetworkError: Protocol error (Runtime.callFunctionOn): Cannot find context with specified id

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:/Python/TitleSraping/venv/Text Scraping.py", line 5, in <module>
    resp.html.render()
  File "D:\Python\TitleSraping\venv\lib\site-packages\requests_html.py", line 598, in render
    content, result, page = self.session.loop.run_until_complete(self._async_render(url=self.url, script=script, sleep=sleep, wait=wait, content=self.html, reload=reload, scrolldown=scrolldown, timeout=timeout, keep_page=keep_page))
  File "D:\Program Files (x86)\Python\lib\asyncio\base_events.py", line 584, in run_until_complete
    return future.result()
  File "D:\Python\TitleSraping\venv\lib\site-packages\requests_html.py", line 531, in _async_render
    content = await page.content()
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\page.py", line 780, in content
    return await frame.content()
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\frame_manager.py", line 379, in content
    '''.strip())
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\frame_manager.py", line 295, in evaluate
    pageFunction, *args, force_expr=force_expr)
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 55, in evaluate
    pageFunction, *args, force_expr=force_expr)
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 109, in evaluateHandle
    _rewriteError(e)
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 238, in _rewriteError
    raise type(error)(msg)
pyppeteer.errors.NetworkError: Execution context was destroyed, most likely because of a navigation.

Process finished with exit code 1

Does anyone know what this means? I'm quite new to this, so I apologize if I'm using any terminology improperly.

    4 Answers

    1

    As Ivan said, passing sleep=1 and keep_page=True to render() does the trick. Here is the full code:

    from requests_html import HTMLSession
    from bs4 import BeautifulSoup

    session = HTMLSession()
    resp = session.get("https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3210001601")
    resp.html.render(sleep=1, keep_page=True)
    soup = BeautifulSoup(resp.html.html, "lxml")
    print(soup.find_all('title'))

    Response:

    [<title> Milled wheat and wheat flour produced</title>] 
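The result above is still a list of tags, with a leading space inside the title. As a rough sketch of getting just the cleaned text, the example below uses the standard library's html.parser instead of BeautifulSoup, so that it runs without extra packages. The names TitleParser and extract_titles are made up for this illustration; with BeautifulSoup, calling get_text(strip=True) on each tag should give a similar result.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text content of every <title> element it sees."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current title.
        if self._in_title:
            self.titles[-1] += data


def extract_titles(html):
    """Return the whitespace-stripped text of each <title> in the HTML string."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return [title.strip() for title in parser.titles]
```

In the script above, this could be called as extract_titles(resp.html.html) once render() has succeeded.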
    • Hmm, I wish this was what I was getting, but I still seem to get the same error.
      – facsasd
      Commented Jun 24, 2019 at 23:55
    • Did you try my code? I ran it in my console (Python 3.7) and it works like a charm. Please paste your current code so we can fix it :)
      – NBlack
      Commented Jun 25, 2019 at 15:30
    • So... I did try your code. Sometimes it works and sometimes it doesn't, and I honestly don't know why anymore.
      – facsasd
      Commented Jun 25, 2019 at 16:15
    • I'll try to replicate it.
      – facsasd
      Commented Jun 25, 2019 at 17:47
    • I ran it 10 times in a row and it worked every time. Try sleep=2 (2 seconds), or up to 5 seconds if your internet connection is slow. From the docs: "sleep – Integer, if provided, of how many long to sleep after initial render."
      – NBlack
      Commented Jun 26, 2019 at 8:56
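Since the comments report that the render works only some of the time and that a longer sleep may help, one option is to retry with increasing sleep values. The following is only a sketch under assumptions: render_with_retry and fetch_titles are hypothetical helper names, the sleep values 1, 2 and 5 are taken from the suggestions in the comments, and catching Exception is deliberately broad (it could probably be narrowed to pyppeteer.errors.NetworkError, the error shown in the traceback).

```python
def render_with_retry(render, sleeps=(1, 2, 5), retry_on=(Exception,)):
    """Call render(sleep=..., keep_page=True), retrying with longer sleeps.

    `render` is any callable accepting `sleep` and `keep_page` keyword
    arguments, for example `resp.html.render` from requests-html.
    """
    if not sleeps:
        raise ValueError("sleeps must contain at least one value")
    last_error = None
    for sleep in sleeps:
        try:
            return render(sleep=sleep, keep_page=True)
        except retry_on as error:
            # Remember the failure and try again with the next, longer sleep.
            last_error = error
    raise last_error


def fetch_titles(url):
    """Render the page with retries and return its <title> tags."""
    # Imported lazily so render_with_retry can be used without these packages.
    from requests_html import HTMLSession
    from bs4 import BeautifulSoup

    session = HTMLSession()
    resp = session.get(url)
    render_with_retry(resp.html.render)
    soup = BeautifulSoup(resp.html.html, "lxml")
    return soup.find_all("title")
```

Whether retrying actually helps depends on why the page navigates during rendering; if the failure is deterministic on a given machine, a longer sleep alone is unlikely to fix it.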
    0

    This looks like a bug in the underlying library, pyppeteer (the Python port of Puppeteer), triggered by some JavaScript on the page. Here is one workaround from https://github.com/kennethreitz/requests-html/issues/251; maybe it will help.

    resp.html.render(sleep=1, keep_page=True)

    • I tried it out, but I still seem to be getting a similar error.
      – facsasd
      Commented Jun 24, 2019 at 23:57
    • You might try increasing the sleep parameter. It can help if the page is heavy or the machine is slow.
      Commented Jun 25, 2019 at 19:12
    • Note for my future self and for other people: I tried specifying only keep_page=True, and that was enough to do the trick.
      Commented Jan 6, 2022 at 13:03
    0

    You need to execute the page's JavaScript; otherwise the HTML content won't load. You can use Selenium for that.

    • Hmm, I'm trying to follow along with this tutorial: theautomatic.net/2019/01/19/… I'm not sure how it works there.
      – facsasd
      Commented Jun 25, 2019 at 0:00
    • The problem is specific to the page you want to scrape, because it has protection against scrapers.
      Commented Jun 25, 2019 at 21:17
    0

    Try Selenium.

    Selenium is a library that lets programs interact with web pages by taking control of a browser.

    Here is an example in an answer to someone else's question.
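For this particular question, a minimal Selenium sketch might look like the code below. It assumes the selenium package and a Chrome browser are installed (depending on the Selenium version, a matching chromedriver may also need to be installed separately). The names get_page_title, driver_factory and _make_headless_chrome are hypothetical and exist only for this illustration; the driver_factory parameter is there so the function can be tried with a stand-in driver.

```python
def _make_headless_chrome():
    """Create a headless Chrome driver (assumes selenium and Chrome are installed)."""
    # Imported lazily so get_page_title can be exercised with a stand-in driver.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    return webdriver.Chrome(options=options)


def get_page_title(url, driver_factory=None):
    """Load `url` in a browser and return the page title with whitespace stripped."""
    if driver_factory is None:
        driver_factory = _make_headless_chrome
    driver = driver_factory()
    try:
        driver.get(url)
        # driver.title reflects the title of the currently loaded page.
        return driver.title.strip()
    finally:
        # Always close the browser, even if loading the page failed.
        driver.quit()
```

A call such as get_page_title("https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3210001601") would then start a real browser. If the page sets its title late via JavaScript, an explicit wait may be needed before reading it.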
