Python: web scraping pages with js

Question

I'm trying to scrape LinkedIn using Selenium. Here's an example page: https://www.linkedin.com/vsearch/p?firstName=mark

In the rendered HTML I can see that the search results are inside this element:

<div id='results-col'> ... </div>

but when I try to access this tag using BeautifulSoup:

from bs4 import BeautifulSoup
from selenium import webdriver

browser = webdriver.PhantomJS(executable_path=PATH)
browser.get(url)
bs_obj = BeautifulSoup(browser.page_source, "html.parser")
results_col = bs_obj.find("div", {"id": "results-col"})

I get nothing (results_col is None). What am I doing wrong?

Add a sleep after the browser.get so the JS has time to load.
– Tobey
Commented Dec 14, 2016 at 19:32

alecxe · Accepted Answer · 2016-12-14 19:38:16Z

2

Wait for the desired element to be present and only then get the page source:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# ...

browser.get(url)

wait = WebDriverWait(browser, 10)
wait.until(EC.presence_of_element_located((By.ID, "results-col")))

bs_obj = BeautifulSoup(browser.page_source, "html.parser")
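To illustrate why an explicit wait is usually preferable to a fixed sleep, here is a minimal pure-Python sketch of the polling idea. It is not Selenium's implementation: `wait_until` and `FakePage` are hypothetical names invented for this example, and the fake page simply pretends the element shows up on the third check.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)


class FakePage:
    """Stand-in for a page whose element only appears after a few checks."""

    def __init__(self, ready_after):
        self.polls = 0
        self.ready_after = ready_after

    def find_results(self):
        self.polls += 1
        if self.polls >= self.ready_after:
            return "<div id='results-col'>"
        return None


page = FakePage(ready_after=3)
element = wait_until(page.find_results, timeout=2.0, poll_interval=0.01)
```

The wait returns as soon as the condition holds, so a fast page costs almost nothing, while a page that never produces the element fails with a clear timeout instead of silently yielding None later on.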


I tried your code but I get:

Traceback (most recent call last):
  File X, line 142, in <module>
    print(get_link_to_profile(search_url))
  File X, line 121, in get_link_to_profile
    wait.until(EC.presence_of_element_located((By.ID, "results-col")))
  File "C:\Users\sergeyy\AppData\Roaming\Python\Python35\site-packages\selenium\webdriver\support\wait.py", line 80, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:
Screenshot: available via screen

– Bob Sacamano
Commented Dec 14, 2016 at 20:15
@BobSacamano that could mean several things, but the upshot is that this element is not on the page PhantomJS opened. Take a screenshot with the save_screenshot() method after loading the page and see what was actually opened. You might need to start PhantomJS with some arguments to make it work: stackoverflow.com/questions/29463603/….
– alecxe
Commented Dec 14, 2016 at 21:12
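A rough sketch of that debugging step: when the wait times out, dump both a screenshot and the HTML so you can see what the headless browser actually received. `save_debug_snapshot` and `FakeBrowser` are made-up names for illustration. The helper only relies on `save_screenshot(path)` and `page_source`, which Selenium's Python WebDriver provides, and the stand-in object exists purely so the snippet runs without a browser.

```python
import os
import tempfile


def save_debug_snapshot(browser, prefix="debug"):
    """Write <prefix>.png and <prefix>.html for a Selenium-style browser object."""
    png_path = prefix + ".png"
    html_path = prefix + ".html"
    browser.save_screenshot(png_path)
    with open(html_path, "w", encoding="utf-8") as f:
        f.write(browser.page_source)
    return png_path, html_path


# With a real driver you would call it from the timeout handler, e.g.:
#     try:
#         wait.until(EC.presence_of_element_located((By.ID, "results-col")))
#     except TimeoutException:
#         save_debug_snapshot(browser, "linkedin_timeout")


class FakeBrowser:
    """Hypothetical stand-in so the helper can be tried without Selenium."""

    page_source = "<html><body>Sign in</body></html>"

    def save_screenshot(self, path):
        with open(path, "wb") as f:
            f.write(b"not a real PNG")
        return True


with tempfile.TemporaryDirectory() as tmp:
    png_path, html_path = save_debug_snapshot(
        FakeBrowser(), os.path.join(tmp, "linkedin")
    )
    png_exists = os.path.exists(png_path)
    with open(html_path, encoding="utf-8") as f:
        saved_html = f.read()
```

If the saved HTML turns out to be a login or "unsupported browser" page, the missing `results-col` element is explained and no amount of waiting will help.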
@BobSacamano or, you may need to tweak the user agent to pretend to be a different browser: coderwall.com/p/9jgaeq/set-phantomjs-user-agent-string.
– alecxe
Commented Dec 14, 2016 at 21:14
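One way the user agent tweak might look, as a sketch rather than a verified fix for LinkedIn. It assumes an older Selenium release that still ships the PhantomJS driver (it was removed in Selenium 4). The helper names and the user agent string are placeholders chosen for this example; `phantomjs.page.settings.userAgent` is the capability key PhantomJS reads.

```python
def phantomjs_capabilities(user_agent, base=None):
    """Return a copy of the capabilities with the PhantomJS user agent overridden."""
    caps = dict(base or {})
    caps["phantomjs.page.settings.userAgent"] = user_agent
    return caps


def make_browser(executable_path, user_agent):
    """Start PhantomJS with a custom user agent (needs Selenium < 4 and PhantomJS)."""
    # Imported lazily so the helper above can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

    caps = phantomjs_capabilities(user_agent, DesiredCapabilities.PHANTOMJS)
    return webdriver.PhantomJS(
        executable_path=executable_path, desired_capabilities=caps
    )


# Placeholder desktop Chrome user agent; substitute whatever browser you want to mimic.
EXAMPLE_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"
)

base_caps = {"browserName": "phantomjs"}
caps = phantomjs_capabilities(EXAMPLE_UA, base_caps)
```

Copying the base capabilities instead of mutating them keeps Selenium's shared `DesiredCapabilities.PHANTOMJS` dict untouched, so other driver instances are not affected.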
