I want to scrape the number of participants from the following news page. The URL is http://news.sina.com.cn/c/2013-07-11/175827642839.shtml and I want to get the number 820. It is generated by JavaScript. How can I get that number in a simple way?
1 Answer
You could analyze the JavaScript code and do the same thing in Python. Or you can use Selenium in Python.
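As a rough sketch of the first option: counters like this are often loaded from a separate URL that returns JSON wrapped in a JavaScript callback (JSONP). Assuming you have found such a URL in the page's scripts, the value can be parsed without a browser. The payload shape, the callback name `cb`, and the field names below are made up for illustration; the real response from this site will differ.

```python
import json
import re


def parse_jsonp(text):
    """Strip a JSONP wrapper like `callback({...});` and return the parsed JSON."""
    match = re.search(r"^[^(]*\((.*)\)\s*;?\s*$", text.strip(), re.DOTALL)
    if match is None:
        raise ValueError("not a JSONP response")
    return json.loads(match.group(1))


# Hypothetical response body; the field names are assumptions, not the real API.
sample = 'cb({"result": {"count": {"total": 820}}});'
data = parse_jsonp(sample)
participants = data["result"]["count"]["total"]
print(participants)
```

In a real script, `sample` would be the body downloaded from the URL you found, and the key path would follow whatever structure that response actually has.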
edit:
Here is the example from the Selenium page, changed to do what you need.
It opens a browser (Firefox), waits 5 seconds (for the page to load) and gets the text.
#!/usr/bin/python
import time

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

browser = webdriver.Firefox()  # Start a local Firefox session
browser.get("http://news.sina.com.cn/c/2013-07-11/175827642839.shtml")  # Load the page
time.sleep(5)  # Give the page's JavaScript time to run
try:
    # Find the element that holds the number
    element = browser.find_element(By.XPATH, "//span[contains(@class,'f_red')]")
    print(element.text)  # Print the element's text
except NoSuchElementException:
    print("can't find f_red")
finally:
    browser.quit()  # Close the browser even if the lookup fails
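A fixed `time.sleep(5)` either waits longer than necessary or not long enough. One alternative is to poll until the value shows up or a timeout expires; Selenium provides `WebDriverWait` for this purpose. The pure-Python sketch below only illustrates the polling idea. `wait_for`, its parameters, and `fake_probe` are hypothetical names, not part of Selenium.

```python
import time


def wait_for(probe, timeout=10.0, interval=0.5):
    """Call probe() repeatedly until it returns a non-None value or timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        value = probe()
        if value is not None:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError("value did not appear within %.1f s" % timeout)
        time.sleep(interval)


# Stand-in for a page whose counter only appears on the third check.
attempts = {"n": 0}


def fake_probe():
    attempts["n"] += 1
    return "820" if attempts["n"] >= 3 else None


print(wait_for(fake_probe, timeout=2.0, interval=0.01))
```

With Selenium, `probe` could be a small function that looks up the element and returns its text, or returns `None` when `NoSuchElementException` is raised.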
- I added an example to my answer. It uses Firefox to get what you need. – furas, commented Jul 14, 2013 at 9:09
- Yesterday the page showed 820. Today it shows 823, so today my example gives 823 (`print element.text`). Unless I'm looking in the wrong place. – furas, commented Jul 14, 2013 at 12:51
- Yeah, the code is excellent, but it opens the Firefox browser. If I have millions of web pages to scrape, it will not be efficient. Do you have any tips for that? – mjc, commented Jul 15, 2013 at 1:29
- I heard that there is `webdriver.i_dont_remember_name` which doesn't open any browser, but it still needs `time.sleep` to wait for the JavaScript. For scraping I use urllib + pyQuery, but that works only with HTML, so I get the JavaScript, analyze what it is doing step by step, and look for the source of the information. If I find some URL (mostly in ajax calls) I can try to use it directly in Python. This way the script can work fast enough to get millions of pages (you can use `threads` to get more pages at the same time). – furas, commented Jul 15, 2013 at 2:00
- But sometimes the script works too fast and the server knows that it has to be a script or bot. Servers don't like bots: bots don't click advertisements and servers don't earn money :) – furas, commented Jul 15, 2013 at 2:01
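The approach from the comments above (fetch plain URLs directly, use threads, but do not hit the server too fast) could look roughly like this. It is a sketch under assumptions: `fetch_page`, `fetch_all`, the worker count, and the delay are illustrative choices, and the URLs you pass in would be whatever data URLs you found while analyzing the page's JavaScript.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch_page(url, timeout=10):
    """Download one URL and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetch=fetch_page, max_workers=4, delay=0.5):
    """Fetch URLs in a small thread pool, pausing after each request.

    Returns a dict mapping url -> result, or url -> the exception raised.
    """
    def worker(url):
        try:
            return url, fetch(url)
        except Exception as exc:  # keep going if one page fails
            return url, exc
        finally:
            time.sleep(delay)  # throttle each worker to avoid hammering the server

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(worker, urls))


if __name__ == "__main__":
    # Offline demo with a fake fetch function instead of real HTTP requests.
    pages = fetch_all(["a", "b", "c"], fetch=lambda u: u.upper(), delay=0.01)
    print(pages)
```

Keeping `max_workers` small and `delay` above zero trades speed for a lower chance of being treated as a bot; suitable values depend on the site.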