I want to scrape the number of participants from the following news page: http://news.sina.com.cn/c/2013-07-11/175827642839.shtml. The number I want is 820, and it is generated by JavaScript. Is there a simple way to get that number?

1 Answer

You could analyze the JavaScript code and do the same thing in Python. Or you can use Selenium in Python.

Edit:

Here is the example from the Selenium page, changed to do what you need.

It opens a browser (Firefox), waits 5 seconds for the page to load, and gets the text.

    #!/usr/bin/python
    import time

    from selenium import webdriver
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.common.by import By

    browser = webdriver.Firefox()  # start a local Firefox session
    browser.get("http://news.sina.com.cn/c/2013-07-11/175827642839.shtml")  # load the page
    time.sleep(5)  # give the page's JavaScript time to run

    try:
        # find the first <span> whose class contains "f_red"
        element = browser.find_element(By.XPATH, "//span[contains(@class,'f_red')]")
        print(element.text)  # print the element's text
    except NoSuchElementException:
        assert 0, "can't find f_red"
    finally:
        browser.close()  # close the window even if the lookup failed
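
If the visible browser window or the fixed five-second sleep is a problem, one possible variation is to run Firefox headless and poll until the element has text. This is only a sketch: the function name `get_participant_count` and the `timeout` parameter are made up for illustration, it assumes a recent Selenium release with Firefox and its driver installed, and it still starts a full browser per call, so it is not fast.

```python
def get_participant_count(url, timeout=10):
    """Return the text of the first <span> whose class contains 'f_red'."""
    # Imported here so this file still loads where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.support.ui import WebDriverWait

    xpath = "//span[contains(@class,'f_red')]"

    options = Options()
    options.add_argument("-headless")  # run Firefox without opening a window
    browser = webdriver.Firefox(options=options)
    try:
        browser.get(url)
        # Poll until the element exists and has non-empty text, instead of
        # sleeping a fixed 5 seconds. Raises TimeoutException after `timeout`
        # seconds if that never happens.
        return WebDriverWait(browser, timeout).until(
            lambda driver: driver.find_element(By.XPATH, xpath).text.strip()
        )
    finally:
        browser.quit()  # always shut the browser down
```

Calling `get_participant_count("http://news.sina.com.cn/c/2013-07-11/175827642839.shtml")` should then return the same text the script above prints, provided the page still uses the `f_red` class.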
  • I added an example to my answer. It uses Firefox to get what you need.
    – furas
    Commented Jul 14, 2013 at 9:09
  • Yesterday the page showed 820. Today it shows 823, so today my example gives 823 (print element.text). Or I'm looking in the wrong place.
    – furas
    Commented Jul 14, 2013 at 12:51
  • Yeah, the code is excellent, but it opens the Firefox browser. If I have millions of web pages to scrape, it will not be efficient. Do you have any tips for that?
    – mjc
    Commented Jul 15, 2013 at 1:29
  • I heard that there is a webdriver.i_dont_remember_name which doesn't open any browser, but it still needs time.sleep to wait for the JavaScript. For scraping I use urllib + pyQuery, but that works only with HTML, so I get the JavaScript, analyze what it does step by step, and look for the source of the information. If I find some URL (mostly in an AJAX call) I can try to use it directly in Python. This way the script can work fast enough to get millions of pages (you can use threads to get more pages at the same time).
    – furas
    Commented Jul 15, 2013 at 2:00
  • But sometimes the script works too fast and the server knows that it has to be a script or a bot. Servers don't like bots - bots don't click advertisements and servers don't earn money :)
    – furas
    Commented Jul 15, 2013 at 2:01
