3

I am a complete newbie to web scraping. I have a small project that involves scraping some data from COCA, but I don't know where to start. The page appears to be built with JavaScript, and I wonder whether there is a package that lets me interact with it.

Here are the tasks I want my program to perform:

  1. Log in with an account;
  2. Choose a tab (e.g. search, chart, etc.; please see COCA);
  3. Type the word to search for into the text box;
  4. Scrape the search results.

Any suggestions would be greatly appreciated.

PS: Ideally everything should run in the background (without opening a browser window).

6
  • There is also Selenium, which you can use to execute JS on websites.
    – Marcin
    Commented Nov 10, 2016 at 0:56
  • or phantomjs.org
    Commented Nov 10, 2016 at 0:58
  • 1
    @Marcin Thanks for the reply. Yes, I looked into Selenium, but I don't want my program to open the browser. Ideally everything runs in the background. Any suggestions?
    – Bayesric
    Commented Nov 10, 2016 at 1:05
  • 1
    Selenium can use PhantomJS as a headless browser (that is, one that runs without displaying a window). It can run Firefox/Chrome headless too, but that may need some work.
    – furas
    Commented Nov 10, 2016 at 1:56
  • 1
    Or you can "analyze" the data sent between the browser and the server (using DevTools in Chrome/Firefox) and then use that information to skip page rendering and running JavaScript. This needs more work and some knowledge of HTTP.
    – furas
    Commented Nov 10, 2016 at 2:01

2 Answers

1
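    from pyvirtualdisplay import Display
    from selenium import webdriver

    # Start an invisible virtual display (requires Xvfb)
    display = Display(visible=0, size=(800, 600))
    display.start()

    # Firefox renders into the virtual display, so no window appears
    browser = webdriver.Firefox()
    browser.get('http://www.google.com')
    print(browser.title)

    # Clean up the browser and the display
    browser.quit()
    display.stop()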

pyvirtualdisplay in headless mode, Display(visible=0), requires Xvfb (the X virtual framebuffer), which is available on Linux. Read more here on Xvfb usage.
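Where Xvfb is not available, one possible alternative is the browser's own headless mode, as mentioned in the comments on the question. The following is only a minimal sketch, assuming Selenium 4 and Firefox (with its driver) are installed. The function names and the `HEADLESS_FLAG` constant are made up for illustration, and nothing here is specific to COCA.

```python
# Minimal sketch: Firefox in its built-in headless mode, no Xvfb needed.
# Assumption: Selenium 4 and Firefox (with geckodriver) are installed.

HEADLESS_FLAG = "-headless"  # Firefox command-line switch: run without a window


def open_headless_firefox():
    """Start Firefox without a visible window and return the driver."""
    # Imported here so this module still loads where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument(HEADLESS_FLAG)
    return webdriver.Firefox(options=options)


def page_title(url):
    """Load a page in the headless browser and return its title."""
    browser = open_headless_firefox()
    try:
        browser.get(url)
        return browser.title
    finally:
        browser.quit()  # always release the browser process


if __name__ == "__main__":
    print(page_title("http://www.google.com"))
```

Because the browser itself runs headless, this variant should not depend on a virtual display, which may make it usable on Windows as well.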

1
  • 1
    Note that using pyvirtualdisplay with visible=False requires Xvfb, and therefore this cannot be used on a Windows machine.
    – Niko Fohr
    Commented Oct 17, 2017 at 6:19
1

As others have said, you can use Selenium. However, I recommend opening your browser's developer tools and watching the network requests the site makes. Depending on how the page behaves, you may be able to reproduce those requests with the Python requests module; personally, I think that is simpler. If you can't emulate the requests, then use Selenium.
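To make the idea concrete, here is a rough sketch of replaying a login and a search without a browser. It uses only the standard library (urllib and http.cookiejar) instead of requests so that it runs without extra packages; the same approach should carry over to a requests.Session. The URLs, form field names, and query parameter are placeholders, not COCA's real ones; replace them with whatever the Network tab shows.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Hypothetical endpoints and field names: take the real ones from the
# requests observed in the browser's Network tab.
LOGIN_URL = "https://example.com/login"
SEARCH_URL = "https://example.com/search"


def build_login_request(email, password):
    """Build the POST request that the login form would send."""
    form = urllib.parse.urlencode({"email": email, "password": password})
    return urllib.request.Request(
        LOGIN_URL, data=form.encode("utf-8"), method="POST"
    )


def build_search_request(word):
    """Build the GET request for a search, with the word URL-encoded."""
    query = urllib.parse.urlencode({"q": word})
    return urllib.request.Request(SEARCH_URL + "?" + query)


def make_opener():
    """Return an opener that keeps cookies, so the login persists."""
    jar = http.cookiejar.CookieJar()
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


def fetch_results(email, password, word):
    """Log in, run one search, and return the raw response body as text."""
    opener = make_opener()
    with opener.open(build_login_request(email, password), timeout=30):
        pass  # only the session cookie set by the server matters here
    with opener.open(build_search_request(word), timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")
```

The text returned by `fetch_results` would still need to be parsed; how to do that depends on whether the site answers with HTML or JSON, which you can also see in the developer tools.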
