6

I am trying to scrape a website. I have tried two methods, but neither gives me the full page source that I am looking for. I want to scrape the news titles from the URL below.

URL: "https://www.todayonline.com/"

These are the two methods I tried, both of which failed.

Method 1: Beautiful Soup

    import requests
    from bs4 import BeautifulSoup

    tdy_url = "https://www.todayonline.com/"
    page = requests.get(tdy_url).text
    soup = BeautifulSoup(page, "html.parser")
    soup  # Returns HTML that contains mostly JavaScript text
    soup.find_all('h3')  # Returns an empty list []

Method 2: Selenium + BeautifulSoup

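    import time

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    tdy_url = "https://www.todayonline.com/"
    options = Options()
    options.add_argument("--headless")  # run Chrome without a visible window
    driver = webdriver.Chrome(options=options)  # expects chromedriver to be discoverable
    driver.get(tdy_url)
    time.sleep(10)  # crude wait for the JavaScript to finish loading
    html = driver.page_source
    soup = BeautifulSoup(html, "html.parser")
    soup.find_all('h3')  # Returns less than 1/4 of the 'h3' tags found in the original page source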

Please help. I have tried scraping other news websites and it is so much easier. Thank you.

2
  • 1
    The news data on the website you are trying to scrape is fetched with JavaScript, and is not returned by the server. But in the first example you are getting just the page returned by the server -- neither requests nor BeautifulSoup execute JS. However, you can open the Firefox (Chromium) DevTools and take a look at which requests get the data from the server, and try to imitate them with requests then. It might be even easier than trying to do webscraping with BeautifulSoup. Commented Sep 6, 2020 at 8:37
  • 1
    See the @politicalscientist answer also. He does exactly what I described in the first comment. Commented Sep 6, 2020 at 8:40

4 Answers

5

The news data on the website you are trying to scrape is fetched from the server using JavaScript (this is called XHR -- XMLHttpRequest). This happens dynamically, while the page is loading or being scrolled, so the data is not included in the page that the server initially returns.

In the first example, you are getting only the page returned by the server -- without the news, but with JS that is supposed to get them. Neither requests nor BeautifulSoup can execute JS.

However, you can try to reproduce requests that are getting news titles from the server with Python requests. Do the following steps:

  1. Open the DevTools of your browser (usually by pressing F12 or Ctrl+Shift+I) and, in the Network tab, look at the requests that fetch the news titles from the server. Sometimes this is even easier than web scraping with BeautifulSoup.
  2. Copy the request link (right-click -> Copy -> Copy link), and pass it to requests.get(...).
  3. Call .json() on the response. It returns a dict that is easy to work with. To better understand the structure of the dict, I would recommend using pprint instead of plain print. Note that you have to do from pprint import pprint before using it.

Here is an example of the code that gets the titles from the main news on the page:

    import requests

    nodes = requests.get("https://www.todayonline.com/api/v3/news_feed/7").json()["nodes"]
    for node in nodes:
        print(node["node"]["title"])

If you want to scrape a different group of news (the items under another caption), change the number after news_feed/ in the request URL. To find the right number, filter the requests by "news_feed" in the DevTools and scroll down the news page.
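As a rough sketch of that idea, the helpers below collect titles from any feed number. They assume every feed returns the same {"nodes": [{"node": {"title": ...}}]} shape as feed 7, which has not been checked for other feeds. The names feed_url, extract_titles and fetch_titles are made up for illustration, and session is assumed to be a requests.Session (anything with a compatible get method works).

```python
def feed_url(feed_id):
    """Build the news_feed URL for a given feed number."""
    return f"https://www.todayonline.com/api/v3/news_feed/{feed_id}"


def extract_titles(payload):
    """Pull the titles out of one decoded JSON payload.

    Assumes the {"nodes": [{"node": {"title": ...}}]} layout seen for feed 7.
    """
    return [item["node"]["title"] for item in payload.get("nodes", [])]


def fetch_titles(feed_id, session, timeout=10):
    """Fetch one feed through `session` and return its titles.

    `session` is assumed to behave like requests.Session.
    """
    response = session.get(feed_url(feed_id), timeout=timeout)
    response.raise_for_status()  # fail loudly on HTTP errors
    return extract_titles(response.json())
```

With requests installed, fetch_titles(7, requests.Session()) should give the same titles as the loop above. For other groups, pass whichever feed numbers you see in the DevTools.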

Sometimes web sites have protection against bots (although the website you are trying to scrape doesn't). In such cases, you might need to do these steps as well.

0
    3

    You can access the data via the site's API (check the Network tab in the DevTools).


    For example,

        import requests

        url = "https://www.todayonline.com/api/v3/news_feed/7"
        data = requests.get(url).json()
    0
      2

      I would suggest a fairly simple approach:

          import requests
          from bs4 import BeautifulSoup as bs

          page = requests.get('https://www.todayonline.com/googlenews.xml').content
          soup = bs(page)
          news = [i.text for i in soup.find_all('news:title')]
          print(news)

      output

      ['DBS named world’s best bank by New York-based financial publication', 'Russia has very serious questions to answer on Navalny - UK', "Exclusive: 90% of China's Sinovac employees, families took coronavirus vaccine - CEO", 'Three militants killed after fatal attack on policeman in Tunisia', .....] 

      Also, you can check the XML page for more information if required.
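If you would rather not depend on BeautifulSoup for this, the same titles can probably be read with the standard library's XML parser. This is only a sketch: it assumes the file follows the usual Google News sitemap layout, with the news prefix bound to the namespace URI shown below, and parse_news_titles is a name made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: the feed uses the standard Google News sitemap namespace.
NAMESPACES = {"news": "http://www.google.com/schemas/sitemap-news/0.9"}


def parse_news_titles(xml_bytes):
    """Return the text of every <news:title> element in the document."""
    root = ET.fromstring(xml_bytes)
    return [
        element.text.strip()
        for element in root.iterfind(".//news:title", NAMESPACES)
        if element.text  # skip empty title elements
    ]
```

Pass it the bytes you already downloaded, for example parse_news_titles(page) with page taken from the snippet above.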

      P.S. Always check for the compliance before scraping any website :)

        0

        There are different ways of gathering the content of a webpage that is rendered with JavaScript:

        1. Using Selenium with the Firefox web driver
        2. Using a headless browser with PhantomJS
        3. Making an API call using a REST client or the Python requests library

        You have to do your research first
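For the Selenium route, one possible reason for seeing only some of the h3 tags is that more items load as the page is scrolled. That is an assumption, not something verified for this site. A minimal sketch of scrolling until the page stops growing could look like this. scroll_until_stable is a hypothetical helper, and driver is assumed to be a Selenium WebDriver (only its execute_script method is used).

```python
import time


def scroll_until_stable(driver, pause=1.0, max_rounds=20):
    """Scroll to the bottom repeatedly until the page height stops growing.

    Returns the number of scrolls performed (at most max_rounds).
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_no in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to fetch more items
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return round_no  # nothing new was loaded by this scroll
        last_height = new_height
    return max_rounds
```

You would call scroll_until_stable(driver) after driver.get(...) and before reading driver.page_source. A fixed pause is crude, so tune it (or max_rounds) to how slowly the page loads.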
