
I have a script that iterates through a series of URLs that share the same pattern so it can parse them with BeautifulSoup. The URL structure is an .aspx page whose route ends in a sequential ID, very similar to this one here.

The challenge in my case is that I'm iterating through several routes and don't know the final ID, so I don't know when to stop looping through the series.

My attempted remedy can be found in the code sample below. There you see my soup function, which makes a request to the URL and returns the soup when the request comes back with a good status (200).

The next function is a binary search over the range I supply; its while loop stops running once it has narrowed down the last successful URL request.

    import requests
    from bs4 import BeautifulSoup

    ####
    #
    # Souped Up just makes the request and passes you the soup to parse
    # when there is one available. Just pass the URL.
    #
    ####
    def soupedUp(url):
        theRequest = requests.get(url, allow_redirects=False)
        if theRequest.status_code == 200:
            soup = BeautifulSoup(theRequest.text, "lxml")
        else:
            soup = None
        return soup

    def binarySearch(theRange):
        first = 0
        last = len(theRange) - 1
        while first <= last:
            middle = (first + last) // 2
            if soupedUp(url + str(middle)) is None:
                last = middle - 1
            else:
                first = middle + 1
        return middle

    url = 'http://cornpalace.com/gallery.aspx?PID='
    print(binarySearch(range(1, 10000000)))

My functions are working, but I feel there may be a faster, simpler, or cleaner approach to finding the last route in this URL series.

Does anyone have a simpler way to handle this sort of looping while scraping URLs that follow the same pattern?

I'd be happy to see another approach to this, or any Python modules that already offer this type of URL probing.

  • stackoverflow.com/questions/212358/… Might help. – coldspeed, Aug 4, 2017 at 1:28
  • @cᴏʟᴅsᴘᴇᴇᴅ thanks. This was along the lines of what I was looking for. A method that could simplify my ability to identify the final route I have to parse. – Dom DaFonte, Aug 4, 2017 at 2:58

1 Answer


If I understand the problem correctly, the goal is still to parse all the available pages - you can start an endless loop and break out of it once the status code is not 200. We can also reuse the same web-scraping session to improve performance:

    URL_TEMPLATE = 'http://cornpalace.com/gallery.aspx?PID={page}'

    with requests.Session() as session:
        page = 1
        while True:
            response = session.get(URL_TEMPLATE.format(page=page), allow_redirects=False)
            if response.status_code != 200:
                break

            print("Processing page #{page}".format(page=page))
            soup = BeautifulSoup(response.text, "lxml")
            # parse the page

            page += 1
  • While this would work, unless you expect a very small number of pages, you may want to consider jumping ahead by more than one and then going back to find the exact number. – Michael Mior, Aug 4, 2017 at 1:36
  • @MichaelMior if I understood the problem right, the OP still wants to parse all the pages available..if the goal is to find the maximum number of pages, then binary search looks compelling..would do it in log N time..thanks – alecxe, Aug 4, 2017 at 1:38
  • If that's the case, you're right there's no point in jumping ahead. But you can't do binary search since you don't know the end. If you did, then there wouldn't be a need for it. You could certainly take a similar approach and probably come up with a similar runtime however. – Michael Mior, Aug 4, 2017 at 1:42
  • @MichaelMior right, I think we may use the context we are doing binary search in - there is probably some information which can help us to see what number of pages can be available..some practical limit. May be there is even the number of total results on the search result page and, then, taking into account results per page count, we can calculate the number of pages..in this case, we would not even need the binary search though :) – alecxe, Aug 4, 2017 at 1:45
  • Thanks. I'm upvoting because the Sessions method will speed up my operation, but this doesn't answer my question. I want to know the last page of each route. Knowing that will help me keep track of this operation and remove wasted requests. I currently stop my loop when I get 100 invalid requests. @MichaelMior is correct, I loop through a lot of pages. I first loop through 20-30 routes and for each of those I loop through between 2000 and 34k which I parse & insert into a db. It takes days to complete which is why I want to track the number of required requests and remove wasted requests. – Dom DaFonte, Aug 4, 2017 at 3:30
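For anyone weighing the trade-off discussed in the comments above, here is a rough sketch (not from the original thread) of the "jump ahead, then go back" idea: double the PID until a request fails, then binary-search the last gap to find the final valid page. It assumes that a non-200 status marks the end of a route and that valid IDs are contiguous; `URL_TEMPLATE`, `page_exists`, and `find_last_page` are illustrative names, not part of the original code.

    import requests

    # Assumed route pattern, taken from the question's example URL.
    URL_TEMPLATE = 'http://cornpalace.com/gallery.aspx?PID={page}'


    def page_exists(session, page):
        """Treat an HTTP 200 (without following redirects) as 'this PID is valid'."""
        response = session.get(URL_TEMPLATE.format(page=page), allow_redirects=False)
        return response.status_code == 200


    def find_last_page(session):
        """Find the highest valid PID without knowing an upper bound in advance."""
        if not page_exists(session, 1):
            return 0  # the route has no valid pages at all

        # Exponential phase: keep doubling until we overshoot the last valid page.
        low, high = 1, 2
        while page_exists(session, high):
            low, high = high, high * 2

        # Binary phase: `low` is known valid, `high` is known invalid; close the gap.
        while low + 1 < high:
            middle = (low + high) // 2
            if page_exists(session, middle):
                low = middle
            else:
                high = middle
        return low


    with requests.Session() as session:
        print(find_last_page(session))

If the IDs really are contiguous up to some last page N, this makes roughly 2·log2(N) requests instead of N, which matters when a route runs into the tens of thousands of pages.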
