I have a script that iterates through a series of URLs that share the same pattern and parses each one with BeautifulSoup. The URL structure is an .aspx page whose route ends in a sequential ID, very similar to the one below.
The challenge in my case is that I'm iterating through several such routes and don't know the endpoint in advance, so I don't know when to stop looping through the series.
My attempted remedy can be found in the code sample below. The soupedUp function makes a request to the URL and returns the soup when the request comes back with a good status (200).
The next function is a binary search that narrows the range I give it with a while loop, which stops once it has found the last successful URL request.
    import requests
    from bs4 import BeautifulSoup

    ####
    #
    # Souped Up just makes the request and passes you the soup to parse
    # when there is one available. Just pass the URL.
    #
    ####
    def soupedUp(url):
        theRequest = requests.get(url, allow_redirects=False)
        if theRequest.status_code == 200:
            soup = BeautifulSoup(theRequest.text, "lxml")
        else:
            soup = None
        return soup

    def binarySearch(theRange):
        # Search the actual ID values in the range, not 0-based indexes.
        first = theRange[0]
        last = theRange[-1]
        while first <= last:
            middle = (first + last) // 2
            if soupedUp(url + str(middle)) is None:
                last = middle - 1
            else:
                first = middle + 1
        # When the loop exits, `last` is the highest ID that returned 200;
        # returning `middle` could be off by one.
        return last

    url = 'http://cornpalace.com/gallery.aspx?PID='
    print(binarySearch(range(1, 10000000)))
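One variation I've been considering, since the upper bound isn't known in advance: bracket it first with an exponential (galloping) probe, doubling the ID until a request fails, then binary search between the last success and the first failure. This is only a sketch; the `exists` helper, the use of HEAD requests, and the assumption that the starting ID itself exists are all mine, not part of the script above:

```python
import requests

def exists(url):
    # HEAD avoids downloading the page body; some servers don't
    # support HEAD, in which case you'd fall back to GET.
    return requests.head(url, allow_redirects=False).status_code == 200

def find_last_id(base_url, start=1):
    """Return the highest sequential ID that responds with 200.

    Assumes `start` itself exists and that valid IDs are contiguous.
    """
    # Gallop: double `hi` until we find an ID that fails.
    lo, hi = start, start
    while exists(base_url + str(hi)):
        lo, hi = hi, hi * 2
    # Invariant: `lo` is known good, `hi` is known bad.
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if exists(base_url + str(mid)):
            lo = mid
        else:
            hi = mid
    return lo
```

The gallop costs O(log n) requests to find a bracket, and the binary search another O(log n), so the total request count stays logarithmic without ever hard-coding a ceiling like 10000000.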
My functions are working, but I suspect there is a faster, simpler, or cleaner approach to finding the last route in this URL series.
Does anyone have a simpler way to handle this sort of looping through same-pattern URLs while scraping?
I'd be happy to see another approach, or any Python modules that already offer this type of URL probing.