Use top-level functions
Reading URLs from a file and checking each of them for vacancies is done at the top level of the file. You should enclose this code in functions and put the remaining top-level code under `if __name__ == '__main__'` (as sketched below) to:

- separate the intents;
- reuse the code more easily;
- not have random code execution when `import`ing the module.
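A minimal sketch of that layout, assuming the reading and checking logic ends up in two functions (their bodies are elided here and filled in further down):

```python
import csv
import urllib2


def read_urls(filename):
    # Reading logic goes here.
    pass


def check_vacancies(filename):
    # Checking logic goes here, using read_urls(filename).
    pass


if __name__ == '__main__':
    # Only runs when the file is executed as a script, not when it is imported.
    check_vacancies('25.csv')
```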
Use filter
You are only interested in knowing which URLs have vacancies. The easiest way to perform that is:

```python
interesting_urls = filter(has_vacancies, urls)
```

with the proper definition of `has_vacancies`.
You also only print `'Has vacancies'` when a given URL has some, leaving the user to figure out which one that actually is. You might want to print the URL instead to make your script more usable:

```python
print 'Following URLs have vacancies'
for url in filter(has_vacancies, urls):
    print url
```
Use generators
There is no need to store all the URLs you’ll be checking up-front. Yield them from a generator since only one is processed at a time:
```python
def read_urls(filename):
    with open(filename, 'rb') as spreadsheet:
        reader = csv.reader(spreadsheet)
        for row in reader:
            yield row[0].strip()
```
It also lets you parametrize which file to read, in case you have several of them.
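For instance, the same generator works for any spreadsheet you point it at (the second filename here is purely illustrative):

```python
for url in read_urls('25.csv'):
    print url

for url in read_urls('another_listing.csv'):
    print url
```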
Do not force OOP when not required
You are not interested in storing state about each URL. You just want a simple question answered: "does this URL point to a page that provides vacancies?" Thus a simple function will do, for each URL:

```python
def has_vacancies(url):
    source = urllib2.urlopen(url).read()
    return not(any(text for text in NO_VACANCIES if text in source.lower()))
```
See what I did there? This is the exact function we need for the `filter` I introduced earlier.
Remove extraneous computation
Calling `lower()` on a string has a cost. Calling it once for every piece of text you are looking for in said string is even more costly. You should call it only once, right after fetching the source of the URL.
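For instance, keeping the original expression but lowering the fetched source a single time:

```python
import urllib2

NO_VACANCIES = ['no vacancies', 'not hiring']


def has_vacancies(url):
    # lower() is applied once to the whole page, not once per candidate text.
    source = urllib2.urlopen(url).read().lower()
    return not(any(text for text in NO_VACANCIES if text in source))
```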
Of lesser interest, `any` is converting pieces of text to their truthy value after the test `text in source` already returned `True`. You also store `source`, `self.url` and `self.source` (and, to a lesser extent, `page`) without really needing them. But I’ve shown ways to avoid that.
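To make the difference concrete, here are the two `any` forms side by side on a made-up source string:

```python
NO_VACANCIES = ['no vacancies', 'not hiring']
source = 'sorry, we have no vacancies at the moment'

# Original form: yields the matching strings, which any() then truth-tests again.
any(text for text in NO_VACANCIES if text in source)

# Simpler form: the generator yields booleans directly, so any() just short-circuits.
any(text in source for text in NO_VACANCIES)
```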
Proposed improvements
```python
import csv
import urllib2

NO_VACANCIES = ['no vacancies', 'not hiring']


def has_vacancies(url):
    source = urllib2.urlopen(url).read().lower()
    return not any(text in source for text in NO_VACANCIES)


def read_urls(filename):
    with open(filename, 'rb') as spreadsheet:
        reader = csv.reader(spreadsheet)
        for row in reader:
            yield row[0].strip()


def check_vacancies(filename):
    print 'Following URLs have vacancies'
    for url in filter(has_vacancies, read_urls(filename)):
        print url


if __name__ == '__main__':
    check_vacancies('25.csv')
```
Going further
You might be interested in trying the `re` module to perform the text matches; there may be a performance improvement.
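A sketch of what that could look like, with a single compiled, case-insensitive alternation of the phrases (this particular pattern layout is just one possible choice):

```python
import re
import urllib2

NO_VACANCIES = ['no vacancies', 'not hiring']
# One compiled pattern matching any of the phrases, ignoring case.
NO_VACANCIES_RE = re.compile('|'.join(re.escape(text) for text in NO_VACANCIES),
                             re.IGNORECASE)


def has_vacancies(url):
    source = urllib2.urlopen(url).read()
    # search() returns None when no phrase is found, i.e. the page has vacancies.
    return NO_VACANCIES_RE.search(source) is None
```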
Alternatively, the `multiprocessing` module provides a `map_async` performed in parallel by a pool of workers. You might want to change `has_vacancies` a bit and use that instead of `filter` to increase overall performance.
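A possible sketch: `has_vacancies` is reworked into a helper that returns the URL itself (or `None`), so the parallel results can be filtered afterwards; the pool size is arbitrary and `read_urls` is the generator from above:

```python
import urllib2
from multiprocessing import Pool

NO_VACANCIES = ['no vacancies', 'not hiring']


def check_url(url):
    """Return the URL if its page has vacancies, None otherwise."""
    source = urllib2.urlopen(url).read().lower()
    if not any(text in source for text in NO_VACANCIES):
        return url
    return None


def check_vacancies(filename):
    pool = Pool(processes=4)  # Arbitrary pool size, tune to taste.
    async_result = pool.map_async(check_url, list(read_urls(filename)))
    pool.close()
    pool.join()
    print 'Following URLs have vacancies'
    for url in async_result.get():
        if url is not None:
            print url
```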
You may be interested in handling exceptions from `urllib2`, so that a sudden network failure or an invalid URL won't crash everything. Something simple will do:

```python
def has_vacancies(url):
    try:
        source = urllib2.urlopen(url).read()
    except urllib2.URLError:
        print 'Error processing', url
        return False
    else:
        source = source.lower()
        return not any(text in source for text in NO_VACANCIES)
```