I'm working on a Python application where I read and extract data from an HTML file. The data is stored in a list, and the list holds, on average, 50,000+ items. The items from the list are added to a database. I cannot allow duplicate datetime values in one of the columns, so I have to check whether a given value already exists in the database and, if it does, keep incrementing it until it no longer collides.
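To make the requirement concrete, here's a tiny self-contained illustration of the "increment until unique" rule (the timestamps are made-up values, not data from my file):

```python
from datetime import datetime, timedelta

# Timestamps that are already taken (in the real app these live in the database).
existing = {datetime(2015, 6, 1, 12, 0, 0), datetime(2015, 6, 1, 12, 0, 1)}

# A new item arrives with a timestamp that is already taken.
date = datetime(2015, 6, 1, 12, 0, 0)

# Bump it by one second at a time until it no longer collides.
while date in existing:
    date += timedelta(seconds=1)

print(date)  # 2015-06-01 12:00:02
```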
Here's what my code looks like (I'm using SQLAlchemy):
```python
import re
from bs4 import BeautifulSoup
from datetime import datetime, timedelta
from models import db, Notes


def import_file(filename, user):
    file = open(filename, 'r')
    html = file.read()
    file.close()
    soup = BeautifulSoup(html, "lxml")
    for link in soup.find_all('a'):
        validurl = re.compile(
            r'^(?:[a-z0-9\.\-]*)://'
            r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}(?<!-)\.?)|'
            r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|'
            r'\[?[A-F0-9]*:[A-F0-9:]+\]?)'
            r'(?::\d+)?'
            r'(?:/?|[/?]\S+)$', re.IGNORECASE)
        url = validurl.match(link.get('href'))
        if url:
            url = link.get('href')
            if len(url) <= 2000:
                if link.get('add_date'):
                    date = datetime.utcfromtimestamp(int(link.get('add_date')))
                    while Notes.query.filter_by(user=user, added_on=date).first():
                        date += timedelta(0, 1)
                    title = link.string
                    import_notes = Notes()
                    import_notes.main_url = url
                    import_notes.title = title
                    import_notes.user = user
                    import_notes.added_on = date
                    db.session.add(import_notes)
                    db.session.commit()
```
However, this code runs very slowly. I timed the inserts, and they came out to 13 seconds each. I ran a profiler, and the filter_by query was taking the majority of the time. I figured that was because the script has to run at least one query per item (and more whenever a date collides), so I changed my code to check against a static list of existing dates instead:
```python
import re
from bs4 import BeautifulSoup
from datetime import datetime, timedelta
from models import db, Notes


def import_file(filename, user):
    file = open(filename, 'r')
    html = file.read()
    file.close()
    soup = BeautifulSoup(html, "lxml")
    # Fetch the user's existing datetimes once, so the loop can check in memory.
    dates = [note.added_on for note in Notes.query.filter_by(user=user).all()]
    for link in soup.find_all('a'):
        validurl = re.compile(
            r'^(?:[a-z0-9\.\-]*)://'
            r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}(?<!-)\.?)|'
            r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|'
            r'\[?[A-F0-9]*:[A-F0-9:]+\]?)'
            r'(?::\d+)?'
            r'(?:/?|[/?]\S+)$', re.IGNORECASE)
        url = validurl.match(link.get('href'))
        if url:
            url = link.get('href')
            if len(url) <= 2000:
                if link.get('add_date'):
                    date = datetime.utcfromtimestamp(int(link.get('add_date')))
                    while date in dates:
                        date += timedelta(0, 1)
                    dates.append(date)
                    title = link.string
                    import_notes = Notes()
                    import_notes.main_url = url
                    import_notes.title = title
                    import_notes.user = user
                    import_notes.added_on = date
                    db.session.add(import_notes)
                    db.session.commit()
```
I thought this might make the code run faster, but there was little to no change in runtime. For comparison, if I remove the duplicate check, I get ~300 inserts per second.
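In case it matters, this is roughly how I'm measuring throughput. It isn't my exact harness, just a sketch of the idea; 'bookmarks.html', the user object, and the 50,000 row count stand in for my real input:

```python
import cProfile
import pstats
import time

# Rough throughput check: time the whole import and divide by the row count.
start = time.time()
import_file('bookmarks.html', user)   # placeholder filename and user object
elapsed = time.time() - start
print('%.3f seconds per insert' % (elapsed / 50000))

# Profile the import to see where the time goes.
cProfile.run("import_file('bookmarks.html', user)", 'import.prof')
pstats.Stats('import.prof').sort_stats('cumulative').print_stats(10)
```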
I cannot let duplicate datetime values into the database. What can I do to make my code faster?