Checking for duplicate values in database with Python

Question

I'm working on a Python application where I read and extract data from an HTML file. The data is stored in a list, and the number of items in a list is, on average, 50,000+. The items from the list are added to a database. I cannot allow duplicate datetime values in one of the columns, and so, I have to check if a specific value exists in the database already, and if it does, keep incrementing it until there's no duplicates.

Here's what my code looks like (I'm using SQLAlchemy):

import re from bs4 import BeautifulSoup from datetime import datetime, timedelta from models import db, Notes def import_file(filename, user): file = open(filename, 'r') html = file.read() file.close() soup = BeautifulSoup(html, "lxml") for link in soup.find_all('a'): validurl = re.compile( r'^(?:[a-z0-9\.\-]*)://' r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}(?<!-)\.?)|' r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|' r'\[?[A-F0-9]*:[A-F0-9:]+\]?)' r'(?::\d+)?' r'(?:/?|[/?]\S+)$', re.IGNORECASE) url = validurl.match(link.get('href')) if url: url = link.get('href') if len(url) <= 2000: if link.get('add_date'): date = datetime.utcfromtimestamp(int(link.get('add_date'))) while Notes.query.filter_by(user=user, added_on=date).first(): date += timedelta(0, 1) title = link.string import_notes = Notes() import_notes.main_url = url import_notes.title = title import_notes.user = user import_notes.added_on = date db.session.add(import_notes) db.session.commit()

However, this code runs very slowly. I calculated how long it was taking per insert, and it came out at exactly 13 seconds per insert. I ran a profiler, and the filter_by query was taking the majority of the time. I figured it was because the script had to run a query so many times per item, and so I changed my code to use a static list instead:

import re from bs4 import BeautifulSoup from datetime import datetime, timedelta from models import db, Notes def import_file(filename, user): file = open(filename, 'r') html = file.read() file.close() soup = BeautifulSoup(html, "lxml") for link in soup.find_all('a'): validurl = re.compile( r'^(?:[a-z0-9\.\-]*)://' r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}(?<!-)\.?)|' r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|' r'\[?[A-F0-9]*:[A-F0-9:]+\]?)' r'(?::\d+)?' r'(?:/?|[/?]\S+)$', re.IGNORECASE) url = validurl.match(link.get('href')) if url: url = link.get('href') if len(url) <= 2000: if link.get('add_date'): date = datetime.utcfromtimestamp(int(link.get('add_date'))) while Notes.query.filter_by(user=user, added_on=date).first(): date += timedelta(0, 1) title = link.string import_notes = Notes() import_notes.main_url = url import_notes.title = title import_notes.user = user import_notes.added_on = date db.session.add(import_notes) db.session.commit()

I thought this might make the code run faster, but there was little to no change in runtime. For comparison, if I remove the duplicate check, I get ~300 inserts per second.

I cannot let duplicate datetime values into the database. What can I do to make my code faster?

Why do you have that requirement about no multiple dates though? — ferada, CommentedFeb 13, 2015 at 14:32
I am using the dates for pagination. I can't use ids since I allow imports, and the dates on imported items can be older, but the id would then be higher. If I have too many duplicate dates on a page, then the forward/backward buttons stop working(ie, they redirect to the same page). — Nik Barres, CommentedFeb 13, 2015 at 14:43
Maybe you should try using a set, checking if a value is inside a set is extremely fast. — Caridorc, CommentedFeb 13, 2015 at 18:59
@Caridorc your suggestion to use a set solved my problem. Down from 13 seconds per insert to 0.009. If you add your comment as an answer, I'll accept it. — Nik Barres, CommentedFeb 14, 2015 at 12:38

Caridorc · Accepted Answer · 2015-02-14 12:59:38Z

In python checking membership of an item in a set is extremely fast.

I suggest you use sets.

Community · Accepted Answer · 2017-04-13 12:40:52Z

In addition to @Caridorc's answer about using sets instead of lists, a few more things beg for improvement.

Working with files

Use with ... as ... when working with files, to make sure that you cannot forget closing them. So instead of this:

file = open(filename, 'r') html = file.read() file.close()

Write like this:

with open(filename) as fh: html = fh.read()

(I also omitted the mode parameter of open, as 'r' is the default.

Move non-changing calculation out of a loop

In this loop, validurl is the same in every iteration:

for link in soup.find_all('a'): validurl = re.compile( r'^(?:[a-z0-9\.\-]*)://' r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}(?<!-)\.?)|' r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|' r'\[?[A-F0-9]*:[A-F0-9:]+\]?)' r'(?::\d+)?' r'(?:/?|[/?]\S+)$', re.IGNORECASE) url = validurl.match(link.get('href'))

It should be done just once before the loop begings:

validurl = re.compile( r'^(?:[a-z0-9\.\-]*)://' r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}(?<!-)\.?)|' r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|' r'\[?[A-F0-9]*:[A-F0-9:]+\]?)' r'(?::\d+)?' r'(?:/?|[/?]\S+)$', re.IGNORECASE) for link in soup.find_all('a'): url = validurl.match(link.get('href'))

Indentation

Indentation in Python is very important, and so it's important to do it consistently. For example this looks suspicious:

if url: url = link.get('href') if len(url) <= 2000: if link.get('add_date'):

In most of the program you correctly indented with 4 spaces, but after if url there you indented by 8 spaces. This is confusing, and it should be corrected.

There's a typo: html = file.read() should really be html = fh.read(). Also, you could mention that file is a reserved word in python, a synonym of open. And that's why you changed the variable name to fh. — user1016274, CommentedNov 8, 2015 at 21:16
Thanks for that! I fixed the typo. As for the reserved word... Is it really? In Python 2.x it's a class. In Python 3 it doesn't look like a reserved word. I could object about Python 2, but the version used is not clear, and since using Python 3 is recommended anyway, I don't feel it's important enough to bring it up. If you think otherwise, I'm open to suggestions — janos, CommentedNov 8, 2015 at 21:25
Agreed, all points valid. The distinction between a function and a class (constructor) wasn't clear to me. I'd avoid using this name for clarity, anyway. — user1016274, CommentedNov 8, 2015 at 21:34

Stack Exchange Network

Checking for duplicate values in database with Python

2 Answers 2

Working with files

Move non-changing calculation out of a loop

Indentation

Hot Network Questions

Checking for duplicate values in database with Python

2 Answers 2

Working with files

Move non-changing calculation out of a loop

Indentation

Related

Hot Network Questions