Python Hash Table Implementation

Question

I'm attempting to implement a basic hash table in Python using only lists. Any tips would be appreciated (including better hash functions). I intend this to handle collisions with separate chaining.

Are there any features I didn't implement that are standard in hash tables?
Are there any things I handled incorrectly or could have implemented in a better way?

My implementation:

class HashTable(object): table = [None] * 256 def get_value(self, key): total = 0 for i in range(len(key)): total += ord(key[i]) * (7**i) return (len(key) * total) % 256 def insert(self, key): val = self.get_value(key) if self.table[val] == None: self.table[val] = key else: if type(self.table[val]) == list: self.table[val].append(key) else: self.table[val] = [self.table[val], key] def delete(self, key): val = self.get_value(key) if self.table[val] != None: if type(self.table[val]) == list: i = self.table[val].index(key) self.table[val][i] = None else: self.table[val] = None else: KeyError() def lookup(self, key): found = False val = self.get_value(key) if type(self.table[val]) == list: found = key in self.table[val] else: found = self.table[val] == key return found

Note: This works on both Python 2 and should work in Python 3

@Quill This was written in Python 2.7, although I'm not sure I used any features that aren't in Python3 either. — Brandon Deo, CommentedJan 28, 2016 at 5:48
@drexx Generally people will tag both Python-3.x and Python-2.7 if they're looking to make code suitable for both cases, it's worth also saying that's a goal in the question so that people know to examine that. — SuperBiasedMan, CommentedFeb 1, 2016 at 9:57
@SuperBiasedMan Use this tag along with the main python tag to denote programs that are meant to be run on a Python 2 interpreter only. Do not mix this tag with the python-3.x tag.: codereview.stackexchange.com/edit-tag-wiki/3534 — dfhwze, CommentedAug 3, 2019 at 16:51

J_H · Accepted Answer · 2017-09-15 15:09:09Z

Looks nice enough, pretty straightforward, except that it implements a MultiSet rather than the stated Dictionary mapping. I agree with ppperry that simplifying get_value() by using hash() makes sense. If reimplementing is necessary, I'd probably go with java's hash function just because it is well documented, well understood, and known not to have ugly collision pathologies. If you really want to keep your own, then the 7 ** i expression seems more expensive than necessary. Try maintaining an accumulator that starts at 1 and gets acc *= 7 each time through the loop, it may bench faster.

In insert(), this seems like a regrettable design choice:

 if self.table[val] == None: self.table[val] = key

If you'd instead assigned [key] then some special casing simply disappears. For that matter, it would be useful to replace None with [].

This line:

 self.table[val] = [self.table[val], key]

winds up doing what I suggest, in a deferred way, so the complex invariant you're maintaining is "an entry is None or a single non-colliding key or a list of colliding keys", rather than an easily maintained invariant of "an entry is a list of zero or more keys". If the complexity is justified by reduced memory consumption due to fewer list objects, then make that design decision explicit in a comment.

Caller is prohibited from storing the reserved value None, but insert() doesn't have a comment that only strings (only sequences one can call ord() on) are acceptable. A bad input of None will be discovered one level down on the call stack, in get_value(). This is not so bad - calls involving illegal input will soon be debugged. But consider adding asserts for preconditions so that bugs are quickly isolated to the relevant layer, rather than potentially being propagated a few levels down the stack.

In lookup(), this is linear with number of collisions:

 if type(self.table[val]) == list: found = key in self.table[val]

You may as well replace list with set, for faster lookups. Curiously enough, insert() does not check for key prior to inserting key, so repeated insertions lead to a long (always colliding) list. So rather than Set, you wind up offering MultiSet (or Bag) semantics. You supplied no unit tests that show the behavior of dup key insertion.

In delete()

 i = self.table[val].index(key) self.table[val][i] = None

you accumulate lots of [foo, None, None, bar, None] trash after repeated insertions and deletions, leading to very long running times for lookup(). The .index() is linear, as with in above. It would be natural to del the i-th entry, rather than assign None to it. This also has linear cost, which again becomes smaller if you switch from list to set.

slallum · Accepted Answer · 2016-01-28 08:03:34Z

I'm pretty sure you implemented the necessary features in a standard hash table. If you want the hash table to populate only unique values then you need to change your insert method by looking up the value before inserting it.

Here are some things you did incorrectly or can do better in my opinion:

table = [None] * 256

table is static right now, which means that any instance of the class will have the same table variable. You need to initiate it in __init__.

def get_value(self, key):

get_value is a method that should not be called by the user of the class. I recommend making is private by changing its name to _get_value.

def insert(self, key): val = self.get_value(key) if self.table[val] == None: self.table[val] = key else: if type(self.table[val]) == list: self.table[val].append(key) else: self.table[val] = [self.table[val], key]

Why starting with a single value and only after that changing to a list? I recommend starting with a list right from the start. As said by python's module this:

Special cases aren't special enough to break the rules.

That way, you can start with a table of empty lists right from the start, this will simplify your insert and lookup methods.

def delete(self, key): val = self.get_value(key) if self.table[val] != None: if type(self.table[val]) == list: i = self.table[val].index(key) self.table[val][i] = None

Making the value None can be dangerous - what if the user inserts lot's of values and then removes all of them? the lookup method will then take much more time than required. Although it takes more time, I still think list.remove is the right thing to do here.

 ... else: KeyError()

You need to raise the KeyError. Also - it's better to put in the error the right message. something like: raise KeyError("key {key} can't be found.".format(key=key)

SuperBiasedMan · Accepted Answer · 2016-02-01 10:03:52Z

I agree with slallum that you ought to just use lists from the start, but I would also note that this is not how you test for a list:

if type(self.table[val]) == list:

In general type(var) == type is discouraged, because it wont accept inheritance. The most obvious example is that a DefaultDict is a special type of dictionary. It has some extra functionality but mostly is the same as a regular dictionary. Yet if you tried this test, you would be told that they're different types. The way to compare type directly is to use the function isinstance. Which will test if a variable is an instance of a type, or any type that inherits from it:

if isinstance(self.table[val], list):

But generally, it's more Pythonic to test something based on whether or not it can do what you need. For instance, you might want to attempt to append to the value, and if it cannot take an append then it must not be a list and you should instead create one. Like so:

try: self.table[val].append(key) except AttributeError: self.table[val] = [self.table[val], key]

This will attempt to append to the list, but if it cannot that means that it's not a list and will instantiate one there.

pppery · Accepted Answer · 2016-06-02 00:00:03Z

Python already has a built-in hash function, so you can simplify your get_value method to this:

def get_value(self, key): return hash(key)%256

As a bonus, your hash table now works with more than just strings.

The get_value method does not do anything with the HashTable instance, so it should be moved out of the class:

def get_value(key): return hash(key)%256 class HashTable(object): ...

The built in dict type in Python will automatically get bigger as you add more stuff, unlike your HashTable class.

oliverpool · Accepted Answer · 2016-01-28 08:05:28Z

A first step could be to add docstrings to your methods: it helps your readers understand the goal of your methods and it help yourself know what those methods should and should not do.

Then you should check the user input: I suppose, that the key variable is a string. You should enforce it and raise a TypeError when it is not the case (for example a list of strings should not be accepted).

I would rename the get_value method to get_hash, to show that it computes a hash between 0 and 255.

You currently store the actual values either directly or in a list. I think you should stick to one data structure: a list, or even better a set (it has add and remove method just as you need).

You could also use the same conventions as the set for your public method naming: add, remove (and I would suggest has instead of lookup).

You could also implement some of the python magic methods like __contains__(self, item) and __delitem__(self, key)

set is implemented by hash table, so it would be "cheating" using set instead of regular list. — slallum, CommentedJan 28, 2016 at 8:12
Fair enough. But one could also use a list as a hash table: elt in list, list.remove(elt) (with some methods to prevent duplication of the elements). I think we should add the [reinventing-the-wheel] tag ;-) — oliverpool, CommentedJan 28, 2016 at 8:17

Stack Exchange Network

Python Hash Table Implementation

5 Answers 5

Linked

Hot Network Questions

Python Hash Table Implementation

5 Answers 5

Linked

Related

Hot Network Questions