Sorting in linear time using Trie Data Structure

Question

I have written an article on a new way of non comparison sorting and I want it to get peer reviewed. The sorting algorithm uses Trie data structure to store digits of numbers as nodes and iterating over each node 10 times (k) in a depth first traversal returns the sorted list. Time complexity O(n log m), log to the base k given a list of 'n' numbers in base 'k' and 'm' digits in the highest number in the list. 'm' doesn't depend on 'n'. Kinda like radix sort but still different.

I'm pretty sure the solution can be improved a lot or not be legit at all, hence I am trying to reach out to people who can help me on this. Can you please take a look at it and give me your feedback?

import collections class TrieNode: def __init__(self): self.children = collections.defaultdict(TrieNode) self.count = 0 class Trie: def __init__(self): self.root = TrieNode() def insert(self, word): current = self.root for w in word: current = current.children[w] current.count += 1 class TrieSort: def __init__(self): self.res = [] self.max_num_len = 0 def buildTrie(self, nums): trie = Trie() self.max_num_len = len(str(max(nums))) # We get the highest number's max value of digits for num in nums: num_len = len(str(num)) zero_len = self.max_num_len - num_len trie.insert('0'* zero_len + str(num)) return trie.root def trieSort(self, node, word, depth): for i in range(0, 10): i = str(i) if not node.children.get(i): continue if depth == self.max_num_len-1: self.res += [int(word+i)] * node.children.get(i).count self.trieSort(node.children.get(i), word + i, depth +1) def sort(self, nums): trie = self.buildTrie(nums) # we start at root node with an empty word and at depth zero self.trieSort(trie, "", 0) return self.res ts = TrieSort() ts.sort([113, 20, 3999, 4])

Medium Article : https://medium.com/@jo07/triesort-in-linear-time-be1d07d41679

Github repo : https://github.com/jo07/Trie-Sort

Jo,

Cheers.

Try some benchmarks, report a reproducible crossover with a run-of-the-mill heapsort. — greybeard, CommentedMar 15, 2022 at 14:46
(I didn't seem to pay enough attention to see where comparisons were needed in "the Medium Article".) — greybeard, CommentedMar 15, 2022 at 14:49
(See also: Jomy Sebastian@CS: Sorting in linear time using Trie Data Structure.) — greybeard, CommentedMar 15, 2022 at 15:03

mdfst13 · Accepted Answer · 2022-04-15 05:54:24Z

If I'm understanding you properly, you are claiming that the time complexity is \$\mathcal{O}(n \log_k{m})\$ where \$n\$ is the number of items to sort, \$m\$ is the number of digits, and \$k\$ is the base. However, that's not what I get.

Building the trie is \$\mathcal{O}(nm)\$. For each item to be sorted (\$n\$), you have to iterate through the digits (\$m\$) to insert it. That's a simple multiplication.

Reading the trie is \$\mathcal{O}(k^m)\$. For each possible digit, you check if it has children (in the code as presented; comments speculate that this is avoidable but this code does not avoid it). And the trie is \$m\$ deep. That's a simple exponential.

Sorting requires both building and reading, so it's \$\mathcal{O}(nm + k^m)\$.

We can define \$m\$ in terms of \$n\$. It's going to be \$\mathcal{\Omega}(\log{n})\$, because the most items you can support is limited by the number of digits in the numbers. Another way to say that is \$n\$ is \$\mathcal{O}(k^m)\$. But then we have \$\mathcal{O}(n \log{n} + n)\$, which is just \$\mathcal{O}(n \log{n})\$. That assumes that we have the largest \$n\$ possible for a given \$k^m\$. I.e. the trie is completely full.

It's probably more accurate to describe it as \$\mathcal{O}(nm + k^m)\$. If the trie is sparse, the \$k^m\$ will dominate. If the trie is fully populated, it ends up as \$\mathcal{O}(n \log{n})\$. The problem here is that \$m\$ is always at least \$\log{n}\$ and may be larger. So when \$n\$ is as large as can be, it devolves to the same \$\mathcal{O}(n \log{n})\$ as the best comparison sorts. When \$n\$ is smaller, it is strictly worse than those comparison sorts (although it may still beat quadratic sorts like bubble sort).

Consider the case of a trie with just a single entry. It still needs \$\mathcal{O}(m)\$ insertion time. And reading time is still \$\mathcal{O}(mk)\$, because it has to check each potential digit for children. So "sorting" the single element will take \$\mathcal{O}(mk)\$ time. Meanwhile, \$\log{n}\$ is 0 when \$n\$ is 1.

Perhaps another example would be helpful to illustrate the problem. Instead of TrieSort, let's consider ArraySort. To sort, we will simply mark all the given items as true (defaulting the others to false) in a Boolean array of size \$k^m\$ and then iterate over it to read. "Insertion" involves some arithmetic, which takes \$\mathcal{O}(m)\$ time per element. So \$\mathcal{O}(nm)\$. And reading is linear in the size of the array. So sorting takes \$\mathcal{O}(nm + k^m)\$ time. That looks familiar.

Yes, this ArraySort has the same worst case complexity as the TrieSort. The TrieSort might manage to do better, as you don't have to build out all the \$k^m\$ entries every time. The trie can skip nodes, where the array can't. But when fully populated, both are the same.

The essential problem here is that the array and the trie both insert in \$\mathcal{O}(m)\$ time. Which when \$n\$ is large, devolves to \$\mathcal{O}(\log{n})\$ time. So the cost of creating your sorted list is the same using them as for a traditional loglinear comparison sort in the worst case and is worse in the better cases (when \$\log{n}\$ is smaller than \$m\$).

I have neglected duplicates. As you note for your trie, adding a count property would fix that. For the array, switching from a Boolean to an integer type would have the same effect. That won't affect the asymptotic analysis significantly. It might help in practice in that now \$m\$ is no longer guaranteed to be at least as large as \$\log{n}\$.

Stack Exchange Network

Sorting in linear time using Trie Data Structure

1 Answer 1

Hot Network Questions

Sorting in linear time using Trie Data Structure

1 Answer 1

Related

Hot Network Questions