@LokiAstari hit pretty much all the points I'd have made, but I'd like to expand on one point and then make one philosophical point:
The use of std::map
is fine. But it does have O(log(n))
lookup.
std::map
is the C++ equivalent of Java's TreeMap. It's big (it stores about three pointers per node) and slow (lookups require following log(n)
of those pointers on average, and inserts can require rebalancing of the red-black tree, thus touching even more pointers) and therefore cache-unfriendly (a single tree lookup can dirty up to log(n)
cache lines, none of which will be helpful for the next tree lookup) and therefore even slower.
And you're not just using onestd::map
— you're using one std::map
per node in your trie, i.e., you're making a big pointer-based tree of big pointer-based trees. This is just absolutely awful for performance.
So yes, an array spNode *children[256]
would probably be preferable. Or it might be worth the effort to abstract out a data type like
template<typename Key, typename Value> struct FastMapForTries { static_assert(std::is_same_v<Key, char>, "Key type must be char"); Value favorite_children['z' - 'A' + 1]; std::map<Key, Value> other_children; Value& operator[](const Key& k) const { ... } };
just to reduce the size of a Trie::Node
from 257*8
back down to 59*8 + sizeof(std::map)
.
If you're sticking with the std::map
, then at least you should avoid looking things up in the map repeatedly.
if (temp->children.find(c) == temp->children.end()) {//if char not in map std::unique_ptr<Node> node(new Node()); temp->children[c] = std::move(node); } temp = temp->children[c].get();
I count three tree lookups in this short snippet. Try this instead:
auto it = temp->children.find(c); if (it == temp->children.end()) { // if char not in map it = temp->children.emplace_hint(it, c, std::make_unique<Node>()); } temp = it;
Now there's only two lookups. — You might say "only one lookup", but look again and notice (as I missed the first time I wrote this code) that emplace_hint
is only ever called with end()
as the hint! So using emplace_hint
here is exactly as inefficient as using unhinted emplace
. A really good static analysis tool might have caught this bug of mine.
So let's do this instead:
auto pair = temp->children.insert(c, nullptr); temp = pair.first; if (pair.second) { // if we just inserted nullptr *temp = std::make_unique<Node>(); }
Now there's only one tree lookup! If you have benchmarks for your trie code, you should see a ~3x speedup from just this one little refactor.
@Loki mentioned your awkwardly named trie.search(word)
method; he suggested trie.find(word)
. But I'd say that find
isn't the right word either. find
in the STL denotes "look for and return an iterator to", and (right now) your trie doesn't have iterators. So right now, the most STL-ish name for your method is trie.count(word)
, and it should return 0
or 1
.
Of course if you're not going for absolute STL-ish-ness, bool trie.contains(word)
would be the most user-friendly name.
But here's the philosophical point: When designing a data structure, you should treat observations such as "find
is the wrong name because find
returns an iterator" as questions to be seriously considered and responded to!
- I'd like to use the name
find
... - But
find
needs to return an iterator... - Is my trie iterable? No. Can one iterate a trie?...
- Yes, I suppose so (if we introduce parent pointers). An iterator would naturally look like a pointer to a leaf of the trie. Should I make my trie iterable?...
- Hmm. What else can one do with a pointer to a leaf of the trie? What operations are afforded by a pointer-to-leaf?...
- How about fast insertion of a suffix? For example, if I had a pointer to the last character of
"dog"
, I could quickly append a child-leaf to make "dogs"
. - Huh! That sounds a lot like the
emplace_hint
method we were just using! - What about a
lower_bound
method to go along with it?
...and suddenly we're inventing a rich and useful data structure that wasn't even on our radar at the beginning of the conversation. So, always take API-design questions seriously! :)
hash-map
here.std::map
has the same characteristics as a tree.std::unordered_map
is closer to ahash-map
but I believe has a much larger space requirements.\$\endgroup\$