I'm implementing a trie structure for a library, as an exercise in data structures.
Details
The tree structure is represented as a doubly chained tree. A struct trie
represents a node of the trie, and contains the following:
key
is the character represented by the node. Since keys are expected to be\0
-terminated strings, a node with a key value of\0
is a leaf node.sibling
is the next sibling of the node.child
is the first child of the node.val
at a leaf node is the value associated with the key string that terminates at that node. This is expected to beNULL
on internal nodes.
The following methods are implemented:
trie_init
initialises and returns a trie.trie_insert
inserts valueval
into triet
with keyk
.trie_finsert
makes a copy of the firstsize
bytes afterval
to somenew_val
, then addsnew_val
to triet
with keyk
. If there is already a value associated withk
, a destructor functionf
is called on the old value before it is overwritten.trie_remove
removes the value associated with keyk
from triet
.trie_fremove
callsf
on the value associated with keyk
in triet
before removing it.trie_seek
searches triet
for a value associated with keyk
. Returns it if it is found, returnsNULL
otherwise.trie_match
searches triet
for all values associated with keys that have prefixk
, and appends them to a vectorv
in alphabetic order (for instance, doingtrie_match(t, “te”, v)
on the trie in the example as it appears just before it is destroyed will add”TWO”
and”EIGHT”
tov
, in that order.)trie_list
is just shorthand fortrie_match(t, “”, v)
and appends all the elements int
tov
.trie_dump
prints the structure of the trie.trie_dumps
prints the structure of the trie, and also prints the values interpreting them as pointers to\0
-terminated strings.trie_free
destroys a trie.trie_ffree
callsf
on each of the values in the trie and then destroys the trie.
Please note that the vector operations in trie_match
and trie_list
depend on another part of the same library in the project. As it is not crucial most of the issues regarding which I wish to seek advice in this question, I’ve decided to comment them out in the code presented here.
Questions
Aside from critique on general style/practices, I wish to seek advice specifically on the following topics:
Interface design(1)
Most of the functions return a pointer to the trie upon completion, or NULL
on error. I don't have very compelling reasons to do this - it was originally because the recursive versions of some functions returned the subtree they were operating on, and they were kept after I put some of it into loops. This has the following effects:
Allows for usage such as
trie_insert(trie_insert(t, "str1", val1), "str2", val2)
, but doing that may not be a very good idea since one cannot check for errors that occur some layers down.Possibly encourages potentially misleading expressions such as
t1 = trie_sinert(t2, "str", val)
, which looks liket1
being a copy oft2
with an extra key-value pair added, but instead it just mutatest2
and pointst1
to it.
Would it then be preferable to return, for instance, integer constants that indicate success/failure?
Furthermore, in some cases, signalling errors by returning NULL
looks questionable: for instance, cases in which one wishes to distinguish between the cases of having a key-value pair "str"
-NULL
and having no value mapped to by "str"
. In both cases the return value for "trie_seek(t, "str")
is the same. Would the inability to make that distinction be a concern?
Interface design(2)
Currently two sets of functions are provided, as described above.
Originally, the intention was for the caller to choose to either always use functions from one or the other set on the same trie, depending on circumstances (for instance, they should use trie_insert
when they are using a trie to index a collection of structures, the contents of which expecting to be edited over time; also, when using many tries to maintain multiple ways to index/search through a single set of values.)
However, there is no way to prevent a caller from mixing the two families of functions on the same trie. I can’t think of any reason a caller would want to do this, but in the case that they do, neither trie_free
or trie_ffree
would be a proper destructor of the trie (one would be a segfault and the other a memory leak). This is obviously not very elegant, and I suspect introducing checks for this purpose would just make it messier.
I’ve been considering keeping only the f-prefixed set of functions. In cases where the caller would prefer to work with references, where they would’ve done trie_insert(t, “str”, val)
they would instead do trie_finsert(t, “str”, &val, sizeof(val), &free)
. This seems to solve many of the aforementioned questions, the single obvious drawback being somewhat cumbersome syntax and the need to juggle pointers to pointers a lot.
Would either the current interface, or the suggested alternative, be a sensible design?
Error management
the code includes some snippets that (attempt to) handle out-of-memory cases, mostly a check of return values of functions that allocate memory, an in case of failure, printing to stderr
, doing some cleanup, then returning a suitable value to indicate that some operation has failed. In its present state it doesn’t seem to interfere with readability that much, but I worry it may become undesirable once I perhaps move on to more complicated structures.
I wish to know whether these OOM-checks conform to common practices, and whether there will perhaps be better ways to do them.
trie.h
#ifndef DATAH_TRIE_H #define DATAH_TRIE_H #include <stdlib.h> //#include "vector.h" struct trie { char key; void *val; struct trie *sibling; struct trie *child; }; struct trie* trie_init(void); struct trie* internal_trie_insert(struct trie *t, const char *k, void *val, void (*f)(void *val)); struct trie* trie_insert(struct trie *t, const char *k, void *val); struct trie* trie_finsert(struct trie *t, const char *k, const void *val, size_t size, void (*f)(void *val)); struct trie* trie_remove(struct trie *t, const char *k); struct trie* trie_fremove(struct trie *t, const char *k, void (*f)(void *val)); struct trie* internal_trie_remove(struct trie *t, const char *k, void (*f)(void *val)); void* trie_seek(struct trie *t, const char *k); struct vector* trie_match(struct trie *t, const char *k, struct vector *v); struct vector* trie_list(struct trie *t, struct vector *v); struct vector* internal_trie_match(struct trie *t, struct vector *v); void trie_free(struct trie *t); void trie_ffree(struct trie *t, void (*f)(void *val)); void internal_trie_free(struct trie *t, void (*f)(void *val)); void trie_dump(struct trie *t); void internal_trie_dump(struct trie *t, int depth, int s); void trie_dumps(struct trie *t); #endif
trie.c
#include <stdio.h> #include <stdlib.h> #include <string.h> #include "trie.h" //#include "vector.h" struct trie* trie_init(void) { struct trie *t = malloc(sizeof(struct trie)); if (!t) { fprintf(stderr, "error: trie_init: out of memory.\n"); return NULL; } t->key = 0; t->val = NULL; t->sibling = t->child = NULL; return t; } struct trie* trie_insert(struct trie *t, const char *k, void *val){ return internal_trie_insert(t, k, val, NULL); } struct trie* trie_finsert(struct trie *t, const char *k, const void *val, size_t size, void (*f)(void *val)){ void *new_val = malloc(size); if (!new_val) { fprintf(stderr, "error: trie_finsert: out of memory.\n"); return NULL; } memcpy(new_val, val, size); return internal_trie_insert(t, k, new_val, f); } struct trie* internal_trie_insert(struct trie *t, const char *k, void *val, void (*f)(void *val)){ struct trie *orig_t = t; while (*k) { // move along the sibling chain for (; t->sibling && (t->sibling->key <= *k); t = t->sibling); if (t->key != *k) { struct trie *new_t = malloc(sizeof(struct trie)); // if an OOM occurs, the function returns NULL // immediately. Any previously added nodes are left // in place, but relevant functions have been written // to ignore them. if (!new_t) { fprintf(stderr, "error: " "internal_trie_insert: out of memory.\n"); return NULL; } new_t->key = *k; new_t->val = NULL; new_t->sibling = t->sibling; t->sibling = new_t; t = new_t; } ++k; // move down one level if ((!t->child) || t->child->key > *k) { struct trie *new_t = malloc(sizeof(struct trie)); if (!new_t) { fprintf(stderr, "error: " "internal_trie_insert: out of memory.\n"); return NULL; } new_t->key = *k; new_t->val = NULL; new_t->sibling = t->child; t->child = new_t; } t = t->child; } if (t->val && f) { (*f)(t->val); } t->val = val; return orig_t; } struct trie* trie_remove(struct trie *t, const char *k){ return internal_trie_remove(t, k, NULL); } struct trie* trie_fremove(struct trie *t, const char *k, void (*f)(void *val)){ return internal_trie_remove(t, k, f); } struct trie* internal_trie_remove(struct trie *t, const char *k, void (*f)(void *val)){ // leaf node case if (!(*k || t->key)) { if (f && t->val) { (*f)(t->val); } struct trie *ret = t->sibling; free(t); return ret; } // internal node case struct trie *ret, *prev_t; ret = prev_t = t; for (; t && (t->key < *k); prev_t = t, t = t->sibling); if (t && (t->key == *k) && t->child && !(t->child = internal_trie_remove(t->child, k+1, f))) { prev_t->sibling = t->sibling; // have (prev_t == t) here if the first node in the present // layer has no children, in which case we connect its // sibling with its parent layer instead (in all other cases // the connection to parent layer is unchanged). if (prev_t == t) { ret = t->sibling; } free(t); } return ret; } void* trie_seek(struct trie *t, const char *k){ while (t && *k) { for (; t && (t->key < *k); t = t->sibling); if (t && (t->key == *k)) { t = t->child; ++k; } else { return NULL; } } return t->val; } struct vector* trie_match(struct trie *t, const char *k, struct vector *v){ if (!v) { /*v = vector_init();*/ } if (!v) { fprintf(stderr, "error: trie_match: invalid vector.\n"); return NULL; } while (t && *k) { for (; t && (t->key < *k); t = t->sibling); if (t && (t->key == *k)) { t = t->child; ++k; } else { return v; } } if (!t) { return v; } return internal_trie_match(t, v); } struct vector* trie_list(struct trie *t, struct vector *v){ return trie_match(t, "", v); } struct vector* internal_trie_match(struct trie *t, struct vector *v){ if (!v) { return NULL; } // v will be edited only if vector_push fails, in which case v becomes // NULL, and further iteration below in this function does not happen. // this value propagates to the caller and signals error. if (t->val) { /*v = vector_push(v, t->val);*/ } if (t->child) { internal_trie_match(t->child, v); } if (t->sibling) { internal_trie_match(t->sibling, v); } return v; } void trie_free(struct trie *t){ internal_trie_free(t, NULL); } void trie_ffree(struct trie *t, void (*f)(void *val)){ internal_trie_free(t, f); } void internal_trie_free(struct trie *t, void (*f)(void *val)){ if (t->child) { internal_trie_free(t->child, f); } if (t->sibling) { internal_trie_free(t->sibling, f); } if (t->val && f) { (*f)(t->val); } free(t); } void trie_dump(struct trie *t){ internal_trie_dump(t, 0, 0); } void trie_dumps(struct trie *t){ internal_trie_dump(t, 0, 1); } void internal_trie_dump(struct trie *t, int depth, int s){ for (int i = 0; i < depth+1; ++i) { putchar('-'); } putchar(t->key); if (t->val && s) { printf(": %s (%p)", (char*)t->val, t->val); } else if (t->val) { printf(": %p", t->val); } printf("\n"); if (t->child) { internal_trie_dump(t->child, depth+1, s); } if (t->sibling) { internal_trie_dump(t->sibling, depth, s); } }
test.c
#include <stdio.h> #include "trie.h" int main(int argc, char **argv){ struct trie *t = trie_init(); trie_finsert(t, "to", "ONE", 4, &free); trie_finsert(t, "tea", "TWO", 4, &free); trie_finsert(t, "ted", "THREE", 6, &free); trie_finsert(t, "ten", "FOUR", 5, &free); trie_finsert(t, "i", "FIVE", 5, &free); trie_finsert(t, "in", "SIX", 4, &free); trie_finsert(t, "inn", "SEVEN", 6, &free); trie_finsert(t, "ten", "EIGHT", 6, &free); trie_dumps(t); trie_fremove(t, "ted", &free); trie_fremove(t, "in", &free); putchar('\n'); trie_dumps(t); trie_ffree(t, &free); return 0; }
output of test.c
- -i --: FIVE (0x21f32a0) --n ---: SIX (0x21f3320) ---n ----: SEVEN (0x21f33a0) -t --e ---a ----: TWO (0x21f30f0) ---d ----: THREE (0x21f31a0) ---n ----: EIGHT (0x21f3420) --o ---: ONE (0x21f3040) - -i --: FIVE (0x21f32a0) --n ---n ----: SEVEN (0x21f33a0) -t --e ---a ----: TWO (0x21f30f0) ---n ----: EIGHT (0x21f3420) --o ---: ONE (0x21f3040)