9
\$\begingroup\$

I'm implementing a trie structure for a library, as an exercise in data structures.

Details

The tree structure is represented as a doubly chained tree. A struct trie represents a node of the trie, and contains the following:

  • key is the character represented by the node. Since keys are expected to be \0-terminated strings, a node with a key value of \0 is a leaf node.
  • sibling is the next sibling of the node.
  • child is the first child of the node.
  • val at a leaf node is the value associated with the key string that terminates at that node. This is expected to be NULL on internal nodes.

The following methods are implemented:

  • trie_init initialises and returns a trie.

  • trie_insert inserts value val into trie t with key k.

  • trie_finsert makes a copy of the first size bytes after val to some new_val, then adds new_val to trie t with key k. If there is already a value associated with k, a destructor function f is called on the old value before it is overwritten.

  • trie_remove removes the value associated with key k from trie t.

  • trie_fremove calls f on the value associated with key k in trie t before removing it.

  • trie_seek searches trie t for a value associated with key k. Returns it if it is found, returns NULL otherwise.

  • trie_match searches trie t for all values associated with keys that have prefix k, and appends them to a vector v in alphabetic order (for instance, doing trie_match(t, “te”, v) on the trie in the example as it appears just before it is destroyed will add ”TWO” and ”EIGHT” to v, in that order.)

  • trie_list is just shorthand for trie_match(t, “”, v) and appends all the elements in t to v.

  • trie_dump prints the structure of the trie.

  • trie_dumps prints the structure of the trie, and also prints the values interpreting them as pointers to \0-terminated strings.

  • trie_free destroys a trie.

  • trie_ffree calls f on each of the values in the trie and then destroys the trie.

Please note that the vector operations in trie_match and trie_list depend on another part of the same library in the project. As it is not crucial most of the issues regarding which I wish to seek advice in this question, I’ve decided to comment them out in the code presented here.


Questions

Aside from critique on general style/practices, I wish to seek advice specifically on the following topics:

Interface design(1)

Most of the functions return a pointer to the trie upon completion, or NULL on error. I don't have very compelling reasons to do this - it was originally because the recursive versions of some functions returned the subtree they were operating on, and they were kept after I put some of it into loops. This has the following effects:

  • Allows for usage such as trie_insert(trie_insert(t, "str1", val1), "str2", val2), but doing that may not be a very good idea since one cannot check for errors that occur some layers down.

  • Possibly encourages potentially misleading expressions such as t1 = trie_sinert(t2, "str", val), which looks like t1 being a copy of t2 with an extra key-value pair added, but instead it just mutates t2 and points t1 to it.

Would it then be preferable to return, for instance, integer constants that indicate success/failure?

Furthermore, in some cases, signalling errors by returning NULL looks questionable: for instance, cases in which one wishes to distinguish between the cases of having a key-value pair "str"-NULL and having no value mapped to by "str". In both cases the return value for "trie_seek(t, "str") is the same. Would the inability to make that distinction be a concern?

Interface design(2)

Currently two sets of functions are provided, as described above.

Originally, the intention was for the caller to choose to either always use functions from one or the other set on the same trie, depending on circumstances (for instance, they should use trie_insert when they are using a trie to index a collection of structures, the contents of which expecting to be edited over time; also, when using many tries to maintain multiple ways to index/search through a single set of values.)

However, there is no way to prevent a caller from mixing the two families of functions on the same trie. I can’t think of any reason a caller would want to do this, but in the case that they do, neither trie_free or trie_ffree would be a proper destructor of the trie (one would be a segfault and the other a memory leak). This is obviously not very elegant, and I suspect introducing checks for this purpose would just make it messier.

I’ve been considering keeping only the f-prefixed set of functions. In cases where the caller would prefer to work with references, where they would’ve done trie_insert(t, “str”, val) they would instead do trie_finsert(t, “str”, &val, sizeof(val), &free). This seems to solve many of the aforementioned questions, the single obvious drawback being somewhat cumbersome syntax and the need to juggle pointers to pointers a lot.

Would either the current interface, or the suggested alternative, be a sensible design?

Error management

the code includes some snippets that (attempt to) handle out-of-memory cases, mostly a check of return values of functions that allocate memory, an in case of failure, printing to stderr, doing some cleanup, then returning a suitable value to indicate that some operation has failed. In its present state it doesn’t seem to interfere with readability that much, but I worry it may become undesirable once I perhaps move on to more complicated structures.

I wish to know whether these OOM-checks conform to common practices, and whether there will perhaps be better ways to do them.


trie.h

#ifndef DATAH_TRIE_H #define DATAH_TRIE_H #include <stdlib.h> //#include "vector.h" struct trie { char key; void *val; struct trie *sibling; struct trie *child; }; struct trie* trie_init(void); struct trie* internal_trie_insert(struct trie *t, const char *k, void *val, void (*f)(void *val)); struct trie* trie_insert(struct trie *t, const char *k, void *val); struct trie* trie_finsert(struct trie *t, const char *k, const void *val, size_t size, void (*f)(void *val)); struct trie* trie_remove(struct trie *t, const char *k); struct trie* trie_fremove(struct trie *t, const char *k, void (*f)(void *val)); struct trie* internal_trie_remove(struct trie *t, const char *k, void (*f)(void *val)); void* trie_seek(struct trie *t, const char *k); struct vector* trie_match(struct trie *t, const char *k, struct vector *v); struct vector* trie_list(struct trie *t, struct vector *v); struct vector* internal_trie_match(struct trie *t, struct vector *v); void trie_free(struct trie *t); void trie_ffree(struct trie *t, void (*f)(void *val)); void internal_trie_free(struct trie *t, void (*f)(void *val)); void trie_dump(struct trie *t); void internal_trie_dump(struct trie *t, int depth, int s); void trie_dumps(struct trie *t); #endif 

trie.c

#include <stdio.h> #include <stdlib.h> #include <string.h> #include "trie.h" //#include "vector.h" struct trie* trie_init(void) { struct trie *t = malloc(sizeof(struct trie)); if (!t) { fprintf(stderr, "error: trie_init: out of memory.\n"); return NULL; } t->key = 0; t->val = NULL; t->sibling = t->child = NULL; return t; } struct trie* trie_insert(struct trie *t, const char *k, void *val){ return internal_trie_insert(t, k, val, NULL); } struct trie* trie_finsert(struct trie *t, const char *k, const void *val, size_t size, void (*f)(void *val)){ void *new_val = malloc(size); if (!new_val) { fprintf(stderr, "error: trie_finsert: out of memory.\n"); return NULL; } memcpy(new_val, val, size); return internal_trie_insert(t, k, new_val, f); } struct trie* internal_trie_insert(struct trie *t, const char *k, void *val, void (*f)(void *val)){ struct trie *orig_t = t; while (*k) { // move along the sibling chain for (; t->sibling && (t->sibling->key <= *k); t = t->sibling); if (t->key != *k) { struct trie *new_t = malloc(sizeof(struct trie)); // if an OOM occurs, the function returns NULL // immediately. Any previously added nodes are left // in place, but relevant functions have been written // to ignore them. if (!new_t) { fprintf(stderr, "error: " "internal_trie_insert: out of memory.\n"); return NULL; } new_t->key = *k; new_t->val = NULL; new_t->sibling = t->sibling; t->sibling = new_t; t = new_t; } ++k; // move down one level if ((!t->child) || t->child->key > *k) { struct trie *new_t = malloc(sizeof(struct trie)); if (!new_t) { fprintf(stderr, "error: " "internal_trie_insert: out of memory.\n"); return NULL; } new_t->key = *k; new_t->val = NULL; new_t->sibling = t->child; t->child = new_t; } t = t->child; } if (t->val && f) { (*f)(t->val); } t->val = val; return orig_t; } struct trie* trie_remove(struct trie *t, const char *k){ return internal_trie_remove(t, k, NULL); } struct trie* trie_fremove(struct trie *t, const char *k, void (*f)(void *val)){ return internal_trie_remove(t, k, f); } struct trie* internal_trie_remove(struct trie *t, const char *k, void (*f)(void *val)){ // leaf node case if (!(*k || t->key)) { if (f && t->val) { (*f)(t->val); } struct trie *ret = t->sibling; free(t); return ret; } // internal node case struct trie *ret, *prev_t; ret = prev_t = t; for (; t && (t->key < *k); prev_t = t, t = t->sibling); if (t && (t->key == *k) && t->child && !(t->child = internal_trie_remove(t->child, k+1, f))) { prev_t->sibling = t->sibling; // have (prev_t == t) here if the first node in the present // layer has no children, in which case we connect its // sibling with its parent layer instead (in all other cases // the connection to parent layer is unchanged). if (prev_t == t) { ret = t->sibling; } free(t); } return ret; } void* trie_seek(struct trie *t, const char *k){ while (t && *k) { for (; t && (t->key < *k); t = t->sibling); if (t && (t->key == *k)) { t = t->child; ++k; } else { return NULL; } } return t->val; } struct vector* trie_match(struct trie *t, const char *k, struct vector *v){ if (!v) { /*v = vector_init();*/ } if (!v) { fprintf(stderr, "error: trie_match: invalid vector.\n"); return NULL; } while (t && *k) { for (; t && (t->key < *k); t = t->sibling); if (t && (t->key == *k)) { t = t->child; ++k; } else { return v; } } if (!t) { return v; } return internal_trie_match(t, v); } struct vector* trie_list(struct trie *t, struct vector *v){ return trie_match(t, "", v); } struct vector* internal_trie_match(struct trie *t, struct vector *v){ if (!v) { return NULL; } // v will be edited only if vector_push fails, in which case v becomes // NULL, and further iteration below in this function does not happen. // this value propagates to the caller and signals error. if (t->val) { /*v = vector_push(v, t->val);*/ } if (t->child) { internal_trie_match(t->child, v); } if (t->sibling) { internal_trie_match(t->sibling, v); } return v; } void trie_free(struct trie *t){ internal_trie_free(t, NULL); } void trie_ffree(struct trie *t, void (*f)(void *val)){ internal_trie_free(t, f); } void internal_trie_free(struct trie *t, void (*f)(void *val)){ if (t->child) { internal_trie_free(t->child, f); } if (t->sibling) { internal_trie_free(t->sibling, f); } if (t->val && f) { (*f)(t->val); } free(t); } void trie_dump(struct trie *t){ internal_trie_dump(t, 0, 0); } void trie_dumps(struct trie *t){ internal_trie_dump(t, 0, 1); } void internal_trie_dump(struct trie *t, int depth, int s){ for (int i = 0; i < depth+1; ++i) { putchar('-'); } putchar(t->key); if (t->val && s) { printf(": %s (%p)", (char*)t->val, t->val); } else if (t->val) { printf(": %p", t->val); } printf("\n"); if (t->child) { internal_trie_dump(t->child, depth+1, s); } if (t->sibling) { internal_trie_dump(t->sibling, depth, s); } } 

test.c

#include <stdio.h> #include "trie.h" int main(int argc, char **argv){ struct trie *t = trie_init(); trie_finsert(t, "to", "ONE", 4, &free); trie_finsert(t, "tea", "TWO", 4, &free); trie_finsert(t, "ted", "THREE", 6, &free); trie_finsert(t, "ten", "FOUR", 5, &free); trie_finsert(t, "i", "FIVE", 5, &free); trie_finsert(t, "in", "SIX", 4, &free); trie_finsert(t, "inn", "SEVEN", 6, &free); trie_finsert(t, "ten", "EIGHT", 6, &free); trie_dumps(t); trie_fremove(t, "ted", &free); trie_fremove(t, "in", &free); putchar('\n'); trie_dumps(t); trie_ffree(t, &free); return 0; } 

output of test.c

- -i --: FIVE (0x21f32a0) --n ---: SIX (0x21f3320) ---n ----: SEVEN (0x21f33a0) -t --e ---a ----: TWO (0x21f30f0) ---d ----: THREE (0x21f31a0) ---n ----: EIGHT (0x21f3420) --o ---: ONE (0x21f3040) - -i --: FIVE (0x21f32a0) --n ---n ----: SEVEN (0x21f33a0) -t --e ---a ----: TWO (0x21f30f0) ---n ----: EIGHT (0x21f3420) --o ---: ONE (0x21f3040) 
\$\endgroup\$

    1 Answer 1

    2
    \$\begingroup\$

    I have some observations that may help you improve your program.

    Initialize all members

    The trie_insert function does not initialize the child pointer. This is an error because later functions test that member for NULL and if the value hadn't been explicitly set, there's no reason to believe that uninitialized members will have any particular value. I'd recommend creating a function that does this, such as trie_new(char key, struct trie *sibling); That makes a few of the other suggestions below easier to implement as well.

    Here's how it might be implemented with trie_init rewritten to use it:

    struct trie* trie_new(char k, struct trie *sibling) { struct trie *t = malloc(sizeof(struct trie)); if (!t) { fprintf(stderr, "error: trie_init: out of memory.\n"); return NULL; } t->key = k; t->val = t->child = NULL; t->sibling = sibling; return t; } struct trie* trie_init(void) { return trie_new('\0', NULL); } 

    Cleanly separate interface from implementation

    The names of the functions such as trie_insert and internal_trie_insert suggest a nice neat separation between interface and implementation, but all of the function prototypes are in the same trie.h header file. Instead, it would probably be better to only put the public functions in the header file. Then, within the .c file, order the functions such that additional functions are not needed and make them static to prevent calling them externally.

    Consider an alternative interface

    While I can understand how passing a free function into the trie might be useful, as for using a custom allocator, I'd suggest that it might be better to pass it in once via the trie_init function rather than various other places in the interface. This would mean that only similarly allocated val types could be accomodated, but I would guess that use of heterogenous collections of val types would be unusual in actual practice (and not completely correctly supported in the current code). In that same vein, it seems unlikely to me that both trie_dump and trie_dumps are useful.

    Handle out of memory conditions consistently

    Usually, this would be the place I point out that malloc could fail and that you should check the return value for NULL, but I'm pleased instead to note that you've done it well and consistently. I don't believe that the code will get too cluttered if the code is structured well.

    Provide complete code for review

    This isn't really so much a criticism as an observation. Because the vector code was excised before posting for review, the remaining functions that use it have probably lost any real meaning or use. For that reason, I'd suggest that either suppling the missing code or completely excising the parts that rely on it would both be reasonable ways to handle this. Although you've explained the rationale in the question (and it's also entirely reasonable) it means that those parts of the code can't really be meaningfully reviewed.

    Omit unused function parameters

    Since argc and argv are unused within main, they can simply be omitted.

    Omit return 0; in main

    Modern C and C++ compilers will automatically supply the equivalent of return 0; at the end of main, so it's not necessary to write it.

    \$\endgroup\$
    3
    • \$\begingroup\$trie_init does do t->sibling = t->child = NULL - that would have initialised both members if I am not mistaken? having the function pointer as a parameter to trie_init appears to be a very neat way to both reduce the cluttered parameter lists and to require callers to only shove homogeneous data into the same trie, and I'm now surprised I didn't think of doing that in the first place.\$\endgroup\$
      – user137130
      CommentedMay 8, 2017 at 13:25
    • \$\begingroup\$Yes, but trie_init is only called once. Every subsequent node inserted will also need all data members initialized. I'd highly recommend using something like the trie_new function I suggested.\$\endgroup\$
      – Edward
      CommentedMay 8, 2017 at 13:27
    • \$\begingroup\$..oh dear, I was somehow under the impression that the nodes in the insert function were added by calls to trie_init instead of individual mallocs and assignments. Sorry about that.\$\endgroup\$
      – user137130
      CommentedMay 8, 2017 at 13:44