1

Background:

I am writing an application for a small embedded device. There is a static list of strings: currently about 500 strings and string length is 12 characters on average. The list might increase in the future. When the application starts running, the list of strings will be parsed once. No more additions, deletions to the list. The search() will be called large number of times.

I need to find if a input string is in the list. Currently, I am sorting and storing the strings in an array. I am using binary search to find the string. Worst case is 0(log n) and space complexity is 0(12*n) (??).

Question:

Is there a better approach? I know hash tables are faster. But I am concerned about extra memory. I can create a hash table of size n. But I would need a Perfect hash function. Any suggestions on how to come up with such a hash function? Is it just trial and error? Any other data structure that could be better?

3
  • This sounds like some kind of configuration. What are allowed values? Maybe consider an enum for indexes and a union for values?
    – jaskij
    CommentedAug 3, 2018 at 22:15
  • suggest writing your code using a 'brute force' method (lots of looping and calls to strcmp() Post that code, Then we can advise on methods of improving it.CommentedAug 4, 2018 at 21:52
  • "better" can be only answered when knowing your performance requirements. Given the described parameters, your binary search could be already overengineered, and a simple linear search may be the better alternative (because it is less error prone, less code to maintain, easier to be extended etc.) If you really need a faster algo, the don't write "better", write "faster", and tell us what performance you measured and how much faster you need it.
    – Doc Brown
    CommentedAug 5, 2018 at 14:16

1 Answer 1

4

Your binary search is probably good enough. Alternative approaches will lead to a lot more complexity, for likely little gain.

There are a couple of optimization approaches that you can try:

  • A trie data structure might incur significant space overhead, but will allow you to determine set membership in an O(k) operation (with k the length of the input).

  • Instead of performing a binary search over all possible strings, you can partition them using a cheap hash function. If the data is appropriately distributed, using the first character as a hash function might be sufficient.

  • If you expect that many strings will not be in the set to be tested, you can use a fast approximate membership test such as a bloom filter. A bloom filter consists of multiple hash functions and a bit vector. All members of the set are hashed with each hash function, and the bit indicated by the hash function is set. To test membership you hash the input string and check whether the corresponding bits are set. If not, the input is not in the set. If the bits are set, the input may be in the set and you'll have to fall back to an exact membership test.

    An attractive property of bloom filters is that their size and computational overhead is completely configurable, but these properties also affect the quality of the filter.

Of course you can mix and match approaches.

Perfect hash functions sound nice, but are not frequently used in practice: constructing the hash function is tricky, and good-enough alternatives exist for many use cases.

3
  • Regarding Bullet #2, using the length of the String as a "cheap hash" might also be appropriate.CommentedAug 4, 2018 at 1:12
  • 1
    @user949300 depends on the data. Early PHP did that internally. For many data sets I'd expect the length to cluster around a mean value which does not result in a good distribution. If the strings are long and string length is determined through an O(n) delimiter search like strlen(), then a real fast hash function won't be much slower either. E.g. FNV-1a has around 6 instructions per word (incl. one integer multiplication), a simple strlen() can be written in 3 instructions per word (though possibly better with SIMD instructions).
    – amon
    CommentedAug 5, 2018 at 14:35
  • I missed the "C" tag, where strlen is an O(n). You are correct that it may not be a great hash.CommentedAug 6, 2018 at 6:01

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.