
First, this question is not really about binary search, as I have neither sorted data nor any sortable data at all. :-)

W.r.t. the "unsortable" claim, see below; but I think the title term "unsortable" is important here because, AFAICT, any sorting would require visiting all "nodes", which is exactly what I strive to avoid.

Question: Provide terms for the algorithm/algorithmic search approach taken below.

Long-winded intro by example: What I have is, essentially, a list of names, and I need to find all "valid" names. However, to determine the valid names, I must query a third-party system that, in terms of this simplified problem, has only one interface function:

```
// Remote third-party interface.
// Returns true if ANY name in the given list is invalid.
// [names] may be a list of 1..n strings.
bool has_invalid_names(List names)
```

So, given my list l := [a, b, c, ...], I need to collect all valid names, i.e. all i for which has_invalid_names([i]) == false.

The easiest approach is obviously a linear search, but that means calling has_invalid_names([i]) n times, which can get prohibitively expensive, as each remote call carries quite some overhead.
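For concreteness, the naive baseline can be sketched like this (Python; `has_invalid_names` is passed in as a stand-in for the remote call, not the real client):

```python
def find_valid_linear(names, has_invalid_names):
    """Naive baseline: one remote call per name, i.e. n calls total."""
    return [name for name in names if not has_invalid_names([name])]
```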

Given that we expect only very few invalid names in the initial list, we intuitively came up with the idea of slicing the list into halves and searching over those:

a) Split the list in half.
b) For each half, check whether it contains any invalid names.
c) Rinse and repeat for every slice that reports "invalid", until you are down to calls with single names.

This just builds up a tree of slices of the initial list:

```
            [a, b, c, d, e, f, g*, h]
               /               \
    [a, b, c, d]           [e, f, g*, h]
      /      \               /       \
  [a, b]   [c, d]        [e, f]   [g*, h]
   /  \     /  \          /  \     /   \
 [a]  [b] [c]  [d]      [e]  [f] [g*]  [h]
```

If the number of invalid names is small, only a few paths down this tree have to be explored, requiring only on the order of log2(n) calls per invalid name; in the worst case, where all names are invalid, it would be 2*n - 1 calls.

In the example above, assuming we try sub-lists left to right and skip the root call (since we already know the full list contains an invalid name), we need to call the remote function 6 times to find the "invalid" g*. For larger n, the savings become more significant, as long as only very few items are invalid.
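The whole scheme fits in a few lines of recursion. A sketch, with the oracle again passed in as a function:

```python
def find_invalid(names, has_invalid_names):
    """Recursively bisect `names`, descending only into halves that
    the oracle reports as containing at least one invalid name."""
    if not names or not has_invalid_names(names):
        return []              # slice is empty or entirely valid
    if len(names) == 1:
        return list(names)     # pinned down a single invalid name
    mid = len(names) // 2
    return (find_invalid(names[:mid], has_invalid_names)
            + find_invalid(names[mid:], has_invalid_names))
```

On the eight-name example this makes 7 oracle calls rather than 6, because it also tests the root slice before splitting.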

So, I have a solution to this problem, but what is this stuff called?

  • Is there any specific name for splitting up a dataset into a tree of slices?
  • What could the search algorithm / parts of it be called?

    It seems like a kind of unsorted binary tree search, where a predicate is applied to each node to decide which sub-trees to follow, but I'm really weak on the whole algorithms & data structures zoo. :-)

    Mechanically, it seems more in the direction of what map, filter (, reduce) achieve, with a more complex map function.

Put another way: if you wanted to solve this problem, which libraries, existing algorithms, and data structures would you use? Note: this whole question was triggered by the fact that I'm lazy and didn't want to code up all that slicing, tree'ing, and possibly recursing myself, but the Wide Net didn't indulge me with anything prefabricated. :-)

  • It is usually called "divide and conquer". I believe your worst-case analysis is incorrect: if all n are invalid, you still have to make n service calls at the bottom of the tree, and then n/2 calls at the next level up, etc., for a total of about 2n service calls. – Commented Jul 6, 2018 at 22:03

1 Answer


This can be interpreted as a binary search for invalid elements. It is a kind of divide-and-conquer strategy.

You would typically not use any special data structures or algorithms for this, as recursion is entirely sufficient. The tree you are seeing is merely encoded as the call graph, and never has to exist explicitly. Given a split() operation that divides an array in half, we can find all invalid elements:

```
find_invalid_elements(input: array): array
    return [] if input is empty
    return [] if not has_invalid_element(input)
    return input if input size = 1
    left, right := split(input)
    return find_invalid_elements(left) + find_invalid_elements(right)
```

If the has_invalid_element() oracle is the only available source of information, then I think this is optimal. A much more useful oracle would give you the number of invalid elements, as we can then test multiple overlapping partitions, instead of recursing into smaller and smaller slices.
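To illustrate that last point: with a hypothetical `count_invalid()` oracle that returns how many names in a slice are invalid, each recursion level only needs to query one half, since the other half's count follows by subtraction, and a slice whose count equals its length needs no further calls at all. A sketch under that assumption:

```python
def find_invalid_counted(names, count_invalid):
    """Variant assuming a richer (hypothetical) oracle that returns the
    NUMBER of invalid names in a slice."""
    def go(slice_, k):
        # k = number of invalid names known to be in slice_
        if k == 0:
            return []                 # nothing invalid here, no call needed
        if len(slice_) == k:
            return list(slice_)       # everything here is invalid
        mid = len(slice_) // 2
        left_k = count_invalid(slice_[:mid])
        return go(slice_[:mid], left_k) + go(slice_[mid:], k - left_k)
    return go(names, count_invalid(names))
```

With two invalid names in the eight-name example, this needs 6 oracle calls, where the boolean oracle would need noticeably more.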
