The suggestions by @J_H are excellent, as usual.
From an algorithmic perspective, though, a two-way merge is a useful exercise, so I think it is valuable to keep to the original algorithm.
Be clear about the functionality
The name merge_arrays
fails to carry essential information:
- That the inputs should be sorted.
- That the outputs will then be sorted.
- That the outputs will be deduplicated.
Be clear about the input/output
Type annotations are the modern way, a docstring is always useful regardless. In any case, it's best to be clear about what goes in and what comes out.
Your use of indexes requires a list
(or similar) as input, and you return a list.
Lazy wins the day
It is a frequent task to have to return the first n elements of something, or the n-th smallest element.
Using your function for such a task will result in building a list of all elements, and then extracting the first n or looking at the n-th. It's rather inefficient.
In general, it is best to be lazy:
- Do not do work that the caller may not care for.
- Do not prematurely allocate a giant data-structure that the caller may not care for.
In this case, returning a generator would be best.
Lazy wins the day (bis)
Similarly, forcing the user to construct two lists just so they can be fed into the algorithm which will only pick one element at a time is overkill.
Ideally, you would accept iterables instead.
Deduplication does not require a set.
Since the inputs are sorted, knowing whether an element should be appended to the array only requires looking at the last element of the array.
Deduplication is orthogonal
Your function fails the Single Responsibility Principle:
- It performs a two-way merge.
- It also deduplicates.
It is best to split the two operations into two functions, letting the caller decide whether they want deduplication, or not.
Putting it all together
def dedup_sorted(input): """Removes duplicates from a sorted iterable Returns an iterable of sorted elements, without duplicates. """ input = iter(input) try: last = next(input) yield last except StopIteration: return for e in input: if e == last: continue last = e yield e def merge_sorted(left, right): """Merges two sorted iterables into one sorted iterable Duplicates are preserved. No guarantee is made as to the order in which duplicates are yielded. """ sentinel = object() left = iter(left) right = iter(right) x = next(left, sentinel) y = next(right, sentinel) while x is not sentinel and y is not sentinel: if x <= y: yield x x = next(left, sentinel) continue yield y y = next(right, sentinel) if x is not sentinel: yield x for x in left: yield x if y is not sentinel: yield y for y in right: yield y
The two functions are algorithmically optimal, performing a single pass over the data -- hence guaranteed O(N) -- and no more than N comparisons on the elements.
Furthermore, they can be chained together at minimal cost, so that your original function is equivalent to:
list(dedup_sorted(merged_sorted(_, _)))