Merging sorted integer arrays (without duplicates)

Question

I have written a Python function called merge_arrays which takes two lists and returns the new list merge without duplicate.
The function works as expected, but my code very complicated, and I'm looking for ways to optimize it.
Below is my current implementation:

def update_if_new(d, results, element): if element not in d: results.append(element) d[element] = True def merge_arrays(first, second): seen = dict() results = [] i, j, m, n = 0, 0, len(first), len(second) while i < m and j < n: x, y = first[i], second[j] if x < y: update_if_new(seen, results, x) i += 1 else: update_if_new(seen, results, y) j += 1 for residual in first[i:] + second[j:]: update_if_new(seen, results, residual) return results print(merge_arrays([1,2,3],[5,7,1])) #[1, 2, 3, 5, 7]

I'm seeking suggestions for improvements in the code to make it more efficient and reduce the execution time.
Any tips or alternative solutions would be greatly appreciated.

J_H · Accepted Answer · 2023-10-30 19:47:04Z

This is a two-way merge.

preconditions

merge_arrays([1, 2, 3], [5, 7, 1])

I was a little surprised to see that non-monotonic 1 tacked on at the end.

Recommend that you insist the caller presents only sorted( ... ) arrays, or that you call timsort yourself, perhaps before calling a private helper.

unique

Itertools is happy to help you discard repeated entries from sorted input. It costs O(N) linear time and no additional storage.

Or perhaps you'd rather spend O(N) storage to construct e.g. set(first), at similar cost to the seen dict you've been maintaining.

Consider writing a two-way merge generator which will yield monotonic array entries.

use builtins

The simplest solution would look like this.

def merge_arrays(first: list[int], second: list[int]) -> list[int]: return sorted(set(first + second))

Depending on details of your data and required performance, you might want a few tweaks. For example, initially de-dup'ing two separate sets and then combining those smaller results, destructively running .sort() on caller's list in-place, that sort of thing.

It is not necessary to specify types in Python because Python is not a statically typed language — TAHER El Mehdi, CommentedOct 30, 2023 at 19:49
That's true. But it is a kindness to the Gentle Reader, offering a clue about what will come back. We might accept list[tuple] instead. Or even list, though that is vague enough that it compels us to write a """docstring""" explaining what caller should expect. Strictly all we need for this implementation is an Iterable, though that precludes certain techniques such as doing a destructive .sort() in-place. — J_H, CommentedOct 30, 2023 at 19:52
If you throw all the elements into a set, you don't need sorted inputs. (But set itself is unordered, and dict preserves insertion order rather than using keys to order, unlike C++ std::map (typically a red-black tree)). To avoid doing the full work of sorting, you could use a set to check for and remove duplicates while doing a merge. But then you'd probably have to loop over elements one at a time, so probably slower in CPython than what you're doing with set() and sort(). — Peter Cordes, CommentedOct 31, 2023 at 10:15
I'd say that list[int] is so strict type here that it should be either omitted (in simple code) or replaced with real constraints (Sequence of typevar bound to a protocol with __hash__ and __lt__ OR __gt__, similar to definition of sorted in typeshed, but with additional __hash__ requirement). Why not let this function merge list of strings or list of tuples? If you replace list concatenation with set union, you can also have Iterable instead of Sequence. — STerliakov, CommentedNov 1, 2023 at 0:49

Matthieu M. · Accepted Answer · 2023-11-01 15:30:36Z

The suggestions by @J_H are excellent, as usual.

From an algorithmic perspective, though, a two-way merge is a useful exercise, so I think it is valuable to keep to the original algorithm.

Be clear about the functionality

The name merge_arrays fails to carry essential information:

That the inputs should be sorted.
That the outputs will then be sorted.
That the outputs will be deduplicated.

Be clear about the input/output

Type annotations are the modern way, a docstring is always useful regardless. In any case, it's best to be clear about what goes in and what comes out.

Your use of indexes requires a list (or similar) as input, and you return a list.

Lazy wins the day

It is a frequent task to have to return the first n elements of something, or the n-th smallest element.

Using your function for such a task will result in building a list of all elements, and then extracting the first n or looking at the n-th. It's rather inefficient.

In general, it is best to be lazy:

Do not do work that the caller may not care for.
Do not prematurely allocate a giant data-structure that the caller may not care for.

In this case, returning a generator would be best.

Lazy wins the day (bis)

Similarly, forcing the user to construct two lists just so they can be fed into the algorithm which will only pick one element at a time is overkill.

Ideally, you would accept iterables instead.

Deduplication does not require a set.

Since the inputs are sorted, knowing whether an element should be appended to the array only requires looking at the last element of the array.

Deduplication is orthogonal

Your function fails the Single Responsibility Principle:

It performs a two-way merge.
It also deduplicates.

It is best to split the two operations into two functions, letting the caller decide whether they want deduplication, or not.

Putting it all together

def dedup_sorted(input): """Removes duplicates from a sorted iterable Returns an iterable of sorted elements, without duplicates. """ input = iter(input) try: last = next(input) yield last except StopIteration: return for e in input: if e == last: continue last = e yield e def merge_sorted(left, right): """Merges two sorted iterables into one sorted iterable Duplicates are preserved. No guarantee is made as to the order in which duplicates are yielded. """ sentinel = object() left = iter(left) right = iter(right) x = next(left, sentinel) y = next(right, sentinel) while x is not sentinel and y is not sentinel: if x <= y: yield x x = next(left, sentinel) continue yield y y = next(right, sentinel) if x is not sentinel: yield x for x in left: yield x if y is not sentinel: yield y for y in right: yield y

The two functions are algorithmically optimal, performing a single pass over the data -- hence guaranteed O(N) -- and no more than N comparisons on the elements.

Furthermore, they can be chained together at minimal cost, so that your original function is equivalent to:

list(dedup_sorted(merged_sorted(_, _)))

dedup_sorted([None]) returns an empty list. It's probably better to instantiate last with a unique sentinel like object(). — JAD, CommentedOct 31, 2023 at 13:30
@JAD: Thanks for the tip! Could the same be applied to merge_sorted? — Matthieu M., CommentedOct 31, 2023 at 14:05
@MatthieuM: given that None < 3 throws an exception that '<' is not supported between instances of NoneType and int and that merge_sorted implies sortability, I'd argue that None is acceptable to use and that the type should be something like Iterable of numbers which precludes [as much as python types actually preclude] passing None (or a whole lot of other things that also aren't sortable in python, e.g. ['3', 3] — Foon, CommentedOct 31, 2023 at 14:33
FYI, instead of using None, here's the standard idiom for creating a sentinel which your caller definitely won't be passing in to you: sentinel = object(). The object allocator guarantees us a unique address. And then we can test whether x is sentinel. — J_H, CommentedOct 31, 2023 at 15:27
@IlmariKaronen: The OP's code does not seem to assume that the inputs are deduplicated. And yes, there's a minor efficiency gain in doing everything at once... I'd rather focus on correctness and composability first and only address minor efficiency concerns if they prove worth it. Remember Knuth, "Premature Optimization is the Root of All Evil". — Matthieu M., CommentedNov 1, 2023 at 9:07

greybeard · Accepted Answer · 2023-11-01 08:53:57Z

The code does not document what's expected of the input:
(ascendantly) ordered lists (mentioned in the title, only).

Names not to be used externally (especially "global" ones) should begin with an underscore: _update_if_new().
Which looks introduced to avoid some potential code duplication -
I think this can be achieved more simply.

 while i < m and j < n: x, y = first[i], second[j] if x < y: element = x i += 1 else: element = y j += 1 if element not in s: results.append(element) s.add(element) # d[element] = True

(Has the code from update_if_new() been duplicated in merge_arrays() originally, indented 2 places as indentation grew unwieldy?)

As the sequences are required to be ordered, you don't need a set for uniqueness - using a dict is uncalled for logically.
Without type hinting, borrowing liberally from the notably clean code in this answer by AJNeufeld:

def merge_arrays(first, second): """ Return a new list containing in non-decreasing order the values from both non-decreasing sequences first & second, except values from second already occurring in first. """ try: left_iter = iter(first) left = next(left_iter) except: return list(right) try: right_iter = iter(second) right = next(right_iter) except: return list(left) # ascending = [] try: while True: if left < right: ascending.append(left) left = next(left_iter) else: if left != right: # don't append equal "right" values ascending.append(right) right = next(right_iter) except StopIteration: if left > right: ascending.append(left) ascending.extend(left_iter) else: ascending.append(right) ascending.extend(right_iter) # return ascending if __name__ == "__main__": help(merge_arrays) a, b = [1,3,3,4,5,7], [1,2,3,5,6,7] print("merging", a, "and", b, "- note in-list duplication") print(merge_arrays(a, b))

(One problem with stating items with unique key values from multiple sources is it does not specify what to do with keys occurring in more than one source:

exclude them entirely?
include exactly one?
Which?)

Attila Vajda · Accepted Answer · 2023-11-01 16:36:07Z

How to find the fastest existing algorithm for sorting and merging?

Seeing how sorted() and set() in Python are implemented, and perhaps rephrasing them might be a good exercise. Visualise, play, investigate.

You can also use existing functions, they make your code stronger. Perhaps refactoring these implementations might be something to look into.

sorted(set(first + second))

sorted(set(first).union(second))

What is the purpose of your optimisation? Is this is an exercise in optimisation algorithm writing?

I don't know if the following is a good solution or not:

>>> a = [3, 2, 1] >>> b = [5, 3, 2] >>> sorted(*[({*a}|{*b}]) [1, 2, 3, 5]

Using only set() seems to sort a few integers, but it doesn't sort a high number of elements.

(I' expect set(a) rather than {x for x in a} or {*a}.) — greybeard, CommentedNov 1, 2023 at 8:18
Thank you @greybeard, I thought ` merged = [x for x in a + b if x not in merged]` would work, but it does not. set(a) seems more readable, as {} can be mistaken for dictionary notation. — Attila Vajda, CommentedNov 1, 2023 at 13:34
Note that none of the examples here sort the numbers in the output list. A set is not sorted, so [*set] is not sorted. For small integers it seems sorted, but if you put larger numbers in, you'll notice it's not. — Cris Luengo, CommentedNov 1, 2023 at 15:02

Stack Exchange Network

Merging sorted integer arrays (without duplicates)

4 Answers 4

preconditions

unique

use builtins

Be clear about the functionality

Be clear about the input/output

Lazy wins the day

Lazy wins the day (bis)

Deduplication does not require a set.

Deduplication is orthogonal

Putting it all together

Linked

Hot Network Questions

Merging sorted integer arrays (without duplicates)

4 Answers 4

preconditions

unique

use builtins

Be clear about the functionality

Be clear about the input/output

Lazy wins the day

Lazy wins the day (bis)

Deduplication does not require a set.

Deduplication is orthogonal

Putting it all together

Linked

Related

Hot Network Questions