Python script to split overlapping ranges, version 4

Question

This is the fourth iteration of a Python script I wrote that splits overlapping ranges, and this is the fastest version I have wrote so far, and also a version that works for all inputs. I have done rigorous testing using randomly generated samples and compared the output against the output of a method that is known to be correct, and there are no errors.

I have a literally millions of triplets, the first two elements of each are integers, the third is of arbitrary type. The first number is always no larger than the second number, each triplet represents a integer range (inclusive) with associated data, the two numbers are start and end of the range, and the third element is the data that is associated with the range.

Each triplet (start, end, data) is equivalent to a dictionary, where the keys are all integers from start to end (including both), and the value for all the keys is data, for example, (42, 50, 'A') is equivalent to {42: 'A', 43: 'A', 44: 'A', 45: 'A', 46: 'A', 47: 'A', 48: 'A', 49: 'A', 50: 'A'}.

Now the ranges often overlap, and I wish to split the ranges so that all ranges are discrete and all ranges only have one datum which is the latest, and the number of ranges is minimal.

It is important to note if two ranges (s1, e1, d1) and (s2, e2, d2) overlap, one is always completely inside the other, one is always a complete subset, either s1 <= s2 <= e2 < e1 or s2 <= s1 <= e1 < e2, they never intersect in my data.

Two overlapping ranges can share the same data, in this case the smaller range needs to be ignored because there is no new data. For example, (48, 49, 'A') is {48: 'A', 49: 'A'} and it is already contained in my previous example, there is no data change.

If they overlap but have different data, then the smaller range's data overwrites the larger range, the data of the overlap is inherited from the newer one, the overlapping portion of the larger range is deleted, this is similar to dictionary update. For example, (43, 45, 'B') is {43: 'B', 44: 'B', 45: 'B'}, after operation with the first example, the data would be {42: 'A', 43: 'B', 44: 'B', 45: 'B', 46: 'A', 47: 'A', 48: 'A', 49: 'A', 50: 'A'}, and the triplet representation is [(42, 42, 'A'), (43, 45, 'B'), (46, 50, 'A')].

And finally sometimes the start of one range is the end of the previous range plus one, and both share the same data, they need to be joined, the previous range needs to be extended to the end of the new range, for example, given another range (51, 60, 'A'), the expected output for all the ranges given should be [(42, 42, 'A'), (43, 45, 'B'), (46, 60, 'A')].

Code

from collections import deque class Range: processed = deque() stack = deque() def __init__(self, ini, fin, data) -> None: self.ini = ini self.fin = fin self.data = data def __repr__(self): return f'Range({self.ini}, {self.fin}, {self.data})' def cover(self, fin): if fin <= self.fin: return True def join(self, ini, fin): if ini <= self.fin + 1: self.fin = fin return True def prepend(self, ini): if self.ini < ini: Range.processed.append(Range(self.ini, ini - 1, self.data)) def climb(self, fin): if self.ini < (ini := fin + 1) <= self.fin: self.ini = ini def update(self, ini, fin, data): if self.ini == ini and self.fin == fin: self.data = data return True def process(self, ini, fin, data): ignore = past = False if self.data == data: ignore = self.cover(fin) or self.join(ini, fin) elif ini <= self.fin and not (ignore := self.update(ini, fin, data)): self.prepend(ini) if not ignore: if ini > self.fin: Range.processed.append(self) Range.stack.pop() past = True self.climb(fin) return not ignore, past def discretize(ranges): Range.processed.clear() Range.stack.clear() for ini, fin, data in ranges: if not Range.stack: Range.stack.append(Range(ini, fin, data)) else: append, past = Range.stack[-1].process(ini, fin, data) while past and Range.stack: _, past = Range.stack[-1].process(ini, fin, data) if append: Range.stack.append(Range(ini, fin, data)) Range.processed.extend(reversed(Range.stack)) return Range.processed

I did hundred of tests using examples generated using this and compared the result against the method from the answer linked above, and the results were all correct, the script truly does exactly what I intend and there are no bugs.

But it is very important to note the input must be sorted by the start ascending and end descending before processing using my method, though in my case the input is always pre-sorted, and it won't work correctly if the ranges intersect instead of one being subset of the other, this is intentional because in my data this never happens, so I optimized that out.

I did try using Interval Tree as suggested by this answer, but it is extremely slow and doesn't suit my needs:

data = [ (3, 2789, 2681), (27, 2349, 2004), (29, 1790, 2072), (32, 1507, 1173), (35, 768, 3453), (63, 222, 2108), (341, 501, 2938), (378, 444, 258), (870, 888, 1364), (1684, 1693, 880) ]

In [73]: %timeit merge_rows(data) 1.24 ms ± 9.43 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

My method is by far faster, in fact all my previous methods are much faster than that, and the current one is the fastest:

In [76]: %timeit discretize(data) 23.3 µs ± 525 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Given this:

sample = [(2518812, 3931503981, 1222922854), (8283077, 3890282434, 2360189310), (21368257, 3571587114, 2651809994), (24571285, 2875098555, 1506778869), (40433776, 1751972813, 4162881198), (60822897, 1413187009, 921758666), (73435921, 1245663774, 3798627995), (169616863, 530686872, 1316302364), (273065748, 432363205, 3428529648), (291266380, 409571968, 4206669862), (558988640, 574048639, 4272236023), (619240775, 631101725, 1910576108), (727687992, 756241696, 3095538347), (825700858, 866696475, 2647033502), (853058882, 854026525, 2456782176), (967641241, 1001077664, 3556790281)]

The performance of my code is:

In [80]: %timeit discretize(sample) 36.8 µs ± 790 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

The code from the linked Stack Overflow answer is faster than mine, but it looks extremely ugly and I don't like it, it took me a while to understand how it works, and I have tried to optimize and refactor it but found it hard to do, so I wrote another version, but it somehow is a little slower:

In [81]: %timeit descretize(sample) 28.1 µs ± 1.06 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

How can I make my code more efficient?

(I used ini for start and fin for end, to save characters thus making the code more concise, they are from Latin, initus is Latin for start and where you get __init__, finis is Latin for end and where you get "finish")

Update

It turns out my original code doesn't work properly, if a sub range shares the same data as the parent range, it should be ignored, but if the sub range is separated from the parent range by some other range, sometimes the sub range isn't properly ignored.

I only found this out after I have written a clever method to generate a big number of overlapping ranges, as answers are posted I am not allowed to edit the code.

Also the answer by @booboo doesn't work properly for the test cases generated by my new method, I have tested extensively using both the original method and the method optimized by me, they both fail for many test cases, and somehow frequently eats literally Gibibytes of RAM (I have 16 GiB and it uses 8GiB+ in a short time), freezing my computer and I had to reboot my computer.

So there aren't answers that are bounty worthy. If there are no new answers then the bounty has to be wasted.

The function I wrote to generate test cases:

import random def make_sample(num, lim, dat): if num * 2 > lim: lim = num * 2 + random.randrange(num) stack = [sorted(random.choices(range(lim), k=2*num))] ranges = [] while stack: top = stack.pop(0) l = len(top) if l == 2: ranges.append(top) elif l == 4: ranges.extend([top[:2], top[2:]]) else: choice = random.randrange(3) if not choice: ranges.append([top[0], top[-1]]) stack.append(top[1:-1]) elif choice == 1: i = l//3 i += i % 2 index = random.randrange(i, 2*i, 2) ranges.append(top[index:index+2]) stack.extend([top[:index], top[index+2:]]) else: index = random.randrange(2, l - 2, 2) stack.extend([top[:index], top[index:]]) ranges.sort(key=lambda x: (x[0], -x[1])) return [(a, b, random.randrange(dat)) for a, b in ranges]

New answers must generate the correct output for make_sample(2048, 65536, 32), the correct output can be verified by comparing against the output of descritize function found in the answer on Stack Overflow linked at the top of the question in the first paragraph.

And it also needs to be comparable in performance to it, and of course for it to be of comparable performance it must not eat up RAM and freeze the computer.

I edited the code that generates the samples to address bugs in the logic.

The code generates invalid input because the numbers are required to be unique, [(0, 1, 'A'), (0, 1, 'B')] is invalid, but cases like this can be generated by my code.

I first thought random.choices guarantees the choices to be unique if sample size is less than population size, I was wrong, it is just highly likely.

I fixed the code so that all numbers are guaranteed to be unique, and invalid test cases won't generated.

I edited the code but I haven't tested the functions thoroughly yet, my edit was immediately rolled back, though I would argue the edited code isn't what I wanted reviewed, this is an error by the one who made the reversion, though I won't edit the code again lest it be rolled back again.

And yes, the input is guaranteed to be sorted in ascending order, as I mentioned in my question multiple times and demonstrated with the code that generates test cases.

Lastly I am experiencing some "technical difficulties", more specifically my computer doesn't have internet connections right now because I just broke my modem (I just bought a new one online), and I am using my phone right now (which doesn't have reliable VPN connection now and in China Stack Exchange is throttled to nearly inaccessible), so updates will be much less frequent as of now.

But I will try to award the bounty within the bounty grace period if I found a bounty-worthy answer (if new answers are posted) or if I determined @booboo's method to be correct. That is, if I managed to visit this site at all, of course.

It is summer and my apartment building has experienced yet another blackout, my computer is a desktop and I can't even boot my computer to test the code. I was out before and now my phone is running out of juice, I don't know how long the blackout will last so there is a chance I won't be able to award the bounty.

My method is by far faster [than using Interval Trees] did you establish that using the data presented above, or millions of triplets? — greybeard, CommentedJul 6, 2023 at 9:02
Both above lists data and sample are in ascending order with respect to their first elements. Is this guaranteed? — greybeard, CommentedJul 6, 2023 at 9:22
I did point that out in my question, in case you missed it, yes, it is granted, the input is already sorted, and if it isn't I will sort first. — Ξένη Γήινος, CommentedJul 6, 2023 at 9:59
Are you able to specify your performance goals more concretely? For example, how long does it take for your current code to process 1 million triples, and what is your goal? Is your current implementation nearly fast enough, half as fast as it needs to be, or an order of magnitude too slow? Clarity on such matters is important because it can guide the allocation of effort. — FMc, CommentedJul 6, 2023 at 16:21
Please do post answers, as I really want the code to be improved, it is important enough I decided to put up a bounty. There are still four days until the end of the bounty, and I want at least another answer. I will award the bounty manually when it's over, and if I am unlucky nobody posted another answer I will award it to @booboo. — Ξένη Γήινος, CommentedJul 10, 2023 at 15:15

Booboo · Accepted Answer · 2023-07-14 12:45:43Z

Update

I have abandoned my implementation, which I agree does not work on the test case the OP posted. However, I do believe that the OP's code also produces erroneous results. So first I am replicating below comments I had made to my answer that highlight what I perceive to be the problem:

I have been trying to modify my code so that I get the same results with the test data you posted here. I printed out the first few ranges in your results and ranges 17 through 19 (origin 0) are Range(2414, 2573, 14), Range(2574, 2599, 9), Range(2414, 2574, 14). That does not look right to me. The algorithm you state is correct, which I have not studied, has the following ranges for elements 17 - 19: (2414, 2574, 14), (2575, 2599, 9), (2600, 2816, 6), which is clearly different from what your code produced. The test data has as input [... (2414, 2574, 14), (2574, 2599, 9), ...]. I would think that the value for key 2574 should be 9 if we apply the dictionary updates in order. This suggests to me that the known "correct" algorithm is not so correct and elements 17 - 19 should be (2414, 2573, 14), (2574, 2599, 9), (2600, 2816, 6).

To verify what I expected the correct results to be, I implemented code that computes the ranges in an inefficient but hopefully correct way: I allocate a list values whose size if MAX_RANGE_NUMBER - MIN_RANGE_NUMBER + 1 where MAX_RANGE_NUMBER would correspond to the maximum value of t[1] where t enumerates all the tuples in your test data and MIN_RANGE_NUMBER is the minimum value of t[0]. I initialize each element of the list to None. I then proceed to iterate each tuple in the test data. If a tuple were, for example, (17, 27, 19) I would assign values[17-MIN_RANGE_NUMBER:27+MIN_RANGE_NUMBER+1] = [19] * 11. After all ranges have been thus processed it is fairly simple to process the values list to come up with the non-overlapping ranges. I then print out the first 20 such ranges:

from typing import List, Tuple def discretize_correct(ranges: List[Tuple]) -> List[Tuple]: """We are passed a list of tuples each representing a range and we return a new list of range tuples.""" if not ranges: # Empty list return [] results = [] MIN_RANGE_NUMBER = min(range[0] for range in ranges) MAX_RANGE_NUMBER = max(range[1] for range in ranges) range_size = MAX_RANGE_NUMBER - MIN_RANGE_NUMBER + 1 values = [None] * range_size for range in ranges: start, end, value = range start -= MIN_RANGE_NUMBER end -= MIN_RANGE_NUMBER values[start:end+1] = [value] * (end + 1 - start) results = [] start_idx, last_value = None, None for idx, value in enumerate(values, start=MIN_RANGE_NUMBER): if value is None: if last_value is not None: results.append((start_idx, idx - 1, last_value)) last_value = None elif last_value is None: start_idx = idx last_value = value elif value != last_value: results.append((start_idx, idx - 1, last_value)) start_idx = idx last_value = value if last_value is not None: results.append((start_idx, MAX_RANGE_NUMBER, last_value)) return results if __name__ == '__main__': from ast import literal_eval with open('test.txt') as f: data = literal_eval(f.read()) ranges = discretize_correct(data) # Print out first 20 ranges: for i in range(20): print(f'{i}: {ranges[i]}')

Prints:

0: (10, 181, 2) 1: (182, 706, 11) 2: (707, 920, 6) 3: (921, 1018, 7) 4: (1019, 1032, 6) 5: (1033, 1200, 2) 6: (1201, 1204, 6) 7: (1205, 1617, 0) 8: (1618, 1667, 6) 9: (1668, 1905, 11) 10: (1906, 1977, 6) 11: (1978, 2015, 2) 12: (2016, 2075, 6) 13: (2076, 2082, 3) 14: (2083, 2201, 6) 15: (2202, 2320, 10) 16: (2321, 2413, 6) 17: (2414, 2573, 14) 18: (2574, 2599, 9) 19: (2600, 2816, 6)

Conclusion

The above verifies that the OP's code and the supposedly correct code posted here both produce results not only different from one another but also from what I believe are truly the correct results. Even if I have misunderstood the problem definition rendering the above results from my discretize_correct function irrelevant, the OP's results still vary from those produced from what the OP says is correct code.

Consequently, should the OP's code now be moved to stackoverflow.com for correction before it is re-posted here?

Some observations about your code, heapq.heapify is unnecessary here and only serves to waste execution time, since the input is already sorted, sorted sequences are a special kind of heap. Also there is no need to do another heapq.heappop to get range2, heaps always guarantee the first element is the smallest one, so using indexing (ranges[0]) will be much faster than heapq.heappush and saves the unnecessary heapq.heappush, the above two optimizations will save a huge amount of execution time. — Ξένη Γήινος, CommentedJul 12, 2023 at 10:43
Also, your code doesn't work properly, for example, if the input is [(0, 10, 'A'), (100, 200, 'B'), (150, 180, 'C')], the output is [(0, 10, 'A'), (11, 99, 'B'), (100, 149, 'B'), (150, 180, 'C'), (181, 200, 'B')], one extra range is added: (11, 99, 'B'), my code doesn't add new ranges that weren't in the original input, if there were gaps in the input, the gaps shouldn't be filled. — Ξένη Γήινος, CommentedJul 12, 2023 at 10:47
@ΞένηΓήινος Thanks for the input. I have updated the answer with the necessary changes. — Booboo, CommentedJul 12, 2023 at 11:23
Unfortunately, after extensive testing I have found that your code doesn't produce the correct output for large inputs, and it causes my interpreter to eat up my RAM (I have installed 16 GiB RAM and it gradually consumes 8GiB+ RAM), forcing me to reboot my computer. See my update. — Ξένη Γήινος, CommentedJul 12, 2023 at 13:50
I call copy on the original passed list so as to not destroy it. But if you do not need to keep it around after the results have been produced, the heapq.pop call will eventually reduce the size to 0. And is the input that fails too large to reproduce for me somewhere? — Booboo, CommentedJul 12, 2023 at 14:08

greybeard · Accepted Answer · 2023-07-06 10:26:37Z

I find your code avoidably hard to follow, lack of docstrings and comments contributing.
(Not quite sure about what else - method naming, possibly: prepend() is particularly miraculous. I understand join(), if not it's parameters - why isn't there just another Range? (Same for update() and process(), I guess, both names being non-suggestive.))

processed and stack "feel wrong" as Range class data members.
What would be the purpose of a class where these are instance data members?
Range_discretiser? Range_separator, Range_processor?

I don't think running your code faster is on topic here; options include PyPy, Cython and Numba.

use

 def cover(self, fin): return fin <= self.fin

– I don't like the (half implied) if ‹condition›: return True else: return ‹falsy›

 def cover(self, fin): if fin <= self.fin: return True else: # implied return None # falsy

J_H · Accepted Answer · 2023-07-08 15:04:25Z

GvR was right when he anticipated the := walrus operator would encourage folks to write hard-to-read code. Please resist the temptation.

class Range: processed = deque() stack = deque()

What is going on here?!? Singletons? With no motivation in the docs or comments? We need to understand why There Can Be Only One computation in progress.

Perhaps another class should hold these? Or they should be local to discretize() and passed around as parameters?

Please return a single type, such as bool on a predicate. Not because it makes things easier on the machine, but because it makes it easier for the Gentle Reader to see your meaning.

 def join(self, ini, fin): if ini <= self.fin + 1: self.fin = fin return True

This implicitly ends with return None. Prefer an explicit return False. Adopt mypy, and annotate this as returning ...) -> bool:

Similarly for the return value of update().

Both methods should """document""" that they are predicates, as the names do not spell that out at all. I would typically expect join() to return a str which combines its arguments, and update() to be void, evaluated for side effects.

 elif ini <= self.fin and not (ignore := self.update(ini, fin, data)): self.prepend(ini) if not ignore:

No.

Assign ignore, and then inspect it. It's an important state variable which the following if not inspects. Don't bury the state change in the 2nd argument of and. It's only assigned when ini <= self.fin. You're being too tricky.

Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it?
—- Kernighan & Plauger, The Elements of Programming Style, chapter 2

Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.

Plus, you're making it hard to recruit new graduates to your team who are willing and able to maintain your code.

In contrast, this line is easy to reason about:

 ignore = self.cover(fin) or self.join(ini, fin)

Consider renaming ignore, and/or commenting on what return not ignore means to the caller. Does every method need a """docstring"""? No. But this one really does, given how vacuous identifiers like process or do_it are.

Input ranges must be sorted, so consider offering a method or option which verifies that.

Stack Exchange Network

Python script to split overlapping ranges, version 4

Code

Update

3 Answers 3

Linked

Hot Network Questions

Python script to split overlapping ranges, version 4

Code

Update

3 Answers 3

Linked

Related

Hot Network Questions