3
\$\begingroup\$

Is my algorithm slow because it has problems? Or, is any bitwise common prefix solution not going to give me the performance I'm looking for?

After profiling my algorithm, I found that over 60% of the time was spent on one line len(os.path.commonprefix((item1, item2))). I'm only looking for the length of the prefix

To solve this I tried to write a bitwise prefix solution

def bit_prefix(a, b): min_len = min(len(a), len(b)) if min_len > 0: x = str(bin( int(a[::-1].encode('hex'), 16) ^ int(b[::-1].encode('hex'), 16))) y = x.strip('0') if len(y) is 1: return min_len else: return (len(x) - len(y)) / 8 else: return 0 

I've only gotten a marginal improvement in speed with long prefixes

a = 'a' * 1000000 + 'z' b = 'a' * 900000 + 'z' timeit.timeit(lambda: bit_prefix(a, b), number=100) Out[34]: 6.340372534867129 timeit.timeit(lambda: len(os.path.commonprefix((a, b))), number=100) Out[35]: 7.5483549056534684 print bit_prefix(a, b), len(os.path.commonprefix((a, b))) 900000 900000 

And my algorithm performs more poorly with short prefixes

a = 'aaz' b = 'az' timeit.timeit(lambda: bit_prefix(a, b), number=1000000) Out[42]: 3.968956086175467 timeit.timeit(lambda: len(os.path.commonprefix((a, b))), number=1000000) Out[43]: 1.1592788235707303 print bit_prefix(a, b), len(os.path.commonprefix((a, b))) 1 1 

If my algorithm isn't broken and a bitwise solution won't give me the performance boost I'm looking for, can you refer me to a common prefix solution that would outperform os.path.commonprefix?

Here's the bit_pefix profile

Line # Hits Time Per Hit % Time Line Contents ============================================================== 29 @profile 30 def bit_prefix(a, b): 31 36140 81099 2.2 4.2 min_len = min(len(a), len(b)) 32 36140 49386 1.4 2.5 if min_len > 0: 33 36140 49739 1.4 2.6 x = str( 34 36140 47846 1.3 2.5 bin( 35 36140 47232 1.3 2.4 int( 36 36140 89499 2.5 4.6 a[::-1] 37 36140 242994 6.7 12.5 .encode('hex'), 38 36140 164442 4.6 8.4 16) 39 ^ 40 36140 49601 1.4 2.5 int( 41 36140 88745 2.5 4.6 b[::-1] 42 36140 216488 6.0 11.1 .encode('hex'), 43 36140 504425 14.0 25.9 16))) 44 36140 187571 5.2 9.6 y = x.strip('0') 45 36140 61027 1.7 3.1 if len(y) is 1: 46 return min_len 47 else: 48 36140 67507 1.9 3.5 return (len(x) - len(y)) / 8 49 else: 50 return 0 

Update

I've created another algorithm that seems to be faster than others. Here's the Code Review for that.

\$\endgroup\$
3
  • 1
    \$\begingroup\$I don't think you'll get anything that is much faster. I thought that using enumerate() with itertools.izip() would be blisteringly fast since they return generators, so a minimum of characters are iterated, but even that came out with just barely better or just barely worse: about the same.\$\endgroup\$
    – zondo
    CommentedMar 28, 2016 at 12:14
  • \$\begingroup\$@zondo Thanks for izip suggestion, it was better than everything else I found online! I tested it in another Code Review post.\$\endgroup\$CommentedMar 29, 2016 at 5:16
  • \$\begingroup\$I’m fundamentally puzzled why you think stepping through the input bit by bit would be faster. It requires a ton of preprocessing, for starters.\$\endgroup\$CommentedSep 20, 2019 at 14:47

1 Answer 1

2
\$\begingroup\$

Is my algorithm slow because it has problems?

Yes. The fundamental problem is that it always processes every character of the input.

There are some obvious improvements which don't address this fundamental problem:

  • Since the common prefix can't be more than min_len, truncate both strings to min_len before encoding them.
  • bin returns a string, so str(bin(...)) is overkill.
  • bin(int(a[::-1].encode('hex'), 16) ^ int(b[::-1].encode('hex'), 16)) is also overkill: what you care about is the position of the lowest set bit of int(a[::-1].encode('hex'), 16) ^ int(b[::-1].encode('hex'), 16). You can extract the lowest set bit directly from the integer as

    differences = int(a[::-1].encode('hex'), 16) ^ int(b[::-1].encode('hex'), 16) least_significant_difference = differences ^ (differences - 1) 

    (That actually gives a binary number which is all 1s from the lowest set bit to the least significant bit).

    Then you can either convert to string and find the length, or take the logarithm base 2.

But the fundamental problem is still there: the fastest code is the code which isn't executed, and when the prefix is very short then a simple loop from the start which compares characters is hard to beat. If you really have to beat it, I think you're probably looking at writing some C which coerces a pointer to (16-bit?) characters into a pointer to 64-bit integers and uses that to compare a block of characters at a time. Beware endianness.

\$\endgroup\$

    Start asking to get answers

    Find the answer to your question by asking.

    Ask question

    Explore related questions

    See similar questions with these tags.