Is my algorithm slow because it has problems? Or, is any bitwise common prefix solution not going to give me the performance I'm looking for?
After profiling my algorithm, I found that over 60% of the time was spent on one line len(os.path.commonprefix((item1, item2)))
. I'm only looking for the length of the prefix
To solve this I tried to write a bitwise prefix solution
def bit_prefix(a, b): min_len = min(len(a), len(b)) if min_len > 0: x = str(bin( int(a[::-1].encode('hex'), 16) ^ int(b[::-1].encode('hex'), 16))) y = x.strip('0') if len(y) is 1: return min_len else: return (len(x) - len(y)) / 8 else: return 0
I've only gotten a marginal improvement in speed with long prefixes
a = 'a' * 1000000 + 'z' b = 'a' * 900000 + 'z' timeit.timeit(lambda: bit_prefix(a, b), number=100) Out[34]: 6.340372534867129 timeit.timeit(lambda: len(os.path.commonprefix((a, b))), number=100) Out[35]: 7.5483549056534684 print bit_prefix(a, b), len(os.path.commonprefix((a, b))) 900000 900000
And my algorithm performs more poorly with short prefixes
a = 'aaz' b = 'az' timeit.timeit(lambda: bit_prefix(a, b), number=1000000) Out[42]: 3.968956086175467 timeit.timeit(lambda: len(os.path.commonprefix((a, b))), number=1000000) Out[43]: 1.1592788235707303 print bit_prefix(a, b), len(os.path.commonprefix((a, b))) 1 1
If my algorithm isn't broken and a bitwise solution won't give me the performance boost I'm looking for, can you refer me to a common prefix solution that would outperform os.path.commonprefix?
Here's the bit_pefix profile
Line # Hits Time Per Hit % Time Line Contents ============================================================== 29 @profile 30 def bit_prefix(a, b): 31 36140 81099 2.2 4.2 min_len = min(len(a), len(b)) 32 36140 49386 1.4 2.5 if min_len > 0: 33 36140 49739 1.4 2.6 x = str( 34 36140 47846 1.3 2.5 bin( 35 36140 47232 1.3 2.4 int( 36 36140 89499 2.5 4.6 a[::-1] 37 36140 242994 6.7 12.5 .encode('hex'), 38 36140 164442 4.6 8.4 16) 39 ^ 40 36140 49601 1.4 2.5 int( 41 36140 88745 2.5 4.6 b[::-1] 42 36140 216488 6.0 11.1 .encode('hex'), 43 36140 504425 14.0 25.9 16))) 44 36140 187571 5.2 9.6 y = x.strip('0') 45 36140 61027 1.7 3.1 if len(y) is 1: 46 return min_len 47 else: 48 36140 67507 1.9 3.5 return (len(x) - len(y)) / 8 49 else: 50 return 0
Update
I've created another algorithm that seems to be faster than others. Here's the Code Review for that.
enumerate()
withitertools.izip()
would be blisteringly fast since they return generators, so a minimum of characters are iterated, but even that came out with just barely better or just barely worse: about the same.\$\endgroup\$