I am trying to optimize a Reed-Solomon encoder, which is in fact simply a polynomial division operation over Galois Fields 2^8 (which simply means that values wrap-around over 255). The code is in fact very very similar to what can be found here for Go: http://research.swtch.com/field
The algorithm for polynomial division used here is a synthetic division (also called Horner's method).
I tried everything: numpy, pypy, cython. The best performance I get is by using pypy with this simple nested loop:
def rsenc(msg_in, nsym, gen): '''Reed-Solomon encoding using polynomial division, better explained at http://research.swtch.com/field''' msg_out = bytearray(msg_in) + bytearray(len(gen)-1) lgen = bytearray([gf_log[gen[j]] for j in xrange(len(gen))]) for i in xrange(len(msg_in)): coef = msg_out[i] # coef = gf_mul(msg_out[i], gf_inverse(gen[0])) // for general polynomial division (when polynomials are non-monic), we need to compute: coef = msg_out[i] / gen[0] if coef != 0: # coef 0 is normally undefined so we manage it manually here (and it also serves as an optimization btw) lcoef = gf_log[coef] # precaching for j in xrange(1, len(gen)): # optimization: can skip g0 because the first coefficient of the generator is always 1! (that's why we start at position 1) msg_out[i + j] ^= gf_exp[lcoef + lgen[j]] # equivalent (in Galois Field 2^8) to msg_out[i+j] += msg_out[i] * gen[j] # Recopy the original message bytes msg_out[:len(msg_in)] = msg_in return msg_out
Can a Python optimization wizard guide me to some clues on how to get a speedup? My goal is to get at least a speedup of 3x, but more would be awesome. Any approach or tool is accepted, as long as it is cross-platform (works at least on Linux and Windows).
Here is a small test script with some of the other alternatives I tried (the cython attempt is not included since it was slower than native python!):
import random from operator import xor numpy_enabled = False try: import numpy as np numpy_enabled = True except ImportError: pass # Exponent table for 3, a generator for GF(256) gf_exp = bytearray([1, 3, 5, 15, 17, 51, 85, 255, 26, 46, 114, 150, 161, 248, 19, 53, 95, 225, 56, 72, 216, 115, 149, 164, 247, 2, 6, 10, 30, 34, 102, 170, 229, 52, 92, 228, 55, 89, 235, 38, 106, 190, 217, 112, 144, 171, 230, 49, 83, 245, 4, 12, 20, 60, 68, 204, 79, 209, 104, 184, 211, 110, 178, 205, 76, 212, 103, 169, 224, 59, 77, 215, 98, 166, 241, 8, 24, 40, 120, 136, 131, 158, 185, 208, 107, 189, 220, 127, 129, 152, 179, 206, 73, 219, 118, 154, 181, 196, 87, 249, 16, 48, 80, 240, 11, 29, 39, 105, 187, 214, 97, 163, 254, 25, 43, 125, 135, 146, 173, 236, 47, 113, 147, 174, 233, 32, 96, 160, 251, 22, 58, 78, 210, 109, 183, 194, 93, 231, 50, 86, 250, 21, 63, 65, 195, 94, 226, 61, 71, 201, 64, 192, 91, 237, 44, 116, 156, 191, 218, 117, 159, 186, 213, 100, 172, 239, 42, 126, 130, 157, 188, 223, 122, 142, 137, 128, 155, 182, 193, 88, 232, 35, 101, 175, 234, 37, 111, 177, 200, 67, 197, 84, 252, 31, 33, 99, 165, 244, 7, 9, 27, 45, 119, 153, 176, 203, 70, 202, 69, 207, 74, 222, 121, 139, 134, 145, 168, 227, 62, 66, 198, 81, 243, 14, 18, 54, 90, 238, 41, 123, 141, 140, 143, 138, 133, 148, 167, 242, 13, 23, 57, 75, 221, 124, 132, 151, 162, 253, 28, 36, 108, 180, 199, 82, 246] * 2 + [1]) # Logarithm table, base 3 gf_log = bytearray([0, 0, 25, 1, 50, 2, 26, 198, 75, 199, 27, 104, 51, 238, 223, # BEWARE: the first entry should be None instead of 0 because it's undefined, but for a bytearray we can't set such a value 3, 100, 4, 224, 14, 52, 141, 129, 239, 76, 113, 8, 200, 248, 105, 28, 193, 125, 194, 29, 181, 249, 185, 39, 106, 77, 228, 166, 114, 154, 201, 9, 120, 101, 47, 138, 5, 33, 15, 225, 36, 18, 240, 130, 69, 53, 147, 218, 142, 150, 143, 219, 189, 54, 208, 206, 148, 19, 92, 210, 241, 64, 70, 131, 56, 102, 221, 253, 48, 191, 6, 139, 98, 179, 37, 226, 152, 34, 136, 145, 16, 126, 110, 72, 195, 163, 182, 30, 66, 58, 107, 40, 84, 250, 133, 61, 186, 43, 121, 10, 21, 155, 159, 94, 202, 78, 212, 172, 229, 243, 115, 167, 87, 175, 88, 168, 80, 244, 234, 214, 116, 79, 174, 233, 213, 231, 230, 173, 232, 44, 215, 117, 122, 235, 22, 11, 245, 89, 203, 95, 176, 156, 169, 81, 160, 127, 12, 246, 111, 23, 196, 73, 236, 216, 67, 31, 45, 164, 118, 123, 183, 204, 187, 62, 90, 251, 96, 177, 134, 59, 82, 161, 108, 170, 85, 41, 157, 151, 178, 135, 144, 97, 190, 220, 252, 188, 149, 207, 205, 55, 63, 91, 209, 83, 57, 132, 60, 65, 162, 109, 71, 20, 42, 158, 93, 86, 242, 211, 171, 68, 17, 146, 217, 35, 32, 46, 137, 180, 124, 184, 38, 119, 153, 227, 165, 103, 74, 237, 222, 197, 49, 254, 24, 13, 99, 140, 128, 192, 247, 112, 7]) if numpy_enabled: np_gf_exp = np.array(gf_exp) np_gf_log = np.array(gf_log) def gf_pow(x, power): return gf_exp[(gf_log[x] * power) % 255] def gf_poly_mul(p, q): r = [0] * (len(p) + len(q) - 1) lp = [gf_log[p[i]] for i in xrange(len(p))] for j in range(len(q)): lq = gf_log[q[j]] for i in range(len(p)): r[i + j] ^= gf_exp[lp[i] + lq] return r def rs_generator_poly_base3(nsize, fcr=0): g_all = {} g = [1] g_all[0] = g_all[1] = g for i in range(fcr+1, fcr+nsize+1): g = gf_poly_mul(g, [1, gf_pow(3, i)]) g_all[nsize-i] = g return g_all # Fastest way with pypy def rsenc(msg_in, nsym, gen): '''Reed-Solomon encoding using polynomial division, better explained at http://research.swtch.com/field''' msg_out = bytearray(msg_in) + bytearray(len(gen)-1) lgen = bytearray([gf_log[gen[j]] for j in xrange(len(gen))]) for i in xrange(len(msg_in)): coef = msg_out[i] # coef = gf_mul(msg_out[i], gf_inverse(gen[0])) # for general polynomial division (when polynomials are non-monic), the usual way of using synthetic division is to divide the divisor g(x) with its leading coefficient (call it a). In this implementation, this means:we need to compute: coef = msg_out[i] / gen[0] if coef != 0: # coef 0 is normally undefined so we manage it manually here (and it also serves as an optimization btw) lcoef = gf_log[coef] # precaching for j in xrange(1, len(gen)): # optimization: can skip g0 because the first coefficient of the generator is always 1! (that's why we start at position 1) msg_out[i + j] ^= gf_exp[lcoef + lgen[j]] # equivalent (in Galois Field 2^8) to msg_out[i+j] += msg_out[i] * gen[j] # Recopy the original message bytes msg_out[:len(msg_in)] = msg_in return msg_out # Alternative 1: the loops were completely changed, instead of fixing msg_out[i] and updating all subsequent i+j items, we now fixate msg_out[i+j] and compute it at once using all couples msg_out[i] * gen[j] - msg_out[i+1] * gen[j-1] - ... since when we fixate msg_out[i+j], all previous msg_out[k] with k < i+j are already fully computed. def rsenc_alt1(msg_in, nsym, gen): msg_in = bytearray(msg_in) msg_out = bytearray(msg_in) + bytearray(len(gen)-1) lgen = bytearray([gf_log[gen[j]] for j in xrange(len(gen))]) # Alternative 1 jlist = range(1, len(gen)) for k in xrange(1, len(msg_out)): for x in xrange(max(k-len(msg_in),0), len(gen)-1): if k-x-1 < 0: break msg_out[k] ^= gf_exp[msg_out[k-x-1] + lgen[jlist[x]]] # Recopy the original message bytes msg_out[:len(msg_in)] = msg_in return msg_out # Alternative 2: a rewrite of alternative 1 with generators and reduce def rsenc_alt2(msg_in, nsym, gen): msg_in = bytearray(msg_in) msg_out = bytearray(msg_in) + bytearray(len(gen)-1) lgen = bytearray([gf_log[gen[j]] for j in xrange(len(gen))]) # Alternative 1 jlist = range(1, len(gen)) for k in xrange(1, len(msg_out)): items_gen = ( gf_exp[msg_out[k-x-1] + lgen[jlist[x]]] if k-x-1 >= 0 else next(iter(())) for x in xrange(max(k-len(msg_in),0), len(gen)-1) ) msg_out[k] ^= reduce(xor, items_gen) # Recopy the original message bytes msg_out[:len(msg_in)] = msg_in return msg_out # Alternative with Numpy def rsenc_numpy(msg_in, nsym, gen): msg_in = np.array(bytearray(msg_in)) msg_out = np.pad(msg_in, (0, nsym), 'constant') lgen = np_gf_log[gen] for i in xrange(msg_in.size): msg_out[i+1:i+lgen.size] ^= np_gf_exp[np.add(lgen[1:], msg_out[i])] msg_out[:len(msg_in)] = msg_in return msg_out gf_mul_arr = [bytearray(256) for _ in xrange(256)] gf_add_arr = [bytearray(256) for _ in xrange(256)] # Precompute multiplication and addition tables def gf_precomp_tables(gf_exp=gf_exp, gf_log=gf_log): global gf_mul_arr, gf_add_arr for i in xrange(256): for j in xrange(256): gf_mul_arr[i][j] = gf_exp[gf_log[i] + gf_log[j]] gf_add_arr[i][j] = i ^ j return gf_mul_arr, gf_add_arr # Alternative with precomputation of multiplication and addition tables, inspired by zfec: https://hackage.haskell.org/package/fec-0.1.1/src/zfec/fec.c def rsenc_precomp(msg_in, nsym, gen=None): msg_in = bytearray(msg_in) msg_out = bytearray(msg_in) + bytearray(len(gen)-1) for i in xrange(len(msg_in)): # [i for i in xrange(len(msg_in)) if msg_in[i] != 0] coef = msg_out[i] if coef != 0: # coef 0 is normally undefined so we manage it manually here (and it also serves as an optimization btw) mula = gf_mul_arr[coef] for j in xrange(1, len(gen)): # optimization: can skip g0 because the first coefficient of the generator is always 1! (that's why we start at position 1) #msg_out[i + j] = gf_add_arr[msg_out[i+j]][gf_mul_arr[coef][gen[j]]] # slower... #msg_out[i + j] ^= gf_mul_arr[coef][gen[j]] # faster msg_out[i + j] ^= mula[gen[j]] # fastest # Recopy the original message bytes msg_out[:len(msg_in)] = msg_in # equivalent to c = mprime - b, where mprime is msg_in padded with [0]*nsym return msg_out def randstr(n, size): '''Generate very fastly a random hexadecimal string. Kudos to jcdryer http://stackoverflow.com/users/131084/jcdyer''' hexstr = '%0'+str(size)+'x' for _ in xrange(n): yield hexstr % random.randrange(16**size) # Simple test case if __name__ == "__main__": # Setup functions to test funcs = [rsenc, rsenc_precomp, rsenc_alt1, rsenc_alt2] if numpy_enabled: funcs.append(rsenc_numpy) gf_precomp_tables() # Setup RS vars n = 255 k = 213 import time # Init the generator polynomial g = rs_generator_poly_base3(n) # Init the ground truth mes = 'hello world' mesecc_correct = rsenc(mes, n-11, g[k]) # Test the functions for func in funcs: # Sanity check if func(mes, n-11, g[k]) != mesecc_correct: print func.__name__, ": output is incorrect!" # Time the function total_time = 0 for m in randstr(1000, n): start = time.clock() func(m, n-k, g[k]) total_time += time.clock() - start print func.__name__, ": total time elapsed %f seconds." % total_time
And here is the result on my machine:
With PyPy: rsenc : total time elapsed 0.108183 seconds. rsenc_alt1 : output is incorrect! rsenc_alt1 : total time elapsed 0.164084 seconds. rsenc_alt2 : output is incorrect! rsenc_alt2 : total time elapsed 0.557697 seconds. Without PyPy: rsenc : total time elapsed 3.518857 seconds. rsenc_alt1 : output is incorrect! rsenc_alt1 : total time elapsed 5.630897 seconds. rsenc_alt2 : output is incorrect! rsenc_alt2 : total time elapsed 6.100434 seconds. rsenc_numpy : output is incorrect! rsenc_numpy : total time elapsed 1.631373 seconds
(Note: the alternatives should be correct, some index must be a bit off, but since they are slower anyway I did not try to fix them)
/UPDATE and goal of the bounty: I found a very interesting optimization trick that promises to speed up computations a lot: to precompute the multiplication table. I updated the code above with the new function rsenc_precomp(). However, there's no gain at all in my implementation, it's even a bit slower:
rsenc : total time elapsed 0.107170 seconds. rsenc_precomp : total time elapsed 0.108788 seconds.
How can it be that arrays lookups cost more than operations like additions or xor? Why does it work in ZFEC and not in Python?
I will attribute the bounty to whoever can show me how to make this multiplication/addition lookup-tables optimization work (faster than the xor and addition operations) or who can explain to me with references or analysis why this optimization cannot work here (using Python/PyPy/Cython/Numpy etc.. I tried them all).