I have the following code:
    function TSliverHelper.SlowNorth: TSlice;
    var
      i: integer;
    begin
      // Add pixels 0,1,2
      // This means expanding every bit into a byte
      // Or rather every byte into an int64;
      for i:= 0 to 7 do begin
        Result.Data8[i]:= TSuperSlice.Lookup012[Self.bytes[i]];
      end;
    end;
This uses a straightforward lookup table, but obviously LUTs are slow and clobber the cache. It takes about 2860 ms for 100,000,000 items.
The following approach is a bit faster (1797 ms, i.e. about 37% less time):
    function TSliverHelper.North: TSlice;
    const
      SliverToSliceMask: array[0..7] of byte = ($01,$02,$04,$08,$10,$20,$40,$80);
    asm
      //RCX = @Self (a pointer to an Int64)
      //RDX = @Result (a pointer to an array[0..63] of byte)
      movq xmm0,[rcx]               //Get the sliver
      mov r9,$8040201008040201
      movq xmm15,r9                 //[rip+SliverToSliceMask] //Get the mask
      movlhps xmm15,xmm15           //extend it
      mov r8,$0101010101010101      //Shuffle mask
      movq xmm14,r8                 //00 00 00 00 00 00 00 00 01 01 01 01 01 01 01 01
      pslldq xmm14,8                //01 01 01 01 01 01 01 01 00 00 00 00 00 00 00 00
      movdqa xmm1,xmm0              //make a copy of the sliver
      //bytes 0,1
      pshufb xmm1,xmm14             //copy the first two bytes across
      pand xmm1,xmm15               //Mask off the relevant bits
      pcmpeqb xmm1,xmm15            //Expand a bit into a byte
      movdqu [rdx],xmm1
      //bytes 2,3
      psrldq xmm0,2                 //shift in the next two bytes
      movdqa xmm2,xmm0
      pshufb xmm2,xmm14             //copy the next two bytes across
      pand xmm2,xmm15               //Mask off the relevant bits
      pcmpeqb xmm2,xmm15            //Expand a bit into a byte
      movdqu [rdx+16],xmm2
      //bytes 4,5
      psrldq xmm0,2                 //shift in the next two bytes
      movdqa xmm3,xmm0
      pshufb xmm3,xmm14             //copy the next two bytes across
      pand xmm3,xmm15               //Mask off the relevant bits
      pcmpeqb xmm3,xmm15            //Expand a bit into a byte
      movdqu [rdx+32],xmm3
      //bytes 6,7
      psrldq xmm0,2                 //shift in the next two bytes
      movdqa xmm4,xmm0
      pshufb xmm4,xmm14             //copy the final two bytes across
      pand xmm4,xmm15               //Mask off the relevant bits
      pcmpeqb xmm4,xmm15            //Expand a bit into a byte
      //Store the data
      movdqu [rdx+48],xmm4
    end;
However, that is a lot of code. I'm hoping there's a way to do this with less processing that works faster. The way the code works (in prose) is simple.
First we clone each input byte 8 times. Next, one bit is isolated in each copy using the 01,02,04... mask and an AND operation. Finally this bit, which sits at a different position in each copy, is expanded into a full byte using compare-equal-to-mask (pcmpeqb).
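To make the three steps concrete, here is a minimal scalar sketch of what happens to a single input byte. The names `TExpanded`, `BitMask` and `ExpandByte` are hypothetical and exist only for this illustration. It mirrors the pshufb / pand / pcmpeqb sequence, so a set bit becomes `$FF`, which is not necessarily the value stored in `Lookup012`.

```pascal
program ExpandBitsSketch;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFDEF MSWINDOWS}{$APPTYPE CONSOLE}{$ENDIF}

type
  // Hypothetical type: one output byte per input bit.
  TExpanded = array[0..7] of Byte;

const
  // Same values as the 01,02,04... mask in the asm version.
  BitMask: array[0..7] of Byte = ($01,$02,$04,$08,$10,$20,$40,$80);

// Scalar model of the SIMD steps for one input byte.
function ExpandByte(b: Byte): TExpanded;
var
  lane: Integer;
  cloned, masked: Byte;
begin
  for lane := 0 to 7 do
  begin
    cloned := b;                         // step 1: clone the byte (pshufb)
    masked := cloned and BitMask[lane];  // step 2: isolate one bit (pand)
    if masked = BitMask[lane] then       // step 3: compare with mask (pcmpeqb)
      Result[lane] := $FF
    else
      Result[lane] := $00;
  end;
end;

var
  r: TExpanded;
  i: Integer;
begin
  r := ExpandByte($A5);  // binary 10100101: bits 0, 2, 5 and 7 are set
  for i := 0 to 7 do
    Write(r[i], ' ');    // prints 255 0 255 0 0 255 0 255
  WriteLn;
end.
```

The asm version performs the same work on 16 lanes at once, which is why each group of pshufb / pand / pcmpeqb handles two input bytes.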
The opposite operation is a simple PMOVMSKB.
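For comparison, the reverse direction can be modelled in scalar code as below. This is only a sketch of the idea: `TExpanded` and `CollectSignBits` are hypothetical names, and the real instruction gathers the top bit of all 16 bytes of an XMM register rather than 8.

```pascal
program MoveMaskSketch;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFDEF MSWINDOWS}{$APPTYPE CONSOLE}{$ENDIF}

type
  // Hypothetical type: eight expanded bytes, one per bit.
  TExpanded = array[0..7] of Byte;

// Scalar model of the byte-mask move: take the top bit of every
// byte and pack those bits into a single byte, lane 0 first.
function CollectSignBits(const lanes: TExpanded): Byte;
var
  lane: Integer;
begin
  Result := 0;
  for lane := 0 to 7 do
    if (lanes[lane] and $80) <> 0 then
      Result := Result or Byte(1 shl lane);
end;

const
  Lanes: TExpanded = ($FF, $00, $FF, $00, $00, $FF, $00, $FF);

begin
  WriteLn(CollectSignBits(Lanes));  // prints 165, which is $A5
end.
```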
I can use AVX1 code, but not AVX2.