In this post we will modify our convolution kernel to take advantage of the SSE instructions available on modern CPUs.
As before, we have made only incremental changes to the previous files, so you can do a diff to get a quick view of what additional code is present. The only significant change is the addition of new kernels to the kernels file.
Vectorization
Let us revisit the Convolve_Unroll kernel from the Loop Unrolling post. The while loop [while (c <= nFilterWidth-4)] performs four multiplications and four additions, with each multiplication immediately followed by its addition. We can reorder the statements slightly and still get the same result: we calculate the four multiplications first, followed by the four additions:
__kernel void Convolve_Unroll(const __global float * pInput,
                              __constant float * pFilter,
                              __global float * pOutput,
                              const int nInWidth,
                              const int nFilterWidth)
{
    const int nWidth = get_global_size(0);

    const int xOut = get_global_id(0);
    const int yOut = get_global_id(1);

    const int xInTopLeft = xOut;
    const int yInTopLeft = yOut;

    float sum0 = 0;
    float sum1 = 0;
    float sum2 = 0;
    float sum3 = 0;
    for (int r = 0; r < nFilterWidth; r++)
    {
        const int idxFtmp = r * nFilterWidth;

        const int yIn = yInTopLeft + r;
        const int idxIntmp = yIn * nInWidth + xInTopLeft;

        int c = 0;
        while (c <= nFilterWidth-4)
        {
            float mul0, mul1, mul2, mul3;

            int idxF = idxFtmp + c;
            int idxIn = idxIntmp + c;
            mul0 = pFilter[idxF]*pInput[idxIn];
            idxF++;
            idxIn++;
            mul1 = pFilter[idxF]*pInput[idxIn];
            idxF++;
            idxIn++;
            mul2 = pFilter[idxF]*pInput[idxIn];
            idxF++;
            idxIn++;
            mul3 = pFilter[idxF]*pInput[idxIn];

            sum0 += mul0;
            sum1 += mul1;
            sum2 += mul2;
            sum3 += mul3;

            c += 4;
        }
        for (int c1 = c; c1 < nFilterWidth; c1++)
        {
            const int idxF = idxFtmp + c1;
            const int idxIn = idxIntmp + c1;
            sum0 += pFilter[idxF]*pInput[idxIn];
        }
    } //for (int r = 0...
    const int idxOut = yOut * nWidth + xOut;
    pOutput[idxOut] = sum0 + sum1 + sum2 + sum3;
}
Instead of four products and four additions, it is possible to use one SSE packed multiply and one SSE packed add to get the same result. This saves three multiply and three add instructions per iteration. Since the saved instructions are in the innermost loop, the effect is substantial: for our 8192×8192 image, the savings is 8192 × 8192 × filterWidth × (filterWidth/4) × 6 instructions. (We are ignoring the additional load/store instructions that will be required to use SSE packed instructions.)
Vectorized Kernel
But how do we use SSE in the OpenCL™ kernel? AMD’s OpenCL implementation will try to use SSE ‘packed-arithmetic’ instructions whenever it encounters a vector datatype in the kernel. For the convolution example, we will use a float4 multiply and a float4 add in the inner loop. Compare the following kernel with the modified Convolve_Unroll kernel shown above. Both are similar, except for the use of a float4 datatype.
__kernel void Convolve_Float4(const __global float * pInput,
                              __constant float * pFilter,
                              __global float * pOutput,
                              const int nInWidth,
                              const int nFilterWidth)
{
    const int nWidth = get_global_size(0);

    const int xOut = get_global_id(0);
    const int yOut = get_global_id(1);

    const int xInTopLeft = xOut;
    const int yInTopLeft = yOut;

    float4 sum4 = 0;
    for (int r = 0; r < nFilterWidth; r++)
    {
        const int idxFtmp = r * nFilterWidth;

        const int yIn = yInTopLeft + r;
        const int idxIntmp = yIn * nInWidth + xInTopLeft;

        int c = 0;
        int c4 = 0;
        while (c <= nFilterWidth-4)
        {
            float4 filter4 = vload4(c4, pFilter+idxFtmp);
            float4 in4 = vload4(c4, pInput+idxIntmp);
            sum4 += in4 * filter4;
            c += 4;
            c4++;
        }
        for (int c1 = c; c1 < nFilterWidth; c1++)
        {
            const int idxF = idxFtmp + c1;
            const int idxIn = idxIntmp + c1;
            sum4.x += pFilter[idxF]*pInput[idxIn];
        }
    } //for (int r = 0...
    const int idxOut = yOut * nWidth + xOut;
    pOutput[idxOut] = sum4.x + sum4.y + sum4.z + sum4.w;
}
Performance
We compare the Convolve_Float4 kernel performance to the Convolve kernel.*

Vectorized Kernel (2)
We can further optimize the kernel by replacing the second inner loop [for (int c1 = c; c1 < nFilterWidth; c1++)] with an if-else statement. This is similar to how we changed the Convolve_Unroll kernel into the Convolve_UnrollIf kernel.
Yet another option is to write and build four different kernels for the four different cases of filterWidth%4 (= 0,1,2 or 3). At runtime, depending on the actual filter width, we can call the appropriate kernel.
The Convolve_Float4If kernel is given below. It is possible to further optimize this kernel by changing the code in the if-else clauses to use float4 datatypes.
__kernel void Convolve_Float4If(const __global float * pInput,
                                __constant float * pFilter,
                                __global float * pOutput,
                                const int nInWidth,
                                const int nFilterWidth)
{
    const int nWidth = get_global_size(0);

    const int xOut = get_global_id(0);
    const int yOut = get_global_id(1);

    const int xInTopLeft = xOut;
    const int yInTopLeft = yOut;

    float4 sum4 = 0;
    for (int r = 0; r < nFilterWidth; r++)
    {
        const int idxFtmp = r * nFilterWidth;

        const int yIn = yInTopLeft + r;
        const int idxIntmp = yIn * nInWidth + xInTopLeft;

        int c = 0;
        int c4 = 0;
        while (c <= nFilterWidth-4)
        {
            float4 filter4 = vload4(c4, pFilter+idxFtmp);
            float4 in4 = vload4(c4, pInput+idxIntmp);
            sum4 += in4 * filter4;
            c += 4;
            c4++;
        }

        int cMod = nFilterWidth - c;
        if (cMod == 1)
        {
            int idxF = idxFtmp + c;
            int idxIn = idxIntmp + c;
            sum4.x += pFilter[idxF]*pInput[idxIn];
        }
        else if (cMod == 2)
        {
            //Use float4 here to further optimize the kernel
            int idxF = idxFtmp + c;
            int idxIn = idxIntmp + c;
            sum4.x += pFilter[idxF]*pInput[idxIn];
            sum4.y += pFilter[idxF+1]*pInput[idxIn+1];
        }
        else if (cMod == 3)
        {
            //Use float4 here to further optimize the kernel
            int idxF = idxFtmp + c;
            int idxIn = idxIntmp + c;
            sum4.x += pFilter[idxF]*pInput[idxIn];
            sum4.y += pFilter[idxF+1]*pInput[idxIn+1];
            sum4.z += pFilter[idxF+2]*pInput[idxIn+2];
        }
    } //for (int r = 0...
    const int idxOut = yOut * nWidth + xOut;
    pOutput[idxOut] = sum4.x + sum4.y + sum4.z + sum4.w;
}
Performance (2)
Kernel Convolve_Float4If gives us the best performance so far. As with the loop-unrolled kernels, the speedups are better for larger filters; small filters actually suffer from the overhead of the additional special-case handling.*

Vectorized Kernel with Invariants (3)
We can use the idea from the Invariants write-up to reduce the times for small filters. Instead of passing filterWidth as an argument to the kernel, we will define the value for FILTER_WIDTH when we build the OpenCL program object. That change gives us the kernels Convolve_Def_Float4 and Convolve_Def_Float4If.*

OpenMP comparison
So how do these results compare to the convolution code written in C/C++? We will compare the OpenCL kernel performance against a multi-threaded (OpenMP) implementation of the Convolve() function as given in this article. The test machine is a quad-core AMD Phenom™ X4 9950 Black Edition processor with 8GB RAM.*
While many of the optimizations discussed here are equally useful for the C/C++ implementation, our objective for this comparison is to see whether the OpenCL kernel can give us competitive performance by transparently using multi-threading and SSE capabilities of the CPU, without requiring much coding effort from the user.
This graph shows the OpenMP (multi-threaded, 4-core) performance against the simple, unrolled, and invariant versions of the OpenCL kernels. The 4-core OpenMP performance is the baseline at 100%; single-threaded C code performance sits at approximately the 400% line. Compared to a single-threaded implementation, the OpenCL kernels perform well because they use all four cores of the CPU.

The next graph compares the OpenMP (multi-threaded, 4-core) performance with the vectorized kernel implementations. We know these kernels are compiled into SSE code for the CPU by the AMD OpenCL implementation.
