I have a control problem that I am simulating in C. The premise is that I have some continuous-time signal s(t)
and I am given some length-150 array c = [c(0), c(dt), c(2dt), ...]
of controls. The array is a discretized version of a continuous-time control c(t)
, but I am not actually given access to the full c(t)
, only to the array c
with a fixed value of dt
.
Given c
and dt
, I create an array s = [s(0), s(dt), s(2dt), ...]
. I then need to compute the following function:
#include <math.h> #include <complex.h> #define SEQUENCE_LENGTH 150 double f(double s[], double c[], double dt) { // s and c are both length SEQUENCE_LENGTH arrays double complex alpha = 1. / sqrt(2) + 0.0 * I; double complex beta = 1. / sqrt(2) + 0.0 * I; double complex alphanew; double complex betanew; double x, y, norm, co, si; for (int j = 0; j < SEQUENCE_LENGTH; j++) { x = c[j]; y = s[j]; norm = sqrt(x*x + y*y); co = cos(norm * dt); si = sin(norm * dt) / norm; alphanew = co * alpha - I * si * (y * alpha + x * beta); betanew = co * beta - I * si * (x * alpha - y * beta); alpha = alphanew; beta = betanew; } return (1 / 2.) * pow(cabs(alpha + beta), 2); }
This function is called on the order of a hundred million times for various different c
and s
and it is the largest bottleneck in my simulation, so I really need to make it as fast as possible. Does anybody have any suggestions/tips/tricks to speed up this code?
Various thoughts. Currently, I compile with gcc
with the flags -Wall -Wextra -Werror -O3 -ffast-math
.
- Because I know
SEQUENCE_LENGTH
is 150, manually unrolling part of the loop would be straightforward, but I'm sure the compiler already does this. - I could perhaps vectorize some of the calculations in the loop, though again I assume the compiler is already doing this. Nonetheless I would still be interested in seeing an explicit vectorization of my code (for academic purposes), but I am not very familiar with vectorization.
- Is there any way to parallelize this code, eg with
#pragma amp parallel
? I don't see an obvious way.
sin
is divided bynorm
, whilecos
is not?\$\endgroup\$sincos
though, that's neat. The sine and cosine would be significant obstacles to manual vectorization as well, it's not impossible but please give any information that might help (eg limits on the range of the input to sin/cos, required degree of precision..) Often when there's sin+cos in a loop they can be replaced with vector rotations but looks like not this time.\$\endgroup\$c
array has entries between -1 and 1. Thes
array will typically have entires between -1 and 1 as well, although it in principle get much larger. Regarding precision, I would preferdouble
, but I guess less precision may be okay. You mentioned that sine and cosine are obstacle to manual vectorization -- can you expand on that? I don't know very much about vectorization, so I'd be curious to learn what exactly you mean here.\$\endgroup\$