I came across this popular ~9-year-old SO question and decided to double-check its results.
So: I have an AMD Ryzen 9 5950X, clang++ 10, and Linux. I copy-pasted the code from the question, and here is what I got:
Sorted - 0.549702s:
```
~/d/so_sorting_faster$ cat main.cpp | grep "std::sort" && clang++ -O3 main.cpp && ./a.out
std::sort(data, data + arraySize);
0.549702
sum = 314931600000
```
Unsorted - 0.546554s:
```
~/d/so_sorting_faster $ cat main.cpp | grep "std::sort" && clang++ -O3 main.cpp && ./a.out
// std::sort(data, data + arraySize);
0.546554
sum = 314931600000
```
I am pretty sure that the unsorted version coming out 3 ms faster is just noise, but it seems it is not slower anymore.
So, what has changed in the CPU architecture (so that the unsorted case is no longer an order of magnitude slower)?
Here are results from multiple runs:
```
Unsorted: 0.543557  0.551147  0.541722  0.555599
Sorted:   0.542587  0.559719  0.53938   0.557909
```
Just in case, here is my main.cpp:
```cpp
#include <algorithm>
#include <ctime>
#include <iostream>

int main()
{
    // Generate data
    const unsigned arraySize = 32768;
    int data[arraySize];

    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    // !!! With this, the next loop runs faster.
    // std::sort(data, data + arraySize);

    // Test
    clock_t start = clock();
    long long sum = 0;

    for (unsigned i = 0; i < 100000; ++i)
    {
        // Primary loop
        for (unsigned c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }

    double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;

    std::cout << elapsedTime << std::endl;
    std::cout << "sum = " << sum << std::endl;

    return 0;
}
```
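A side note on the timer: `clock()` reports CPU time rather than wall-clock time. For a single-threaded, CPU-bound loop like this the two should be nearly identical, but here is a minimal sketch (my own variant, not part of the original benchmark) of the same measurement done with `std::chrono`:

```cpp
#include <chrono>
#include <cstdlib>
#include <iostream>

// Illustration only (not part of the original benchmark): time a callable
// with a monotonic wall clock instead of clock(), which reports CPU time.
template <class F>
double measureSeconds(F&& work)
{
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - start).count();
}

int main()
{
    const unsigned arraySize = 32768;
    static int data[arraySize];
    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    long long sum = 0;
    const double seconds = measureSeconds([&] {
        // Same primary loop as in the benchmark above.
        for (unsigned i = 0; i < 100000; ++i)
            for (unsigned c = 0; c < arraySize; ++c)
                if (data[c] >= 128)
                    sum += data[c];
    });

    std::cout << seconds << " s, sum = " << sum << std::endl;
    return 0;
}
```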
Update
With a larger number of elements (627680):
Unsorted:

```
cat main.cpp | grep "std::sort" && clang++ -O3 main.cpp && ./a.out
// std::sort(data, data + arraySize);
10.3814
```

Sorted:

```
cat main.cpp | grep "std::sort" && clang++ -O3 main.cpp && ./a.out
std::sort(data, data + arraySize);
10.6885
```
I think the question is still relevant - almost no difference.
I also checked whether `-O1` includes the vectorization optimization. That's interesting! Clang needs `-O2` to auto-vectorize, it seems, but even at `-O1` it generates branchless scalar code: see the conditional move `cmovle` at line 40, where `edx` contains `data[c]` and `r15d` is zero.
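To make that concrete, here is a rough source-level model (my own sketch, not the compiler's actual output) of what the branchless code computes; since the addend is selected with a conditional move rather than a branch, the ordering of the data stops mattering:

```cpp
// Source-level approximation of the branchless loop body described above:
// instead of branching on data[c] >= 128, the value to add is selected,
// roughly the way cmovle picks 0 (r15d) when data[c] (edx) is <= 127.
long long branchlessSum(const int* data, unsigned size)
{
    long long sum = 0;
    for (unsigned c = 0; c < size; ++c)
    {
        const int value = data[c];
        sum += (value >= 128) ? value : 0;  // typically compiled to cmov or a SIMD select
    }
    return sum;
}
```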