SSE3 in DC++
October 26, 2016
The next DC++ release will require SSE3. Steam’s hardware survey currently lists SSE3 as having 99.96% penetration. All AMD and Intel x86 CPUs since the Athlon 64 X2 in 2005 and Intel Core in January 2006 have supported SSE3. Even earlier, though, all Pentium 4 steppings since Prescott which support the NX bit required by Windows 8 and 10 also support SSE3, which extends the effective Intel support back to 2004. I can’t find an Intel CPU which supports NX (required for Win8/10) but not SSE3. Finally, this effectively affects only 32-bit builds, since 64-bit builds exclusively use SSE for floating-point arithmetic.
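For the small remainder of machines without SSE3, a startup check is one way to fail with a clear message instead of an illegal-instruction fault. The following is a minimal sketch, assuming GCC or Clang on x86, where `<cpuid.h>` provides `__get_cpuid`. The helper names `ecxHasSse3` and `cpuSupportsSse3` are hypothetical and are not part of DC++.

```cpp
#include <cstdint>

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>
#define HAVE_X86_CPUID 1
#else
#define HAVE_X86_CPUID 0
#endif

// CPUID leaf 1 reports SSE3 support in bit 0 of ECX.
constexpr bool ecxHasSse3(std::uint32_t ecx) {
    return (ecx & 1u) != 0;
}

// Hypothetical helper: returns false when CPUID cannot be queried
// (non-x86 target or a compiler without <cpuid.h>).
bool cpuSupportsSse3() {
#if HAVE_X86_CPUID
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) == 0) {
        return false;  // leaf 1 not supported
    }
    return ecxHasSse3(ecx);
#else
    return false;
#endif
}
```

Such a check would have to run before any code compiled with SSE3 enabled, which in practice means keeping it in a translation unit built without that flag.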
This effects two basic transformations, one minor and one major, depending on how well the existing code compiles. The minor improvement derives from functions such as bool SettingsDialog::handleClosing() using one instruction rather than two, from
    bool SettingsDialog::handleClosing() {
        dwt::Point pt = getWindowSize();
        SettingsManager::getInstance()->set(SettingsManager::SETTINGS_WIDTH,
    cvttss2si eax,DWORD PTR [esp+0x18]    ;; eax is just a temporary
    mov       DWORD PTR [edx+0x87c],eax   ;; which is promptly stored to mem
to
    bool SettingsDialog::handleClosing() {
        dwt::Point pt = getWindowSize();
        SettingsManager::getInstance()->set(SettingsManager::SETTINGS_WIDTH,
    fisttp    DWORD PTR [edx+0x87c]       ;; no byway through eax (also, less register pressure)
However, sometimes cvttss2si and related SSE/SSE2 instructions don’t fit as well, so g++ had been relying on fistp. These instances previously produced terrible code. Without SSE3, using only instructions through SSE2, part of void SearchFrame::runSearch() compiles to:
    auto llsize = static_cast(lsize);
    fnstcw WORD PTR [ebp-0x50e]       ;; save FP control word to mem
    movzx  eax,WORD PTR [ebp-0x50e]   ;; zero-extend-move it to eax
    mov    ah,0xc                     ;; build new control word
    mov    WORD PTR [ebp-0x510],ax    ;; place control word in mem for fldcw
    fld    QWORD PTR [ebp-0x520]      ;; load lsize from mem (same as below)
    fldcw  WORD PTR [ebp-0x510]       ;; load new control word
    fistp  QWORD PTR [ebp-0x548]      ;; with correct control word, round lsize
    fldcw  WORD PTR [ebp-0x50e]       ;; restore previous control word
Six of these instructions (fnstcw, movzx, both movs, and both fldcws) just scaffold around the actual fistp doing the floating-point-to-integer conversion, and that scaffolding can cost 80 cycles or more for this single innocuous-looking line of code. By contrast, using fisttp from SSE3, that same fragment collapses to:
    auto llsize = static_cast(lsize);
    fld    QWORD PTR [ebp-0x520]      ;; same as above; load lsize
    fisttp QWORD PTR [ebp-0x548]      ;; convert it. simple.
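The scaffolding exists because a C++ floating-point-to-integer cast must truncate toward zero, whereas fistp rounds according to the current x87 rounding mode, which defaults to round-to-nearest. fisttp always truncates, so it needs no control word change. The difference in results can be seen in portable C++, using `std::llrint` as a stand-in for "convert using the current rounding mode". This is an analogy rather than what g++ emits, and both helper names are hypothetical.

```cpp
#include <cmath>
#include <cstdint>

// What a C++ cast must do: discard the fractional part (truncate toward zero).
inline std::int64_t truncateLikeCast(double value) {
    return static_cast<std::int64_t>(value);
}

// Converts using the current rounding mode, which is
// round-to-nearest-even unless the program has changed it.
inline std::int64_t roundWithCurrentMode(double value) {
    return static_cast<std::int64_t>(std::llrint(value));
}
```

With the default rounding mode, `truncateLikeCast(2.7)` yields 2 while `roundWithCurrentMode(2.7)` yields 3, which is why a bare fistp would give the wrong answer for a cast.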
This pattern recurs many times through DC++, including in void AdcHub::handle(AdcCommand::GET, …), a portion of which halves in size and becomes dramatically faster, going from
    // Ideal size for m is n * k / ln(2), but we allow some slack
    // When h >= 32, m can't go above 2^h anyway since it's stored in a size_t.
    if(m > (5 * Util::roundUp((int64_t)(n * k / log(2.)), (int64_t)64)) || (h < 32 && m > static_cast(1U << h))) {
    mov    DWORD PTR [esp+0x1c],edi
    xor    ecx,ecx
    imul   eax,DWORD PTR [esp+0x18]
    movd   xmm0,eax
    movq   QWORD PTR [esp+0x58],xmm0
    fild   QWORD PTR [esp+0x58]
    fdiv   QWORD PTR ds:0xca8
    fnstcw WORD PTR [esp+0x22]        ;; same control word dance as before
    movzx  eax,WORD PTR [esp+0x22]
    mov    ah,0xc                     ;; same control word
    mov    WORD PTR [esp+0x20],ax     ;; but fldcw loads from mem not reg
    fldcw  WORD PTR [esp+0x20]        ;; load C and C++-compatible rounding mode
    fistp  QWORD PTR [esp+0x58]       ;; the actual conversion
    fldcw  WORD PTR [esp+0x22]        ;; restore previous
    mov    eax,DWORD PTR [esp+0x58]
    mov    edx,DWORD PTR [esp+0x5c]
to, using the fisttp SSE3 instruction,
    // Ideal size for m is n * k / ln(2), but we allow some slack
    // When h >= 32, m can't go above 2^h anyway since it's stored in a size_t.
    if(m > (5 * Util::roundUp((int64_t)(n * k / log(2.)), (int64_t)64)) || (h < 32 && m > static_cast(1U << h))) {
    mov    DWORD PTR [esp+0x20],edi
    xor    ecx,ecx
    imul   eax,DWORD PTR [esp+0x1c]
    movd   xmm0,eax
    movq   QWORD PTR [esp+0x58],xmm0
    fild   QWORD PTR [esp+0x58]
    fdiv   QWORD PTR ds:0xca8
    fisttp QWORD PTR [esp+0x58]       ;; replaces the seven lines from fnstcw through the second fldcw
    mov    eax,DWORD PTR [esp+0x58]
    mov    edx,DWORD PTR [esp+0x5c]
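The `mov ah,0xc` step in the fistp version can be worked through by hand. In the x87 control word, bits 10 and 11 form the rounding-control field, and the value 3 there selects truncation. Writing 0x0c into the high byte sets exactly those two bits. A small sketch of that arithmetic follows; the function names are hypothetical, and 0x037f is used as the example input because it is the value the x87 loads on initialization.

```cpp
#include <cstdint>

// Mimics `mov ah,0xc` on a control word held in ax: the low byte
// (exception mask bits) is kept, the high byte becomes 0x0c.
constexpr std::uint16_t withHighByte0C(std::uint16_t controlWord) {
    return static_cast<std::uint16_t>((controlWord & 0x00ffu) | 0x0c00u);
}

// Extracts the rounding-control field (bits 10-11):
// 0 = nearest, 1 = down, 2 = up, 3 = toward zero (truncate).
constexpr unsigned roundingControl(std::uint16_t controlWord) {
    return (controlWord >> 10) & 0x3u;
}
```

For the input 0x037f the rounding-control field is 0 (nearest). After the high byte is replaced, the word is 0x0c7f and the field is 3 (truncate), which is the mode a C or C++ cast needs.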
This specific control word save/convert float/control word restore pattern recurs 19 other times across the current codebase in the dcpp, dwt, and win32 directories, including DownloadManager::getRunningAverage(); HashBloom::get_m(size_t n, size_t k); QueueItem::getDownloadedBytes(); Transfer::getParams(…); UploadManager::getRunningAverage(); Grid::calcSizes(…); HashProgressDlg::updateStats(); TransferView::on(HttpManagerListener::Updated, …); and TransferView::onTransferTick(…).
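To look at the two code shapes outside DC++, a one-function file is enough. This is a sketch that assumes a g++ able to target 32-bit x86. The file name, function name, and command lines are illustrative, and the exact output varies with compiler version and tuning options.

```cpp
// convert.cpp (hypothetical file name)
//
// One possible way to compare the generated assembly:
//   g++ -m32 -O2 -msse2 -S -masm=intel convert.cpp -o convert_sse2.s
//   g++ -m32 -O2 -msse3 -S -masm=intel convert.cpp -o convert_sse3.s
#include <cstdint>

// A double-to-64-bit-integer cast, the same kind of conversion
// as in the listings above.
std::int64_t toInt64(double value) {
    return static_cast<std::int64_t>(value);
}
```

Searching the two `.s` files for fldcw and fisttp should show whether a given compiler makes the same substitution.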
The article "Know your FPU: Fixing Floating Fast" provides microbenchmarks showing just how slow this fistp-based technique can be, because the fnstcw/fldcw pair forces an FPU pipeline flush costing 80 or more cycles, and therefore how much faster the code that replaces it can become:
    Fixed tests...
    Testing ANSI fixed() ... Time = 2974.57 ms
    Testing fistp fixed()... Time = 3100.84 ms
    Testing Sree fixed() ... Time = 606.80 ms
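The routine behind the fastest line is not shown here. One widely known way to convert without touching the control word is the magic-number trick sketched below. Whether it matches the benchmarked "Sree fixed()" code is an assumption, and the function name is hypothetical. It also rounds to nearest rather than truncating, so it is not a drop-in replacement for a C++ cast.

```cpp
#include <cstdint>
#include <cstring>

// Rounds to the nearest integer (ties to even under the default rounding
// mode) without changing the FPU control word. Only valid for
// |value| < 2^31, and only when the addition is carried out in true
// 64-bit double precision (for example, SSE2 scalar math).
inline std::int32_t magicRoundToInt(double value) {
    const double magic = 6755399441055744.0;  // 1.5 * 2^52
    const double shifted = value + magic;     // integer lands in the low mantissa bits
    std::uint64_t bits = 0;
    std::memcpy(&bits, &shifted, sizeof bits);  // well-defined way to read the bits
    return static_cast<std::int32_t>(static_cast<std::uint32_t>(bits));
}
```

The addition pushes the value into a range where doubles have a spacing of exactly 1, so the hardware's own rounding produces the integer, and the low 32 bits of the result hold it in two's complement form.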
SSE3 thus provides not merely a hidden, aesthetic improvement in code generation, but a speed increase across much of DC++.