With only SSE2 I think you need 2 shuffles: unpcklps
to interleave and then unpcklpd same,same
or shufps same,same
to broadcast the low 64 bits.
With AVX512F, vpermt2ps
can do this in one shuffle (using a control vector); I don't think there are any 2-source shuffles in AVX2 or earlier with fine enough granularity and flexible source locations before that. And no fixed shuffles that duplicate an element along with interleaving.
2-source shuffles are rare until AVX512: mostly fixed shuffles like unpckl/h*
and palignr
. It's mostly just [v]shufps
/ [v]shufpd
until then. Variable-control shuffles are also rare: until AVX, the only one is pshufb
. AVX1/2 added some variable-control dword-element shuffles, but only for 1 source. There are no variable-control 2-source shuffles until AVX512.
Immediate shuffles would need more than 4 groups of 2-bit indices to handle arbitrary indexing into the concatenation of two 4-element vectors. But x86 SIMD instructions always have at most an 8-bit immediate operand. Unfortunately no broadcast-immediate like ARM has that could efficiently create a vector of 1.0f or whatever.
AVX
Since you only need 1 element from each vector, instead of loading a whole vector you can use an AVX broadcast-load and then vblendps
Broadcast-loads are the same cost as normal loads on Intel CPUs (don't cost you a uop for the shuffle port, purely handled in the load port). They can't fold into memory operands for ALU instructions until AVX512F, but they do avoid shuffle-port bottlenecks. AMD CPUs may still need an ALU uop but they have more shuffle ALUs so shuffle throughput isn't a bottleneck nearly as much. (https://agner.org/optimize/)
Ryzen vbroadcastss xmm, [mem]
is 2 separate uops for the front-end unfortunately, but it still has 2-per-clock throughput.
blend-immediate on dword and later elements is very efficient and can run on any port on Haswell and later, or 2 ports on SnB/IvB and Ryzen. But still single uop / 1c latency even on Nehalem.
#include <immintrin.h> __m128 broadcast_interleave_scalars_avx(const float *min, const float *max) { __m128 minx = _mm_broadcast_ss(min); __m128 maxx = _mm_broadcast_ss(max); return _mm_blend_ps(minx, maxx, 0b1010); }
On Godbolt, clang's asm comments confirm that I got the blend constant right:
vbroadcastss xmm0, dword ptr [rdi] vbroadcastss xmm1, dword ptr [rsi] vblendps xmm0, xmm0, xmm1, 10 # xmm0 = xmm0[0],xmm1[1],xmm0[2],xmm1[3]
If your data was already in registers, not freshly loaded, you might want to just use 2 shuffles.
With SSE4.1 you might be able to do 2x movddup
loads to broadcast 64 bits from memory (including the 32 bits you care about) then blendps
. The first load will load 32 bits past the float
you care about, the 2nd will load 32 bits before the float
you care about.
To get a C++ compiler to emit this for you you'll have to pointer-cast to double*
for the __m128d _mm_loaddup_pd (double const* mem_addr)
loads, and then use _mm_castpd_ps
to get __m128
from __m128d
.
https://www.felixcloutier.com/x86/movsldup could also be useful to set up for unpcklps
.
__builtin_shuffle
or clang's__builtin_shufflevector
generate?.x
? You might needunpcklps
and then amovsldup
orunpcklpd
to duplicate the low half. (Orshufps same,same
but that's potentially slower on old CPUs, and costs an extra byte for the AVX version). To do this in 1 shuffle you might need AVX512Fvpermt2ps
. (These instructions all have intrinsics, but mnemonics are easier to remember and type). There aren't 2-input variable-control shuffles until AVX512F, and I don't think any of the fixed shuffles are flexible enough for you, withshufps
being the closest.