Arbitrary position 2-input shuffling using SSE

Question

I have two 4 component vectors which I load into two __m128 variables. Then I need to shuffle those so that the result looks like this:

Given:

__m128 mmMin = _mm_load_ps(&glm::vec4(-1.0f,-2.0f,-3.0f,-4.0f)[0]); __m128 mmMax = _mm_load_ps(&glm::vec4(1.0f,2.0f,3.0f,4.0f)[0]);

I want the result of the shuffle to look like this:

 // {mmMin.x,mmMax.x,mmMin.x,mmMax.x}

But I see it is not possible to do with _mm_shuffle_ps.

From SSE docs I see _mm_shuffle_ps masks always inserts into result 2 values from the lower 2 components of __m128 first,then 2 from the high 2 components.

SPU intrinsics have si_shufb method which allows defining qword based mask and shuffle whatever position I wish. Is there a similar method in SSE?

I am using SSE2, but will be happy also to see how it can be done with other versions, including AVX.

Probably can use up to SSE4 but prefer staying at SSE2 level. — Michael IV, CommentedJul 16, 2019 at 11:09
Did you check what gcc's __builtin_shuffle or clang's __builtin_shufflevector generate? — Marc Glisse, CommentedJul 16, 2019 at 11:30
with that function. With what function? Which member is .x? You might need unpcklps and then a movsldup or unpcklpd to duplicate the low half. (Or shufps same,same but that's potentially slower on old CPUs, and costs an extra byte for the AVX version). To do this in 1 shuffle you might need AVX512F vpermt2ps. (These instructions all have intrinsics, but mnemonics are easier to remember and type). There aren't 2-input variable-control shuffles until AVX512F, and I don't think any of the fixed shuffles are flexible enough for you, with shufps being the closest. — Peter Cordes, CommentedJul 16, 2019 at 11:56

Peter Cordes · Accepted Answer · 2019-07-16 13:47:31Z

With only SSE2 I think you need 2 shuffles: unpcklps to interleave and then unpcklpd same,same or shufps same,same to broadcast the low 64 bits.

With AVX512F, vpermt2ps can do this in one shuffle (using a control vector); I don't think there are any 2-source shuffles in AVX2 or earlier with fine enough granularity and flexible source locations before that. And no fixed shuffles that duplicate an element along with interleaving.

2-source shuffles are rare until AVX512: mostly fixed shuffles like unpckl/h* and palignr. It's mostly just [v]shufps / [v]shufpd until then. Variable-control shuffles are also rare: until AVX, the only one is pshufb. AVX1/2 added some variable-control dword-element shuffles, but only for 1 source. There are no variable-control 2-source shuffles until AVX512.

Immediate shuffles would need more than 4 groups of 2-bit indices to handle arbitrary indexing into the concatenation of two 4-element vectors. But x86 SIMD instructions always have at most an 8-bit immediate operand. Unfortunately no broadcast-immediate like ARM has that could efficiently create a vector of 1.0f or whatever.

AVX

Since you only need 1 element from each vector, instead of loading a whole vector you can use an AVX broadcast-load and then vblendps

Broadcast-loads are the same cost as normal loads on Intel CPUs (don't cost you a uop for the shuffle port, purely handled in the load port). They can't fold into memory operands for ALU instructions until AVX512F, but they do avoid shuffle-port bottlenecks. AMD CPUs may still need an ALU uop but they have more shuffle ALUs so shuffle throughput isn't a bottleneck nearly as much. (https://agner.org/optimize/)

Ryzen vbroadcastss xmm, [mem] is 2 separate uops for the front-end unfortunately, but it still has 2-per-clock throughput.

blend-immediate on dword and later elements is very efficient and can run on any port on Haswell and later, or 2 ports on SnB/IvB and Ryzen. But still single uop / 1c latency even on Nehalem.

#include <immintrin.h> __m128 broadcast_interleave_scalars_avx(const float *min, const float *max) { __m128 minx = _mm_broadcast_ss(min); __m128 maxx = _mm_broadcast_ss(max); return _mm_blend_ps(minx, maxx, 0b1010); }

On Godbolt, clang's asm comments confirm that I got the blend constant right:

 vbroadcastss xmm0, dword ptr [rdi] vbroadcastss xmm1, dword ptr [rsi] vblendps xmm0, xmm0, xmm1, 10 # xmm0 = xmm0[0],xmm1[1],xmm0[2],xmm1[3]

If your data was already in registers, not freshly loaded, you might want to just use 2 shuffles.

With SSE4.1 you might be able to do 2x movddup loads to broadcast 64 bits from memory (including the 32 bits you care about) then blendps. The first load will load 32 bits past the float you care about, the 2nd will load 32 bits before the float you care about.

To get a C++ compiler to emit this for you you'll have to pointer-cast to double* for the __m128d _mm_loaddup_pd (double const* mem_addr) loads, and then use _mm_castpd_ps to get __m128 from __m128d.

https://www.felixcloutier.com/x86/movsldup could also be useful to set up for unpcklps.

Nice trick with _mm_broadcast_ss, though I hoped there would be SPU's like single instruction command to perform such a shuffle. — Michael IV, CommentedJul 16, 2019 at 13:33
@MichaelIV: There is, with AVX512 vpermt2d. SPU is PowerPC Altivec, right? Unfortunately x86 SIMD has very limited 2-source shuffles until AVX512F, nothing that matches the flexibility of selecting 16 bytes from the concatenation of 2x 16-byte vectors. (Until AVX512VBMI vpermt2b) — Peter Cordes, CommentedJul 16, 2019 at 13:49
Btw, two-input variable shuffles were kinda impossible prior to Skylake anyway due to the 2-input uop limitation. The FMAs were very hacky to begin with. — Mysticial, CommentedJul 16, 2019 at 15:44
@Mysticial: Broadwell added support for 1-uop cmov, adc, sbb. (And ADCX/ADOX). But beyond FMA, no 3-input vector instructions; 1-uop non-VEX blendvps xmm, xmm (implicit XMM0) was new with Skylake. Anyway yeah, good point about uop input limitations. So that combined with an imm8 not having enough bits for a fully-flexible 2-input shuffle pretty much rules out any hope for Intel adding it as either immediate or vector-controlled in AVX2 or earlier. There might have been an AMD XOP shuffle like this in Bulldozer-family, but that's nearly irrelevant at this point. — Peter Cordes, CommentedJul 16, 2019 at 16:00
I recently pulled XOP from my Bulldozer-optimized binaries since it was getting harder and harder to test them. (I rarely ever turn on my one Bulldozer box anymore.) Zen 1/+ still (secretly) supported FMA4. But they took even that out of Zen 2. So I'm kinda sad. :( — Mysticial, CommentedJul 16, 2019 at 16:13

Collectives™ on Stack Overflow

Arbitrary position 2-input shuffling using SSE

1 Answer 1

AVX

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

AVX

Related