
Canonical sign-replication operation #437

Open

Maratyszcza opened this issue Feb 1, 2021 · 6 comments

Comments

@Maratyszcza
Contributor

Maratyszcza commented Feb 1, 2021

Sign-replication is an often-used operation that replicates the sign bit of a SIMD lane into all bits of the lane. There are two reasons why we need to pay attention to sign-replication. First, it can be encoded in WebAssembly SIMD in several ways:

  • i8x16.shr_s(v, -1)/i16x8.shr_s(v, -1)/i32x4.shr_s(v, -1)/i64x2.shr_s(v, -1)
  • i8x16.shr_s(v, 7)/i16x8.shr_s(v, 15)/i32x4.shr_s(v, 31)/i64x2.shr_s(v, 63)
  • i8x16.neg(i8x16.shr_u(v, -1))/i16x8.neg(i16x8.shr_u(v, -1))/i32x4.neg(i32x4.shr_u(v, -1))/i64x2.neg(i64x2.shr_u(v, -1))
  • i8x16.neg(i8x16.shr_u(v, 7))/i16x8.neg(i16x8.shr_u(v, 15))/i32x4.neg(i32x4.shr_u(v, 31))/i64x2.neg(i64x2.shr_u(v, 63))
  • i8x16.lt_s(v, v128.const(0))/i16x8.lt_s(v, v128.const(0))/i32x4.lt_s(v, v128.const(0))/i64x2.lt_s(v, v128.const(0))

Second, sign-replication can be lowered in many ways depending on the data type and the target instruction set, as noted by @jan-wassenberg in Sign Select instructions #124.
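
For reference, all of these encodings compute the same per-lane result; a minimal scalar sketch in C (the helper name is illustrative, not from any spec or engine):

```c
#include <stdint.h>

/* Scalar reference for one 8-bit lane: every encoding above produces 0xFF
 * when the sign bit of the lane is set and 0x00 otherwise. The same idea
 * applies to 16-, 32-, and 64-bit lanes with shift counts 15, 31, and 63. */
static int8_t sign_replicate_i8(int8_t lane) {
    return lane < 0 ? (int8_t)-1 : 0; /* same value as an arithmetic shift right by 7 */
}
```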

My suggestion is:

  • To standardize i8x16.shr_s(v, -1)/i16x8.shr_s(v, -1)/i32x4.shr_s(v, -1)/i64x2.shr_s(v, -1) and i8x16.shr_s(v, 7)/i16x8.shr_s(v, 15)/i32x4.shr_s(v, 31)/i64x2.shr_s(v, 63) as the canonical sign-replication instructions, and recommend that WebAssembly engines lower them differently than other arithmetic shift instructions.
  • To provide an informative recommendation on optimal lowering depending on the instruction set (see below).

Mapping to Common Instruction Sets

This section illustrates how the canonical sign-replication patterns can be lowered on common instruction sets. However, these lowerings are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.

x86/x86-64 processors with AVX512F and AVX512VL instruction sets

  • i64x2.shr_s(v, -1)/i64x2.shr_s(v, 63)
    • y = i64x2.shr_s(x, 63) is lowered to VPSRAQ xmm_y, xmm_x, 63
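
A sketch of this lowering in C intrinsics (the function name is illustrative; requires AVX512F and AVX512VL):

```c
#include <immintrin.h>

/* i64x2.shr_s(v, 63): _mm_srai_epi64 compiles to VPSRAQ (AVX512F + AVX512VL). */
static __m128i i64x2_sign_replicate_avx512(__m128i x) {
    return _mm_srai_epi64(x, 63);
}
```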

x86/x86-64 processors with AVX instruction set

  • i8x16.shr_s(v, -1)/i8x16.shr_s(v, 7)
    • y = i8x16.shr_s(x, 7) (y is NOT x) is lowered to:
      • VPXOR xmm_y, xmm_y, xmm_y
      • VPCMPGTB xmm_y, xmm_y, xmm_x
    • x = i8x16.shr_s(x, 7) is lowered to:
      • VPXOR xmm_tmp, xmm_tmp, xmm_tmp
      • VPCMPGTB xmm_x, xmm_tmp, xmm_x
  • i16x8.shr_s(v, -1)/i16x8.shr_s(v, 15)
    • y = i16x8.shr_s(x, 15) is lowered to VPSRAW xmm_y, xmm_x, 15
  • i32x4.shr_s(v, -1)/i32x4.shr_s(v, 31)
    • y = i32x4.shr_s(x, 31) is lowered to VPSRAD xmm_y, xmm_x, 31
  • i64x2.shr_s(v, -1)/i64x2.shr_s(v, 63)
    • y = i64x2.shr_s(x, 63) is lowered to:
      • VPSRAD xmm_y, xmm_x, 31
      • VPSHUFD xmm_y, xmm_y, 0xF5
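
The same lowerings expressed as a sketch in C intrinsics (function names are illustrative; a compiler targeting AVX emits the VEX-encoded forms listed above):

```c
#include <immintrin.h>

/* i8x16.shr_s(v, 7): compare 0 > x, producing all-ones in negative lanes
 * (VPXOR + VPCMPGTB). */
static __m128i i8x16_sign_replicate_avx(__m128i x) {
    return _mm_cmpgt_epi8(_mm_setzero_si128(), x);
}

/* i64x2.shr_s(v, 63): 32-bit arithmetic shift, then copy the high dword of
 * each 64-bit lane into both halves (VPSRAD + VPSHUFD with imm8 = 0xF5). */
static __m128i i64x2_sign_replicate_avx(__m128i x) {
    __m128i hi = _mm_srai_epi32(x, 31);
    return _mm_shuffle_epi32(hi, _MM_SHUFFLE(3, 3, 1, 1)); /* 0xF5 */
}
```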

x86/x86-64 processors with SSE2 instruction set

  • i8x16.shr_s(v, -1)/i8x16.shr_s(v, 7)
    • y = i8x16.shr_s(x, 7) (y is NOT x) is lowered to:
      • PXOR xmm_y, xmm_y
      • PCMPGTB xmm_y, xmm_x
    • x = i8x16.shr_s(x, 7) is lowered to:
      • MOVDQA xmm_tmp, xmm_x
      • PXOR xmm_x, xmm_x
      • PCMPGTB xmm_x, xmm_tmp
  • i16x8.shr_s(v, -1)/i16x8.shr_s(v, 15)
    • y = i16x8.shr_s(x, 15) is lowered to:
      • PXOR xmm_y, xmm_y
      • PCMPGTW xmm_y, xmm_x
    • x = i16x8.shr_s(x, 15) is lowered to PSRAW xmm_x, 15
  • i32x4.shr_s(v, -1)/i32x4.shr_s(v, 31)
    • y = i32x4.shr_s(x, 31) is lowered to:
      • PXOR xmm_y, xmm_y
      • PCMPGTD xmm_y, xmm_x
    • x = i32x4.shr_s(x, 31) is lowered to PSRAD xmm_x, 31
  • i64x2.shr_s(v, -1)/i64x2.shr_s(v, 63)
    • y = i64x2.shr_s(x, 63) is lowered to:
      • PSHUFD xmm_y, xmm_x, 0xF5
      • PSRAD xmm_y, 31
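
A sketch of the SSE2 i64x2 case in C intrinsics (the function name is illustrative). The shuffle comes first here, presumably because PSHUFD can write to a fresh register while PSRAD is destructive, which is why the instruction order differs from the AVX sequence above:

```c
#include <emmintrin.h>

/* i64x2.shr_s(v, 63) without a 64-bit arithmetic shift: broadcast the high
 * dword of each 64-bit lane (PSHUFD 0xF5), then shift it right by 31
 * (PSRAD), yielding 0 or -1 in every dword of the lane. */
static __m128i i64x2_sign_replicate_sse2(__m128i x) {
    __m128i hi = _mm_shuffle_epi32(x, _MM_SHUFFLE(3, 3, 1, 1)); /* 0xF5 */
    return _mm_srai_epi32(hi, 31);
}
```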

ARM64 processors

  • i8x16.shr_s(v, -1)/i8x16.shr_s(v, 7)
    • y = i8x16.shr_s(x, 7) is lowered to CMLT Vy.16B, Vx.16B, #0
  • i16x8.shr_s(v, -1)/i16x8.shr_s(v, 15)
    • y = i16x8.shr_s(x, 15) is lowered to CMLT Vy.8H, Vx.8H, #0
  • i32x4.shr_s(v, -1)/i32x4.shr_s(v, 31)
    • y = i32x4.shr_s(x, 31) is lowered to CMLT Vy.4S, Vx.4S, #0
  • i64x2.shr_s(v, -1)/i64x2.shr_s(v, 63)
    • y = i64x2.shr_s(x, 63) is lowered to CMLT Vy.2D, Vx.2D, #0
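
An equivalent sketch using ACLE NEON intrinsics (function names are illustrative; the compare-less-than-zero intrinsics map to CMLT #0 on AArch64):

```c
#include <arm_neon.h>

/* i8x16.shr_s(v, 7): CMLT Vy.16B, Vx.16B, #0. The intrinsic returns an
 * unsigned mask, so reinterpret back to the signed element type. */
static int8x16_t i8x16_sign_replicate_a64(int8x16_t x) {
    return vreinterpretq_s8_u8(vcltzq_s8(x));
}

/* i64x2.shr_s(v, 63): CMLT Vy.2D, Vx.2D, #0. */
static int64x2_t i64x2_sign_replicate_a64(int64x2_t x) {
    return vreinterpretq_s64_u64(vcltzq_s64(x));
}
```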

ARMv7 processors with NEON extension

  • i8x16.shr_s(v, -1)/i8x16.shr_s(v, 7)
    • y = i8x16.shr_s(x, 7) is lowered to VCLT.S8 Qy, Qx, #0
  • i16x8.shr_s(v, -1)/i16x8.shr_s(v, 15)
    • y = i16x8.shr_s(x, 15) is lowered to VCLT.S16 Qy, Qx, #0
  • i32x4.shr_s(v, -1)/i32x4.shr_s(v, 31)
    • y = i32x4.shr_s(x, 31) is lowered to VCLT.S32 Qy, Qx, #0
  • i64x2.shr_s(v, -1)/i64x2.shr_s(v, 63)
    • y = i64x2.shr_s(x, 63) is lowered to VSHR.S64 Qy, Qx, #63
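
And a sketch for ARMv7 NEON (function names are illustrative): the narrower lanes use a compare against zero, while i64x2 falls back to a 64-bit arithmetic shift because ARMv7 NEON has no 64-bit compare.

```c
#include <arm_neon.h>

/* i8x16.shr_s(v, 7): VCLT.S8 Qy, Qx, #0. */
static int8x16_t i8x16_sign_replicate_neon(int8x16_t x) {
    return vreinterpretq_s8_u8(vcltq_s8(x, vdupq_n_s8(0)));
}

/* i64x2.shr_s(v, 63): VSHR.S64 Qy, Qx, #63. */
static int64x2_t i64x2_sign_replicate_neon(int64x2_t x) {
    return vshrq_n_s64(x, 63);
}
```
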
@jan-wassenberg

@Maratyszcza Interesting. Do I understand correctly that an engine would have to check whether the shift count matches the magic value, and then emit the corresponding code? It seems that might hit branch mispredicts if mixed with regular non-magic shift operations?
Is there a particular concern about adding another instruction, or are you hoping to benefit from the special case in normal shifts that happen to use the magic shift amount?

@tlively
Member

tlively commented Feb 1, 2021

I would hope that engines would be able to recognize that e.g. i8x16.shr_s(v, -1) is the same as i8x16.shr_s(v, 7). FWIW, LLVM will never emit the former and already optimizes the lt_s patterns to their shift equivalents. If these shift-by-constant patterns are important for performance, we should certainly document a recommendation that engines treat them specially. This ties in to the documentation recommendation in #430.

@Maratyszcza
Contributor Author

Do I understand correctly that an engine would have to check whether the shift count matches the magic value, and then emit the corresponding code? It seems that might hit branch mispredicts if mixed with regular non-magic shift operations?

Engines already treat shifts with a static count differently from shifts with a dynamic count. What I suggest is an additional check and codegen specialization when the shift count is static. Since this is a compile-time decision, no branches will be generated in the translated code.
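
A minimal sketch of such a compile-time check (the helper name and signature are hypothetical, not taken from any real engine):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: decide while translating the module whether a constant
 * shift count on lanes of `lane_bits` width is the sign-replication case.
 * WebAssembly masks the count by the lane width, so i32x4.shr_s(v, -1) and
 * i32x4.shr_s(v, 31) both qualify, and the emitted machine code contains no
 * extra branch. */
static bool is_sign_replication(uint32_t lane_bits, int32_t shift_count) {
    return ((uint32_t)shift_count & (lane_bits - 1u)) == lane_bits - 1u;
}
```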

@Maratyszcza
Contributor Author

Is there a particular concern about adding another instruction, or are you hoping to benefit from the special case in normal shifts that happen to use the magic shift amount?

We already locked the instruction proposals in November, so any new instruction proposal would have to wait until SIMD post-MVP.

@dtig
Member

dtig commented Feb 2, 2021

We don't have a way to standardize certain immediate patterns as the canonical form of an operation; this is something engines could optimize as described. I'll label this as perf documentation for now while we figure out the best way to surface this information, perhaps in #436 if we get through the other items already on the agenda.

@Maratyszcza Interesting. Do I understand correctly that an engine would have to check whether the shift count matches the magic value, and then emit the corresponding code? It seems that might hit branch mispredicts if mixed with regular non-magic shift operations?
Is there a particular concern about adding another instruction, or are you hoping to benefit from the special case in normal shifts that happen to use the magic shift amount?

For this particular case, the overhead of adding new operations does not seem warranted, as the optimization is a special case of the shift operation and can be handled in a relatively straightforward way in the engine.

@jan-wassenberg

@Maratyszcza @dtig @tlively I understand, thanks for explaining. Documentation + internal optimization in engines sounds good!
