i64x2.widen_(low/high)_i32x4_(s/u) instructions extensions for 8-bit and 16-bit integers #372

omnisip · 2020-10-06T16:02:24Z

Introduction

This proposal mirrors #290: it adds new variants of the existing widen instructions and extends the 32-bit and 64-bit widen instructions to accept 16-bit and 8-bit integer inputs. The practical use case is signal processing, specifically audio and image processing, but the instructions are broadly useful. Outside image processing, they help any time someone wants to convert an 8-bit value to a floating-point number. Currently, this requires multiple conversion steps between integer widths before the conversion to float, whereas modern architectures provide operations to convert from just about any integer size to another. Because going from 8 bits to 64 bits is not a single doubling, high/low is no longer enough to identify the source lanes, so this instruction will introduce new terminology that replaces high/low with a constant immediate parameter. This ticket will serve as the foundation for the PR that follows and will be updated with implementation details for each instruction set.
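To make the intended semantics concrete, here is a minimal scalar sketch in C of an 8-bit to 32-bit widen that takes a constant group index instead of high/low, followed by the conversion to float. The names `widen_i8x16_to_i32x4_s`, `widen_i8x16_to_i32x4_u`, `convert_i32x4_to_f32x4`, and the `group` parameter are hypothetical and used only for illustration; they are not instruction names from this proposal.

```c
#include <assert.h>
#include <stdint.h>

/* Plain-C stand-ins for 128-bit vectors (illustrative types only). */
typedef struct { int8_t lane[16]; } i8x16;
typedef struct { int32_t lane[4]; } i32x4;
typedef struct { float lane[4]; } f32x4;

/* Hypothetical signed widen: sign-extends source lanes
   4*group .. 4*group+3 to four 32-bit lanes. With sixteen 8-bit lanes
   and four 32-bit results there are four groups, so "low/high" is not
   enough and an index in 0..3 is used instead. */
i32x4 widen_i8x16_to_i32x4_s(i8x16 v, int group) {
    assert(group >= 0 && group < 4);
    i32x4 r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = (int32_t)v.lane[4 * group + i];
    return r;
}

/* Hypothetical unsigned widen: reinterprets each source lane as an
   unsigned byte and zero-extends it. */
i32x4 widen_i8x16_to_i32x4_u(i8x16 v, int group) {
    assert(group >= 0 && group < 4);
    i32x4 r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = (int32_t)(uint8_t)v.lane[4 * group + i];
    return r;
}

/* Lane-wise integer to float conversion, the step that follows the
   widen in the 8-bit to float use case. */
f32x4 convert_i32x4_to_f32x4(i32x4 v) {
    f32x4 r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = (float)v.lane[i];
    return r;
}
```

With a single widen of this shape, an 8-bit to float conversion is two steps (widen, convert) rather than a chain of 8 to 16 to 32 widens followed by the convert.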

Mapping to Common Instruction Sets

This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. These patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.

x86/x86-64 processors with AVX instruction set

x86/x86-64 processors with SSE4.1 instruction set

ARM64 processors

ARMv7 processors with NEON instruction set


Maratyszcza · 2020-10-31T00:21:11Z

This operation doesn't have a direct equivalent in ARM NEON or ARM64

omnisip · 2020-10-31T01:20:56Z

I'm still investigating, but I have found what seems to be an efficient solution for widening 16 8-bit integers to 16 32-bit integers. You can see the Godbolt here. (Edited to fix a typo in the Godbolt. This is definitely a work in progress.)

omnisip · 2020-10-31T02:29:25Z

In this Godbolt there are three samples showcasing different implementations on Intel. Interestingly, the shortest implementation underperforms both of the longer ones because pmovzxbd and palignr block the shuffle port. The two longer ones finish in 604 cycles, whereas the shortest takes 704.
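For readers without the Godbolt at hand, the following is a scalar C sketch of what a pmovzxbd plus palignr sequence computes. It is an assumption about the general shape of the shortest variant (4 pmovzxbd and 3 palignr), not a copy of the linked code; the `model_*` helpers and `widen_all_u` are hypothetical names that only imitate the instruction semantics.

```c
#include <stdint.h>

typedef struct { uint8_t b[16]; } u8x16;
typedef struct { uint32_t d[4]; } u32x4;

/* Models pmovzxbd: zero-extend the low four bytes of the source to
   four 32-bit lanes. */
u32x4 model_pmovzxbd(u8x16 v) {
    u32x4 r;
    for (int i = 0; i < 4; i++)
        r.d[i] = v.b[i];
    return r;
}

/* Models palignr with the same register as both operands: the 32-byte
   concatenation v:v is shifted right by n bytes and the low 16 bytes
   are kept, which amounts to a byte rotation. */
u8x16 model_palignr_self(u8x16 v, int n) {
    u8x16 r;
    for (int i = 0; i < 16; i++)
        r.b[i] = v.b[(i + n) % 16];
    return r;
}

/* Widen all sixteen bytes to four vectors of 32-bit lanes:
   one pmovzxbd per output plus three palignr to bring bytes 4, 8 and
   12 down to position 0. Every step is a shuffle-port operation. */
void widen_all_u(u8x16 v, u32x4 out[4]) {
    out[0] = model_pmovzxbd(v);
    for (int k = 1; k < 4; k++)
        out[k] = model_pmovzxbd(model_palignr_self(v, 4 * k));
}
```

The instruction count is low, but all seven operations in this model compete for the same shuffle resource, which is consistent with the slowdown described above.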

omnisip · 2020-11-06T02:09:29Z

@Maratyszcza Please take a look at this Godbolt.

Regarding your earlier point about missing ARM instructions, you're right. However, if you look at this analysis, you'll see that pmovsxbd and pmovzxbd don't provide any consequential benefit when working with 2 xmm registers. They help only when going to and from memory, and even that benefit is of limited significance on machines with only 1 shuffle port.

With respect to ARM, the 6-instruction pass that yields 4 vectors of 32-bit integers is most efficient in two specific cases:

The goal is to end up with only 4 32-bit vectors (signed or unsigned).
In any and all signed variants.

This is mostly because you don't get the individual 32-bit conversion efficiency seen with the lesser ops or with the large unsigned 32-bit cases, where the table indices can be reused multiple times. If the indices only need to be loaded once, the table transform will outperform in all cases where the output remains unsigned. The signed versus unsigned distinction is important on ARM64 because sshr competes with the tbl operations.
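As an illustration of the table transform for the unsigned case, here is a scalar C sketch of a single-register tbl lookup used for zero extension. It relies on the tbl behavior that an out-of-range index produces 0. The helper names (`model_tbl`, `zext_indices`, `lane_u32`) are hypothetical, and the index layout is an assumption about how such a lowering could look, not the exact code from the analysis.

```c
#include <stdint.h>

typedef struct { uint8_t b[16]; } u8x16;

/* Models the single-register form of AArch64 TBL: each result byte is
   table.b[idx.b[i]], or 0 when the index is out of range (>= 16). */
u8x16 model_tbl(u8x16 table, u8x16 idx) {
    u8x16 r;
    for (int i = 0; i < 16; i++)
        r.b[i] = (idx.b[i] < 16) ? table.b[idx.b[i]] : 0;
    return r;
}

/* Builds the index vector that zero-extends source bytes
   4*group .. 4*group+3. In little-endian lane layout, byte 0 of each
   32-bit lane selects a source byte and the other three bytes use the
   out-of-range index 0xFF so that TBL writes zeros there. */
u8x16 zext_indices(int group) {
    u8x16 idx;
    for (int i = 0; i < 16; i++)
        idx.b[i] = (i % 4 == 0) ? (uint8_t)(4 * group + i / 4) : 0xFF;
    return idx;
}

/* Reads 32-bit lane `lane` of a byte vector, little-endian. */
uint32_t lane_u32(u8x16 v, int lane) {
    return (uint32_t)v.b[4 * lane]
         | (uint32_t)v.b[4 * lane + 1] << 8
         | (uint32_t)v.b[4 * lane + 2] << 16
         | (uint32_t)v.b[4 * lane + 3] << 24;
}
```

The four index vectors depend only on the group, not on the data, so in a loop they can be built or loaded once and reused for every input vector. That reuse is what makes the table approach attractive when the output stays unsigned.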

On the other hand, x64 chips get a large performance advantage either way, despite a substantial increase in the number of instructions. Going from 6 shuffles (punpck* + pxor) or 7 shuffles (4 pmovsxbd + 3 palignr) to a 4-shuffle solution translates to at least a 30% performance improvement. In real terms, that means 600-700 cycles becoming 409 on Skylake. And if you need to keep the data signed? No problem. LLVM-MCA reports 410 cycles when adding 4 extra psrad instructions per loop.
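One way a single shuffle plus one psrad per output vector can produce signed results is sketched below in scalar C. This is an assumption about the technique, since the Godbolt is not reproduced here: a pshufb-style byte shuffle places each source byte in the top byte of its 32-bit lane, and an arithmetic right shift by 24 then sign-extends it. The names `model_pshufb`, `asr32`, and `sext_group` are hypothetical.

```c
#include <stdint.h>
#include <string.h>

typedef struct { uint8_t b[16]; } u8x16;
typedef struct { int32_t d[4]; } i32x4;

/* Models pshufb: an index byte with bit 7 set yields 0, otherwise the
   low four bits of the index select a source byte. */
u8x16 model_pshufb(u8x16 src, u8x16 idx) {
    u8x16 r;
    for (int i = 0; i < 16; i++)
        r.b[i] = (idx.b[i] & 0x80) ? 0 : src.b[idx.b[i] & 0x0F];
    return r;
}

/* Models psrad on one lane: arithmetic shift right, written so that it
   does not depend on how the compiler shifts negative values. */
int32_t asr32(int32_t x, int n) {
    return x < 0 ? ~(~x >> n) : x >> n;
}

/* Sign-extends source bytes 4*group .. 4*group+3 to 32-bit lanes:
   one shuffle moves each byte into the top byte of its lane (the other
   three bytes are zeroed), then one shift by 24 sign-extends. */
i32x4 sext_group(u8x16 src, int group) {
    u8x16 idx;
    for (int i = 0; i < 16; i++)
        idx.b[i] = (i % 4 == 3) ? (uint8_t)(4 * group + i / 4) : 0x80;
    u8x16 s = model_pshufb(src, idx);

    i32x4 r;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t bits = (uint32_t)s.b[4 * lane]
                      | (uint32_t)s.b[4 * lane + 1] << 8
                      | (uint32_t)s.b[4 * lane + 2] << 16
                      | (uint32_t)s.b[4 * lane + 3] << 24;
        int32_t v;
        memcpy(&v, &bits, sizeof v);  /* reinterpret bits as signed */
        r.d[lane] = asr32(v, 24);
    }
    return r;
}
```

In this model the four outputs cost four shuffles and four shifts in total. The shifts do not need the shuffle port, which would fit with the signed variant costing almost nothing extra in the LLVM-MCA numbers.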

omnisip changed the title ~~i64x2.widen_(low/high)_i32x4_(s/u) instructions extensions for 8-bit and 64-bit integers~~ i64x2.widen_(low/high)_i32x4_(s/u) instructions extensions for 8-bit and 16-bit integers Oct 6, 2020

omnisip added a commit to omnisip/simd that referenced this issue Nov 8, 2020

WebAssembly#372 Integer Sign/Zero Extension

e24b7a7

omnisip mentioned this issue Nov 8, 2020

#372 Integer Sign/Zero Extension for {8,16}->{32,64} #395

Closed
