This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

i64x2.widen_(low/high)_i32x4_(s/u) instructions extensions for 8-bit and 16-bit integers #372

Open
omnisip opened this issue Oct 6, 2020 · 4 comments

Comments

@omnisip

omnisip commented Oct 6, 2020

Introduction

This proposal mirrors #290: it adds new variants of the existing widen instructions, extending the 32-bit and 64-bit widen instructions to accept 8-bit and 16-bit integer inputs. The practical use case is signal processing, specifically audio and image processing, but the instructions are broadly useful. Outside image processing, they would help any time someone wants to convert an 8-bit value to a floating-point number. Currently, this requires multiple conversion steps between integer widths before the conversion to float, even though modern architectures provide operations to convert from just about any integer size to another. Because the width ratio between 8 bits and 64 bits is larger than two, a single high/low selector cannot address every group of input lanes, so this instruction replaces the high/low terminology with a constant immediate parameter. This ticket will serve as the foundation for the PR that follows and will be updated with implementation details for each instruction set.
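To make the intended behaviour concrete, here is a minimal scalar sketch of what a direct 8-bit to 32-bit widen could compute, compared against chaining two of the existing half-widens. The function names, and the reading of the immediate as "which group of four bytes to widen", are assumptions made for illustration; the proposal text above does not fix them.

```python
# Scalar reference model. Lanes are unsigned bit patterns held in Python ints.
# All function names and the meaning of `imm` are illustrative assumptions.

def extend(value, from_bits, to_bits, signed):
    """Zero- or sign-extend a from_bits-wide pattern to to_bits."""
    if signed and value >> (from_bits - 1):
        # Negative input: fill the new upper bits with ones.
        value |= ((1 << to_bits) - 1) ^ ((1 << from_bits) - 1)
    return value

def widen_half(lanes, high, from_bits, signed):
    """Existing-style widen: double the width of the low or high half."""
    half = len(lanes) // 2
    chosen = lanes[half:] if high else lanes[:half]
    return [extend(v, from_bits, 2 * from_bits, signed) for v in chosen]

def widen_quarter(lanes, imm, signed):
    """Hypothetical direct 8-bit to 32-bit widen; imm in 0..3 picks 4 bytes."""
    assert len(lanes) == 16 and 0 <= imm <= 3
    return [extend(v, 8, 32, signed) for v in lanes[4 * imm:4 * imm + 4]]

def widen_two_step(lanes, imm, signed):
    """Same result via two chained half-widens (8 -> 16 -> 32 bits)."""
    step1 = widen_half(lanes, imm >= 2, 8, signed)
    return widen_half(step1, imm % 2 == 1, 16, signed)
```

Under these assumptions, `widen_quarter(v, imm, signed)` and `widen_two_step(v, imm, signed)` agree for every `imm`, which is the sense in which a single instruction would replace the current multi-step conversion.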

Mapping to Common Instruction Sets

This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.

x86/x86-64 processors with AVX instruction set

x86/x86-64 processors with SSE4.1 instruction set

ARM64 processors

ARMv7 processors with NEON instruction set

@omnisip omnisip changed the title i64x2.widen_(low/high)_i32x4_(s/u) instructions extensions for 8-bit and 64-bit integers i64x2.widen_(low/high)_i32x4_(s/u) instructions extensions for 8-bit and 16-bit integers Oct 6, 2020
@Maratyszcza
Contributor

This operation doesn't have a direct equivalent in ARM NEON or ARM64.

@omnisip
Author

omnisip commented Oct 31, 2020

I'm still investigating myself, but I have found what seems to be an efficient solution for widening 16 8-bit integers to 16 32-bit integers. You can see the Godbolt here. (Edited to fix a typo in the Godbolt. This is definitely a work in progress.)

@omnisip
Author

omnisip commented Oct 31, 2020

In this Godbolt there are three samples showcasing different implementations on Intel. Interestingly, the shortest implementation underperforms both of the longer ones because pmovzxbd and palignr block the shuffle port. The two longer ones finish in 604 cycles, whereas the shortest one takes 704.

@omnisip
Author

omnisip commented Nov 6, 2020

@Maratyszcza Please take a look at this Godbolt.

Regarding your earlier point about missing ARM instructions, you're right. However, if you look at this analysis you'll see that pmovsxbd and pmovzxbd don't provide any consequential benefit when working with two xmm registers. They only help when going to and from memory, and even then the benefit is limited on machines with only one shuffle port.

With respect to ARM, the 6-instruction pass that yields 4 vectors of 32-bit integers is most efficient in two specific cases:

  1. The goal is to end up with only 4 32-bit vectors (signed or unsigned).
  2. In any and all signed variants.

This is mostly because you don't get the individual 32-bit conversion efficiency seen with the smaller operations, or the benefit of the large unsigned 32-bit cases, where you can reuse the table indices multiple times. If the indices only need to be loaded once, the table transform will outperform in all cases where the output remains unsigned. The signed versus unsigned distinction matters on ARM64 because sshr competes with the tbl operations.
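As a rough illustration of the table transform for the unsigned case, the sketch below models a tbl-style byte lookup in plain Python. It relies on the fact that tbl returns 0 for an out-of-range index, so one index vector per output register both selects the source byte and zero-fills the upper three bytes of each 32-bit lane. The helper names and the choice of 0xFF as the out-of-range index are assumptions for illustration; the linked Godbolt may be organized differently.

```python
ZERO = 0xFF  # any index >= 16 is out of range for a 16-byte table

def table_lookup(table, indices):
    """tbl-style lookup: an out-of-range index produces a zero byte."""
    return [table[i] if i < len(table) else 0 for i in indices]

def zero_extend_indices(quarter):
    """Index vector that widens bytes 4*quarter .. 4*quarter+3 to 32 bits."""
    indices = []
    for lane in range(4):
        # Little-endian lane: source byte first, then three zero bytes.
        indices += [4 * quarter + lane, ZERO, ZERO, ZERO]
    return indices

# Built once and reused for every input vector, which is where the
# table approach recovers its cost.
INDEX_TABLES = [zero_extend_indices(q) for q in range(4)]

def to_u32_lanes(byte_vector):
    """Reinterpret 16 bytes as four little-endian 32-bit lanes."""
    return [int.from_bytes(bytes(byte_vector[i:i + 4]), "little")
            for i in range(0, 16, 4)]

def widen_all_unsigned(src):
    """Four lookups turn 16 bytes into four vectors of four 32-bit lanes."""
    return [to_u32_lanes(table_lookup(src, idx)) for idx in INDEX_TABLES]
```

For example, `widen_all_unsigned(list(range(0x80, 0x90)))` yields four vectors whose lanes are the original byte values, unchanged and zero-extended. The index vectors are constants, so in a loop they would be loaded a single time.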

On the other hand, x64 chips get tremendous performance advantages no matter which way you look at it, and that's despite a large increase in the number of instructions. Going from 6 shuffles (punpck* + pxor) or 7 shuffles (4 pmovsxbd + 3 palignr) to a 4-shuffle solution translates to at least a 30% performance improvement. In real terms, that means 600-700 cycles becoming 409 on Skylake. And if you need to keep the data signed? No problem. LLVM-MCA reports 410 cycles when adding 4 extra psrad instructions per loop.
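The Godbolt itself is not reproduced here, so the following is only one plausible reading of how a shuffle plus psrad produces the signed result: shuffle each source byte into the top byte of its 32-bit lane, then shift right arithmetically by 24 so the sign bit is replicated downward. The sketch models pshufb and psrad semantics in scalar Python; the helper names and the exact control-byte layout are assumptions.

```python
def shuffle_bytes(src, control):
    """pshufb-style shuffle: bit 7 set gives 0, else pick src[control & 15]."""
    return [0 if c & 0x80 else src[c & 0x0F] for c in control]

def top_byte_control(quarter):
    """Control vector placing byte 4*quarter+lane in bits 24..31 of each lane."""
    control = []
    for lane in range(4):
        # Little-endian lane: three zeroed bytes, then the source byte on top.
        control += [0x80, 0x80, 0x80, 4 * quarter + lane]
    return control

def arithmetic_shift_right_32(pattern, count):
    """psrad-style shift of a 32-bit pattern; returns a signed Python int."""
    value = pattern - (1 << 32) if pattern & 0x80000000 else pattern
    return value >> count  # Python's >> on negative ints is arithmetic

def widen_quarter_signed(src, quarter):
    """One shuffle plus one arithmetic shift per output vector."""
    shuffled = shuffle_bytes(src, top_byte_control(quarter))
    lanes = [int.from_bytes(bytes(shuffled[i:i + 4]), "little")
             for i in range(0, 16, 4)]
    return [arithmetic_shift_right_32(lane, 24) for lane in lanes]
```

With `src = [0x00, 0x7F, 0x80, 0xFF] + [1] * 12`, `widen_quarter_signed(src, 0)` returns `[0, 127, -128, -1]`. In this model the signed path costs exactly one extra shift per output vector, which is consistent with the 4 extra psrad instructions per loop mentioned above.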
