Support register-tight use cases #225

bjacob · 2020-05-11T19:50:08Z

This is open-ended. The problem is that many key use cases, such as matrix multiplication kernels, need to know how many SIMD vector registers they can count on using. In practice, the number of available architecture registers tends to be just large enough to hit peak performance, so matrix multiplication kernels tend to use all available registers. Here is an example.

In theory, a language higher-level than raw asm, such as WebAsm, abstracts away this fixed number of architecture registers, offering infinitely many variables instead. In practice, register-intensive SIMD kernels are one area where this abstraction has not been working well. The abstraction relies on spilling registers as necessary, which has only a marginal performance impact on most code but often has a catastrophic impact on register-intensive SIMD kernels (performance degradations > 2x, sometimes 10x).
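To make the register arithmetic concrete, here is a rough sketch of how one might count the vector registers a register-tiled matrix multiplication micro-kernel keeps live. Everything in it is an assumption for illustration: the helper names `live_vector_registers` and `fits_without_spilling` are hypothetical, the kernel shape (float32, 128-bit vectors, one broadcast LHS value at a time) is just one common layout, and real register allocators may need extra temporaries.

```c
#include <stdbool.h>

/* Hypothetical helper, not part of any WebAssembly or engine API.
 * Estimates the number of 128-bit vector registers kept live by a
 * register-tiled matmul micro-kernel, ASSUMING:
 *   - the accumulator tile is `rows` x `col_vectors` vectors,
 *   - all `col_vectors` RHS vectors for the current k step are loaded,
 *   - a single LHS value is broadcast into one vector and reused per row. */
static int live_vector_registers(int rows, int col_vectors) {
    int accumulators = rows * col_vectors; /* the output tile itself */
    int rhs_vectors = col_vectors;         /* RHS slice for this k step */
    int lhs_broadcast = 1;                 /* one splatted LHS scalar */
    return accumulators + rhs_vectors + lhs_broadcast;
}

/* True if the estimated live set fits in `arch_registers` vector
 * registers, i.e. the kernel could in principle avoid spilling. */
static bool fits_without_spilling(int rows, int col_vectors,
                                  int arch_registers) {
    return live_vector_registers(rows, col_vectors) <= arch_registers;
}
```

Under these assumptions, an 8x2 tile needs 19 vector registers: it fits in the 32 vector registers of AArch64 NEON but not in the 16 XMM registers of x86-64 SSE, so the same kernel would spill on one target and not on the other. A WebAssembly module cannot tell which case it is in.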

This prompts a few questions for someone trying to write WebAsm matrix multiplication kernels:

Can the programmer query the number of architecture registers?
Can the programmer make assumptions about the correspondence between the number of SIMD vector variables used in a part of the program, and the register usage of the generated code?

These issues have also severely affected C/C++ with intrinsics, and are the main reason why many people prefer to write assembly instead. However, in C/C++ with intrinsics, at least:

One knows the target architecture.
One can "massage" the compiler into generating the expected code. Compilation is AOT and one gets a chance to look at the generated code before shipping.

I'm afraid that these issues, which are bad enough in C/C++ intrinsics to halfway kill this programming model for critical use cases, will affect WebAsm SIMD even more severely, because the client device and browser are abstracted away and compilation happens in a JIT.

tlively · 2020-05-12T18:55:07Z

> Can the programmer query the number of architecture registers?

No, exposing underlying architectural details would introduce platform-specific behavior and violate WebAssembly's determinism. Although this kind of nondeterminism might be considered for a future proposal, it is out of scope for this SIMD proposal.

> Can the programmer make assumptions about the correspondence between the number of SIMD vector variables used in a part of the program, and the register usage of the generated code?

No, different engines may make different register allocation decisions and may optimize or otherwise transform the code however they deem fit, so programmers should not be making these sorts of assumptions. It may be possible to make assumptions about codegen for a particular engine, but it should not be assumed that those assumptions will generalize to other engines.

The low-level, portable SIMD instructions in this proposal have proven to be useful for a wide variety of workloads, but we are aware that there are also many workloads that depend on non-portable instructions. Keep an eye out for future proposals meant to address this problem.

bjacob changed the title ~~Support use cases that need to target a specific number of registers.~~ Support register-tight use cases May 12, 2020

tlively added the post SIMD MVP label May 12, 2020

XapaJIaMnu mentioned this issue Dec 8, 2020

Compile libmariandecoder to wasm browsermt/marian-dev#6

