This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

Common Subexpression Elimination #403

Open
omnisip opened this issue Dec 14, 2020 · 3 comments

@omnisip

omnisip commented Dec 14, 2020

Hi everyone,

@abrown asked me to look into optimizations for the extended multiplication in #376 that would reduce the number of shuffles before executing pmuldq (the proposed optimization is here). While researching and analyzing the code, it became clear that for every Wasm SIMD instruction that requires a sequence of machine instructions to implement, there will be redundant operations whenever more than one operation uses the same parameters. In the XNNPACK case, the multiplier is reused in each of the subsequent calculations. As a side effect, pshufd can be called repeatedly on the multiplier even though the required intermediate was already generated as part of a previous operation.

What if we implemented something like an LRU cache on register outputs for every instruction that comes through? Something like this has to exist at some level, since the compiler has to keep track of register allocations before it gets to instruction generation. But what if we extended that mechanism so that, for every instruction that is assembled, it first checks whether a previous temporary register already holds the output of that instruction and is still valid?
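
A minimal sketch of what that might look like, assuming a made-up `Key`/`Reg` representation rather than anything an engine actually uses: before assembling an instruction, the emitter consults a small LRU keyed by (opcode, sources, immediate), and entries are invalidated when a register they depend on is overwritten.

```cpp
// Minimal sketch only: `Reg`, `Key`, and the capacity below are assumptions,
// not anything an engine actually uses. The emitter would consult this cache
// before assembling an instruction and reuse the cached temporary on a hit.
#include <cstddef>
#include <cstdint>
#include <list>
#include <map>
#include <optional>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

using Reg = int;  // hypothetical virtual-register id

struct Key {
    std::string opcode;     // e.g. "pshufd"
    std::vector<Reg> srcs;  // source registers
    uint32_t imm;           // immediate (e.g. shuffle control), 0 if unused
    bool operator<(const Key& o) const {
        return std::tie(opcode, srcs, imm) < std::tie(o.opcode, o.srcs, o.imm);
    }
};

class ResultCache {
    static constexpr std::size_t kCapacity = 8;  // assumed small LRU capacity
    std::list<std::pair<Key, Reg>> lru_;         // front = most recently used
    std::map<Key, std::list<std::pair<Key, Reg>>::iterator> index_;

public:
    // Returns the register that already holds this result, if any.
    std::optional<Reg> lookup(const Key& k) {
        auto it = index_.find(k);
        if (it == index_.end()) return std::nullopt;
        lru_.splice(lru_.begin(), lru_, it->second);  // mark as recently used
        return it->second->second;
    }

    // Records that `dst` now holds the result of `k`, evicting the LRU entry.
    void insert(const Key& k, Reg dst) {
        if (auto it = index_.find(k); it != index_.end()) {
            lru_.erase(it->second);
            index_.erase(it);
        }
        lru_.emplace_front(k, dst);
        index_[k] = lru_.begin();
        if (lru_.size() > kCapacity) {
            index_.erase(lru_.back().first);
            lru_.pop_back();
        }
    }

    // Any write to `r` invalidates cached results that read or define it.
    void invalidate(Reg r) {
        for (auto it = lru_.begin(); it != lru_.end();) {
            bool uses = (it->second == r);
            for (Reg s : it->first.srcs) uses = uses || (s == r);
            if (uses) {
                index_.erase(it->first);
                it = lru_.erase(it);
            } else {
                ++it;
            }
        }
    }
};
```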

The benefit of such an optimization could be tremendously better code generation in cases like shuffles, swizzles, widen_high/low, multiplications, and any other scenario we thought might be appropriate for multi-return.

@lars-t-hansen
Contributor

Optimizing compilers should implement this optimization naturally, provided the Wasm SIMD operations (WSOs for short) that require multiple target operations to implement are decomposed early enough for the optimizer to work on the individual pieces.

Currently Firefox will treat most WSOs as indivisible for the purposes of high-level optimizations, including common subexpression elimination; a WSO that is redundant with a previous one that computes the same result may reuse the result of the first. This architecture allows us to ignore the target architecture in the upper levels of the optimizer; shuffles are rewritten as target-specific operations later in the pipeline, for example. But it's well understood that this ignores many optimization opportunities. Constants/masks that are needed in a target-specific way to implement a WSO are loaded into temps for that WSO, and then not reused. And the case @omnisip brings up, where the results of target operations that express subcomputations are not reused when they could be, is similar.

The alternative for us, which I think will be necessary by and by, is to "legalize" the WSOs earlier in the pipeline, i.e. break them down into units of optimization close to the target machine, in particular exposing masks and other subcomputations that have high reuse potential. This is not without problems, as the legalization will remove some high-level context and possibly make it harder to implement some optimizations we implement now. (For example, currently a swizzle with a constant mask is optimized as if it were a shuffle, but that optimization benefits from constant expression evaluation having been performed already.) So this architectural change is not a one-day job, exactly. But as I said, it will become necessary as we reach for better performance, and I believe this will completely subsume any local tweaks such as an LRU cache of subcomputation results.
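
To make the contrast concrete, here is a toy illustration of the "legalize early, then let generic optimizations run" ordering. The node kinds (`kPshufd`, `kPmuldq`) and the lowering are hypothetical stand-ins, not SpiderMonkey's real IR; the point is only that once a WSO is decomposed into hash-consed target-level nodes, the pshufd of a reused multiplier is created once and shared, with no SIMD-specific cache involved.

```cpp
// Toy illustration only: the node kinds and the lowering below are invented,
// not SpiderMonkey's real IR. Because the WSO is decomposed into hash-consed
// target-level nodes before optimization, the pshufd of a reused multiplier
// exists exactly once and both multiplies share it.
#include <cstdint>
#include <cstdio>
#include <map>
#include <tuple>
#include <vector>

enum class Op { kInput, kPshufd, kPmuldq };
using NodeId = int;

struct Graph {
    using Key = std::tuple<Op, NodeId, NodeId, uint32_t>;  // op, inputs, imm
    std::vector<Key> nodes;
    std::map<Key, NodeId> interned;

    // Hash-consing constructor: building an identical node returns the old id.
    NodeId make(Op op, NodeId a = -1, NodeId b = -1, uint32_t imm = 0) {
        Key k{op, a, b, imm};
        auto [it, inserted] = interned.emplace(k, (NodeId)nodes.size());
        if (inserted) nodes.push_back(k);
        return it->second;
    }
};

// Hypothetical lowering of a 32x32 -> 64 extending multiply of the odd lanes:
// move them into the even positions with pshufd, then pmuldq.
NodeId ExtMulHigh(Graph& g, NodeId x, NodeId y) {
    NodeId xs = g.make(Op::kPshufd, x, -1, 0xF5);
    NodeId ys = g.make(Op::kPshufd, y, -1, 0xF5);
    return g.make(Op::kPmuldq, xs, ys);
}

int main() {
    Graph g;
    NodeId a = g.make(Op::kInput, -1, -1, 0);
    NodeId b = g.make(Op::kInput, -1, -1, 1);
    NodeId m = g.make(Op::kInput, -1, -1, 2);  // the reused multiplier
    ExtMulHigh(g, a, m);
    ExtMulHigh(g, b, m);
    // pshufd(m) is shared, so 8 nodes exist instead of 9.
    std::printf("nodes created: %zu\n", g.nodes.size());
    return 0;
}
```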

This brings me to something we touched on at the last SIMD meeting and may need to talk about some more, namely optimizations. I sometimes get a queasy feeling that people who use SIMD believe that they are controlling the generated machine code to a great extent, that the exact operations they write and the order they write them in are preserved in the machine code. (The discussion around the prefetch instruction is symptomatic of this, as is the desire to know how many registers the target architecture has.) It just ain't so. The optimizer has wide latitude to rewrite the code and will often do so as a result of general algorithms, hoisting values out of loops, using registers for temps that the SIMD programmer had not considered, and so on. This is true whether the program is written in C++ and compiled to wasm, or is hand-coded directly in wasm. It is the reason why general optimizations can be brought to bear on the CSE problem and no local SIMD-specific hacks are needed for good code generation. But it also means that there's not an easy way for the SIMD programmer to control the generated machine code.

@omnisip
Author

omnisip commented Dec 15, 2020

As a general rule, I'm in favor of real CSE optimization. Real CSE would allow us to apply accurate cost models both to WSOs and to the underlying assembly that implements them. The framework necessary to implement it would also make it possible to do different kinds of peephole optimizations that take advantage of the more esoteric instructions of a given architecture. Ultimately that could offer the best optimization for every architecture from the most concise, easy-to-read code -- leaving WebAssembly SIMD more in the "what to do" rather than the "how to do it" arena.
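
A hedged sketch of what a cost-model-gated peephole over legalized code might look like; the instruction layout, opcodes, and cost numbers are invented for illustration, and the rewrite only fires when the per-architecture cost table says it is profitable.

```cpp
// Hedged sketch: the instruction layout, opcodes, and cost numbers are made
// up for illustration. The point is that a peephole rewrite over legalized
// code only fires when the per-architecture cost table says it is profitable.
#include <cstdint>
#include <vector>

enum class Op { kMulImm, kShlImm };

struct Insn {
    Op op;
    int dst, src;
    uint32_t imm;
};

struct CostModel {
    int mul_cost;  // cost of an integer multiply on this target
    int shl_cost;  // cost of an immediate left shift on this target
};

// Rewrites dst = src * 2^k into dst = src << k, but only when the target's
// cost model actually favors the shift.
void StrengthReduce(std::vector<Insn>& code, const CostModel& costs) {
    for (Insn& insn : code) {
        if (insn.op != Op::kMulImm) continue;
        uint32_t imm = insn.imm;
        bool power_of_two = imm != 0 && (imm & (imm - 1)) == 0;
        if (!power_of_two || costs.shl_cost >= costs.mul_cost) continue;
        uint32_t k = 0;
        while ((1u << k) != imm) ++k;
        insn = Insn{Op::kShlImm, insn.dst, insn.src, k};
    }
}
```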

However, due consideration needs to be given to cases where we may get it wrong. Many of the libraries we've ported to Wasm have assembly files separate from the C/C++ code that uses intrinsics. The rationale for that often comes from cases where the compiler doesn't handle the intrinsics properly or makes improper optimizations. Does that mean someone would abandon WebAssembly SIMD for performance reasons? I'd sure hope not. But I do wonder what control, and at what level, the programmer should have over the final output.

One thing omitted from the initial post of this thread is that the same XNNPACK example discussed earlier had another set of optimizations that were correct but didn't make it through. @Maratyszcza sought to balance port usage between add and shift-left when doubling values. He made about two-thirds of them adds and one-third left shifts, on the assumption that there would be more add ports than shift ports -- a safe assumption on just about any platform. However, something in the toolchain decided to convert every single one of those adds into left shifts. In this particular case, we can go look at Binaryen or LLVM and see if we can turn that optimization off, but that option doesn't exist for optimizations made by the WebAssembly engine/compiler itself. If an end user's Wasm compiler is set to run with -ffast-math by default, it's unlikely the programmer can do anything to fix a computation, short of switching to fixed-point arithmetic.
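
For illustration only -- this is not XNNPACK's actual code, and the three-value grouping is invented -- the hand-balanced pattern looks roughly like this with SSE2 intrinsics. Both forms compute 2*x per lane, but vector adds and vector shifts typically issue on different execution ports, so a toolchain that canonicalizes one form into the other erases the intended balance.

```cpp
// Illustration only -- not XNNPACK's actual code. Both statements compute
// 2*x per 32-bit lane, but on typical x86 cores vector adds and vector
// shifts issue on different execution ports, so mixing the forms can raise
// throughput. A toolchain that canonicalizes one form into the other
// silently undoes that balance.
#include <emmintrin.h>  // SSE2

static inline void double_three(__m128i v[3]) {
    v[0] = _mm_add_epi32(v[0], v[0]);  // 2*x via add
    v[1] = _mm_add_epi32(v[1], v[1]);  // 2*x via add
    v[2] = _mm_slli_epi32(v[2], 1);    // 2*x via left shift
}
```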

@tlively
Member

tlively commented Dec 15, 2020

He made about two-thirds of them adds and one-third left shifts, on the assumption that there would be more add ports than shift ports -- a safe assumption on just about any platform. However, something in the toolchain decided to convert every single one of those adds into left shifts.

From the sound of it, this is almost certainly an LLVM issue. We have not yet tuned LLVM's cost analysis for WebAssembly SIMD, so it makes a lot of "optimizations" that aren't so helpful in practice. It would be good to talk as a larger group about a principled method for tuning the cost model given the different characteristics of different platforms the code may be run on.
