This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

LLVM Cost Tuning #404

omnisip opened this issue Dec 15, 2020 · 11 comments

omnisip commented Dec 15, 2020

Hi everyone,

I'm creating this ticket to keep Compiler Level Common Subexpression Elimination (#403) separate from LLVM Cost Tuning -- a common annoyance, but one that is very different from how we optimize the implementation of each individual instruction.

@tlively is suggesting we come up with a principled methodology for how we do this, so here's what I'm thinking.

Let's create a Google Sheet with every instruction listed down the left-hand side. Across the top, each column gets a person's name and/or GitHub handle. Then, in each row, they can fill in the reciprocal throughput cost they think each instruction should have.

As an example:

Instruction        Dan Weber (@omnisip)
v128.load          0.25
v128.load8_splat   0.75
v128.load16_splat  0.75
...                ...

Then, if anyone has specific comments about why they costed something a certain way, they can highlight the cell and add a sheet comment. For instance, I'm going to calculate the cost of my load operation based on what I would expect it to cost on x64 and ARM64 with a weighted blend. Since x64 matters more to me -- I use it more than ARM for Wasm SIMD -- I'm going to weight x64 at 75% and ARM64 at 25% and take the weighted average.
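
To make that concrete, here's a minimal sketch of the kind of blend I mean; the per-platform numbers in the usage comment are made-up placeholders, not measurements.

    // Blend per-platform reciprocal throughputs into a single cost.
    // The default weights reflect the 75%/25% split described above.
    double blendCost(double x64Cost, double arm64Cost,
                     double x64Weight = 0.75, double arm64Weight = 0.25) {
      return x64Cost * x64Weight + arm64Cost * arm64Weight;
    }

    // e.g. hypothetical v128.load: 0.25 on x64, 0.5 on ARM64
    // blendCost(0.25, 0.5) == 0.3125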

As a side effect, I'll put a higher cost on i8x16.shr_s and i8x16.shr_u, because no native instruction exists on x64 and a long sequence of instructions is required to emulate the functionality.
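
For reference, here is one common way such an emulation can look with SSE2 intrinsics -- a sketch of the general widen/shift/narrow technique, not necessarily the exact sequence LLVM emits:

    #include <emmintrin.h>

    // i8x16.shr_s by a constant (3 here), emulated on SSE2: duplicate each
    // byte into a 16-bit lane, arithmetic-shift by (count + 8) so the
    // sign-extended byte lands back in the low half, then narrow again.
    __m128i i8x16_shr_s_by_3(__m128i v) {
      __m128i lo = _mm_unpacklo_epi8(v, v);   // b0 b0 b1 b1 ...
      __m128i hi = _mm_unpackhi_epi8(v, v);   // b8 b8 b9 b9 ...
      lo = _mm_srai_epi16(lo, 3 + 8);         // per-lane result fits in [-16, 15]
      hi = _mm_srai_epi16(hi, 3 + 8);
      return _mm_packs_epi16(lo, hi);         // saturating pack is lossless here
    }

That is five instructions, versus a single psraw/psrad for the 16-bit and 32-bit arithmetic shifts, which is why the cost deserves to be higher.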

If people are okay with this idea -- or have better ideas -- by all means, put them here. It might even be possible for us to have more than one cost model for Wasm in Clang if we know that our audience skews toward a certain device or architecture type.

Updated:

Here is the Google Sheet for posting your proposed model: https://docs.google.com/spreadsheets/d/1dYXjm40a2XzhjLGchlcMSdibSRvCvxUEBLHVyUc-Ivo/edit?usp=sharing

Focus first on scoring strategy and methodology, and then on individual instructions. If your proposed methodology is easy enough to calculate, it's possible someone else can help fill in the instruction costs.

omnisip commented Dec 15, 2020

Also -- if this task becomes too tedious -- it might be worth spending 15 minutes on each call (#402) until we've covered the entire instruction set. Unlike CSE, this is far more subjective and shaped by personal interests -- LLVM seems to do a pretty good job right up until it does a bad one. It seems reasonable to expect that we'll agree on weightings for all of the shuffle/swizzle variants that are natively supported on our target platforms, just so LLVM doesn't mess them up anymore.

tlively commented Dec 16, 2020

For instance, I'm going to calculate the cost of my load operation based on what I would expect it to cost on x64 and ARM64 with a weighted blend.

Rather than having everyone supply the costs they think would be reasonable, we should start out by agreeing on how this blending should work. Then we can apply the same blending principle to each instruction without having to separately achieve consensus on a cost for each one.

I propose that LLVM should use the maximum of the x64 and ARM64 costs for each instruction. That way it will prefer code sequences that are ok on both platforms over code sequences that are better on one platform but worse on the other. If that sounds reasonable, the remaining question would be how to generate costs for x64 and ARM64 lowerings of each instruction.
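
In code form, that proposal is just a different combining function -- a one-line sketch:

    #include <algorithm>

    // A lowering only looks cheap if it is cheap everywhere, so take the
    // worst (largest) per-platform cost as the portable cost.
    double portableCost(double x64Cost, double arm64Cost) {
      return std::max(x64Cost, arm64Cost);
    }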

omnisip commented Dec 17, 2020

@tlively your comments aren't lost on me -- if we get stuck deliberating over each individual instruction, this will never get done.

On the flip side, forcing a specific approach to scoring on everyone (e.g. blending -- I was using it as an example, not a requirement) may not yield good results. Lastly, and perhaps most importantly, we're not stuck with having just one cost model -- just one default. We can implement more than one cost-model and likely have them incorporated into LLVM. These would simply be the equivalent of different -mcpu/-mtune settings based on someone's submission. The only gotcha here is that everyone who wants their cost model submitted has to understand that they're responsible for maintaining it.

In the meantime, it's probably not unrealistic to ask the LLVM folks to let us submit more than one merely for testing and evaluation purposes. They'll be understanding of the fact that we don't know what the underlying architecture is, and they may have good insight on how we should select a result.

To clarify my earlier suggestion and address some of your concerns, I've put together a Google Sheet that still goes instruction by instruction but focuses first on scoring strategy rather than on individual instruction performance. It seems most appropriate to debate the strategy first, before the individual instructions, and to let people highlight specific instructions that are worrisome across models. This should keep the debate on the cost models more focused.

The Google Sheet for additions can be found here:

https://docs.google.com/spreadsheets/d/1dYXjm40a2XzhjLGchlcMSdibSRvCvxUEBLHVyUc-Ivo/edit?usp=sharing

tlively commented Dec 17, 2020

We certainly need a default cost model that tries to be agnostic to the underlying platform, so that's what we should think about first.

I'm not yet convinced that it is a good idea to have alternative platform-specific cost models as well. Having those would run counter to our philosophy of trying to present a portable abstraction over native SIMD instruction sets. If I become convinced that platform-specific cost models are a good idea, I would be happier if we had them for both ARM and Intel instruction sets rather than just one or the other.

I would certainly accept any experimental cost models in upstream LLVM as long as they have a clear experimental purpose and as long as the patch author intends to remove them once the experiments are done. For non-experimental cost models, I would want to get the blessing of the full SIMD group before committing to anything.

omnisip commented Dec 17, 2020

Do you think we'll be able to arrive at a new replacement default model without testing and benchmarking?

tlively commented Dec 17, 2020

Testing and benchmarking will probably be useful for coming up with the relative costs of instructions on individual platforms, but I think we should be able to come up with a strategy for combining platform-specific costs into platform-agnostic costs without any benchmarking. That strategy is based more on our high-level goals for the cost model.

@aardappel

Why not just work on estimating two columns, one for x64 and one for arm64, and then average them later?

tlively commented Dec 17, 2020

We should definitely get x64 and arm64 costs and combine them later. It would be good to choose how we combine them in a principled manner, though, rather than arbitrarily choosing to take the average or min or max or whatever.

@Maratyszcza

I would like to note that there is no such thing as an "x86-64 instruction cost" or an "ARM64 instruction cost": each microarchitecture has its own throughputs and latencies. So if the plan is to blend instruction characteristics across several processors, we should start by selecting the relevant microarchitectures first.
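
As a sketch of what that could look like once microarchitectures are picked (the core names below are real, but every number is an invented placeholder, not a measurement):

    #include <algorithm>
    #include <map>
    #include <string>

    // Hypothetical reciprocal-throughput table for a single instruction,
    // keyed by microarchitecture.
    const std::map<std::string, double> i8x16ShrSCost = {
        {"Skylake",    3.0},   // x86-64
        {"Zen 2",      3.0},   // x86-64
        {"Cortex-A76", 0.5},   // ARM64
        {"Apple M1",   0.5},   // ARM64
    };

    // Aggregate within an architecture first (worst case shown here), then
    // combine the per-architecture numbers with whatever rule we settle on.
    const double x64Cost   = std::max(i8x16ShrSCost.at("Skylake"),
                                      i8x16ShrSCost.at("Zen 2"));
    const double arm64Cost = std::max(i8x16ShrSCost.at("Cortex-A76"),
                                      i8x16ShrSCost.at("Apple M1"));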

penzn commented Dec 19, 2020

I propose that LLVM should use the maximum of the x64 and ARM64 costs for each instruction. That way it will prefer code sequences that are ok on both platforms over code sequences that are better on one platform but worse on the other. If that sounds reasonable, the remaining question would be how to generate costs for x64 and ARM64 lowerings of each instruction.

I second that - the "least common denominator" approach makes sense, as we don't know what the code will be running on.

The main wrinkle is that the cost model is not just a mapping of instructions to numeric values - there are layers of logic in various parts of the compiler (the ABI, the vectorizer, instruction selection), including some fairly minute cost calculations in places. This logic is organized differently for different targets, with different "questions" being asked and answered. For example, compare llvm/lib/Target/X86/X86TargetTransformInfo.h and llvm/lib/Target/ARM/ARMTargetTransformInfo.h.

I don't think we would be able to simply merge the X86 and Arm cost models - there are lots of moving parts - but we can probably start by trying to address the things we know about: shuffle masks, comparisons, and so on.
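
To make the "more than a number per instruction" point concrete, here is a standalone sketch - deliberately not the real TargetTransformInfo interface, whose hooks differ per target and LLVM version - of how a shuffle's cost can depend on the mask rather than just on the opcode:

    #include <array>
    #include <cstdint>

    // Classify a 16-byte shuffle mask; the cost depends on the class,
    // not merely on the fact that the instruction is a shuffle.
    enum class ShuffleKind { Identity, Broadcast, Arbitrary };

    ShuffleKind classify(const std::array<uint8_t, 16>& mask) {
      bool identity = true, broadcast = true;
      for (int i = 0; i < 16; ++i) {
        identity  = identity  && (mask[i] == i);
        broadcast = broadcast && (mask[i] == mask[0]);
      }
      if (identity)  return ShuffleKind::Identity;
      if (broadcast) return ShuffleKind::Broadcast;
      return ShuffleKind::Arbitrary;
    }

    // Placeholder costs: a no-op shuffle is free, a broadcast is cheap,
    // and a general byte shuffle costs more. The numbers are invented.
    double shuffleCost(const std::array<uint8_t, 16>& mask) {
      switch (classify(mask)) {
        case ShuffleKind::Identity:  return 0.0;
        case ShuffleKind::Broadcast: return 1.0;
        default:                     return 3.0;
      }
    }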

We can implement more than one cost-model and likely have them incorporated into LLVM. These would simply be the equivalent of different -mcpu/-mtune settings based on someone's submission.

Wouldn't this mean that somebody would be able to post a module online that works well only on a particular device (or class of devices)?

omnisip commented Dec 19, 2020

We can implement more than one cost-model and likely have them incorporated into LLVM. These would simply be the equivalent of different -mcpu/-mtune settings based on someone's submission.

Wouldn't this mean that somebody would be able to post a module online that works well only on a particular device (or class of devices)?

My comment was primarily about allowing competitive evaluation of the models and about making sure we don't end up with something that generates worse optimizations than what LLVM currently produces.
