JIT: Move internal reserved registers to a side table #101647
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
TP regressions seem a little higher than expected (though the larger ones are MinOpts in collections that don't really have any MinOpts contexts). For
Seems mostly inlining related.
For
I'll see if we can do something.
I got rid of an unnecessary hash map lookup in `runtime/src/coreclr/jit/lsraarmarch.cpp` (line 193 at 55d2ada).
I'm inclined to just have the backend always use
Edit: Avoided this internal register by using LR instead, which had large TP improvements for crossgen2 arm64 compilations.
The remaining one is libraries_tests_no_tiered_compilation.run.windows.arm64.Release.mch and looks to be the same as above, i.e. primarily because of some different inlining decisions:
cc @dotnet/jit-contrib, PTAL @kunalspathak

This gets rid of
For the TP hit in some of the other collections, see my comments above. Most of the hit comes from different inlining decisions that we expect to be significantly altered by native PGO anyway, or from MinOpts in collections with very few MinOpts contexts, which thus aren't very representative.
LGTM. Thanks!
This gets rid of `GenTree::gtRsvdRegs` by moving internal registers to a side table. We generally use internal registers very rarely, so making the lookup more costly seems worth the trade-off (especially to make it easier to expand `regMaskTP` to 16 bytes).

There was one exception where we used internal registers a lot: `GT_CALL` for R2R codegen on arm64/arm32. For those nodes we always allocate an internal register to load the target into (the target is obtained by loading the R2R indirection cell that is passed in an argument register).

For arm64 it was simple to avoid this internal register: we can simply use LR always, since that register is going to be overwritten by the call anyway. This results in -2% TP for crossgen2 arm64 just from avoiding building this extra interval. This is also the cause of the asm diffs.

For arm32 the same strategy doesn't work as well because loading into LR is a 4-byte instruction while loading into other registers is a 2-byte instruction. So for arm32 we still use an internal register and take the small throughput hit.

This change reduces JIT memory usage by ~1.5%. The throughput cost (when discounting some spurious inlining decision changes) seems to be around 0.1%.
Paying a bit more to access these seems worth it when it leads to a 10% reduction in the size of `GenTree`. It's very rare for an IR node to have any internal registers allocated.

For arm64 this is the distribution of the number of internal registers allocated for every `GenTree` after LSRA:

Memory diff for arm64: https://www.diffchecker.com/fGt6Abyp/ (around 1.5% less memory used overall, and it makes it cheaper to expand to more than 64 registers in the future)
Contributes to #98258