Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: Move internal reserved registers to a side table #101647

Merged
merged 16 commits into from Apr 29, 2024

Conversation

jakobbotsch
Copy link
Member

@jakobbotsch jakobbotsch commented Apr 27, 2024

Paying a bit more to access these seems worth it when it leads to a 10% reduction in size of GenTree. It's very rare for an IR node to have any internal registers allocated.

For arm64 this is the distribution of number of internal registers allocated for every GenTree after LSRA:

Num internal temps
     <=          0 ===> 16256371 count ( 99% of total)
      1 ..       1 ===>   40149 count ( 99% of total)
      2 ..       2 ===>   11118 count ( 99% of total)
      3 ..       3 ===>   10886 count ( 99% of total)
      4 ..       4 ===>    2201 count ( 99% of total)
      5 ..       5 ===>    1194 count (100% of total)
      6 ..       6 ===>       0 count (100% of total)
      7 ..       7 ===>       0 count (100% of total)
      >          8 ===>       0 count (100% of total)

Memory diff for arm64: https://www.diffchecker.com/fGt6Abyp/ (around 1.5% less memory used overall and makes it cheaper to expand to more than 64 registers in the future)

Contributes to #98258

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Apr 27, 2024
Copy link
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@jakobbotsch
Copy link
Member Author

jakobbotsch commented Apr 27, 2024

TP regressions seem a little higher than expected (though the larger ones are MinOpts in collections that don't really have any MinOpts contexts). For libraries_tests.run the detailed TP breakdown looks as follows for win-arm64:

Base: 687590947913, Diff: 688252896507, +0.0963%

372360965  : +12.49%  : 20.23% : +0.0542% : public: bool __cdecl GenTree::gtHasReg(class Compiler *) const                                                                                                                                     
239930066  : NA       : 13.04% : +0.0349% : public: bool __cdecl GenTree::IsMultiRegCall(void) const                                                                                                                                           
110847788  : +13.25%  : 6.02%  : +0.0161% : public: unsigned __int64 __cdecl GenTree::gtGetRegMask(void) const                                                                                                                                 
101054636  : +3.16%   : 5.49%  : +0.0147% : protected: void __cdecl CodeGen::genProduceReg(struct GenTree *)                                                                                                                                   
100502213  : NA       : 5.46%  : +0.0146% : public: void __cdecl NodeInternalRegisters::Add(struct GenTree *, unsigned __int64)                                                                                                                
89077935   : +8.05%   : 4.84%  : +0.0130% : public: bool __cdecl GenTree::IsMultiRegNode(void) const                                                                                                                                           
56210000   : NA       : 3.05%  : +0.0082% : class CodeGenInterface * __cdecl getCodeGenerator(class Compiler *)                                                                                                                                
42525916   : +12.28%  : 2.31%  : +0.0062% : private: void __cdecl LinearScan::BuildDefs(struct GenTree *, int, unsigned __int64)                                                                                                               
22625186   : NA       : 1.23%  : +0.0033% : public: enum _regNumber_enum __cdecl NodeInternalRegisters::Extract(struct GenTree *, unsigned __int64)                                                                                            
18567447   : +2.96%   : 1.01%  : +0.0027% : private: void __cdecl JitHashTable<struct CORINFO_FIELD_STRUCT_*, struct JitPtrKeyFuncs<struct CORINFO_FIELD_STRUCT_>, class FieldSeq, class CompAllocator, class JitHashTableBehavior>::Grow(void)
15514925   : +2.29%   : 0.84%  : +0.0023% : protected: void __cdecl CodeGen::genCallInstruction(struct GenTreeCall *)                                                                                                                          
13875031   : +2.47%   : 0.75%  : +0.0020% : private: void __cdecl emitter::emitInsLoadStoreOp(enum instruction, enum emitAttr, enum _regNumber_enum, struct GenTreeIndir *)                                                                    
12892530   : +1.03%   : 0.70%  : +0.0019% : private: int __cdecl LinearScan::BuildCall(struct GenTreeCall *)                                                                                                                                   
11865522   : +0.07%   : 0.64%  : +0.0017% : public: void * __cdecl ArenaAllocator::allocateMemory(unsigned __int64)                                                                                                                            
7973960    : NA       : 0.43%  : +0.0012% : public: enum _regNumber_enum __cdecl NodeInternalRegisters::GetSingle(struct GenTree *, unsigned __int64)                                                                                          
6961843    : NA       : 0.38%  : +0.0010% : public: unsigned int __cdecl NodeInternalRegisters::Count(struct GenTree *, unsigned __int64)                                                                                                      
6072251    : +55.03%  : 0.33%  : +0.0009% : public: unsigned int __cdecl GenTree::GetMultiRegCount(class Compiler *) const                                                                                                                     
4151149    : +15.57%  : 0.23%  : +0.0006% : protected: void __cdecl CodeGen::genMultiRegStoreToLocal(struct GenTreeLclVar *)                                                                                                                   
3725060    : +127.26% : 0.20%  : +0.0005% : public: enum var_types __cdecl GenTree::GetRegTypeByIndex(int) const                                                                                                                               
3091240    : +0.60%   : 0.17%  : +0.0004% : private: void __cdecl Compiler::fgInvokeInlineeCompiler(struct GenTreeCall *, class InlineResult *, class InlineContext **)                                                                        
2319037    : +0.61%   : 0.13%  : +0.0003% : public: struct Range __cdecl RangeCheck::ComputeRange(struct BasicBlock *, struct GenTree *, bool)                                                                                                 
-3220492   : -10.27%  : 0.17%  : -0.0005% : protected: void __cdecl CodeGen::genCodeForCpBlkUnroll(struct GenTreeBlk *)                                                                                                                        
-13773686  : -0.38%   : 0.75%  : -0.0020% : public: void __cdecl LinearScan::resolveRegisters<1>(void)                                                                                                                                         
-14291659  : -0.25%   : 0.78%  : -0.0021% : memset                                                                                                                                                                                             
-17480730  : -2.61%   : 0.95%  : -0.0025% : public: struct GenTree * __cdecl Compiler::gtCloneExpr(struct GenTree *)                                                                                                                           
-27597648  : -0.52%   : 1.50%  : -0.0040% : public: void __cdecl LinearScan::resolveRegisters<0>(void)                                                                                                                                         
-38963750  : -4.31%   : 2.12%  : -0.0057% : public: void __cdecl Compiler::compInit(class ArenaAllocator *, struct CORINFO_METHOD_STRUCT_*, class ICorJitInfo *, struct CORINFO_METHOD_INFO *, struct InlineInfo *)                            
-47352683  : -2.20%   : 2.57%  : -0.0069% : public: void __cdecl LinearScan::buildIntervals<1>(void)                                                                                                                                           
-50608682  : -5.57%   : 2.75%  : -0.0074% : public: void __cdecl LinearScan::buildIntervals<0>(void)                                                                                                                                           
-56483652  : -1.51%   : 3.07%  : -0.0082% : private: class RefPosition * __cdecl LinearScan::BuildDef(struct GenTree *, unsigned __int64, int)                                                                                                 
-315652337 : -99.64%  : 17.15% : -0.0459% : public: bool __cdecl GenTreeCall::HasMultiRegRetVal(void) const                                                                                                                                    

Seems mostly inlining related.

@jakobbotsch
Copy link
Member Author

For libraries.crossgen2.windows.arm64 it's definitely more noticeable:

Base: 177447753443, Diff: 177891827092, +0.2503%

174306936 : NA        : 24.22% : +0.0982% : public: void __cdecl NodeInternalRegisters::Add(struct GenTree *, unsigned __int64)                                                                                                                
80864611  : +12.49%   : 11.24% : +0.0456% : public: bool __cdecl GenTree::gtHasReg(class Compiler *) const                                                                                                                                     
49816419  : NA        : 6.92%  : +0.0281% : public: bool __cdecl GenTree::IsMultiRegCall(void) const                                                                                                                                           
47938235  : NA        : 6.66%  : +0.0270% : public: enum _regNumber_enum __cdecl NodeInternalRegisters::GetSingle(struct GenTree *, unsigned __int64)                                                                                          
43693038  : +17.18%   : 6.07%  : +0.0246% : private: void __cdecl JitHashTable<struct CORINFO_FIELD_STRUCT_*, struct JitPtrKeyFuncs<struct CORINFO_FIELD_STRUCT_>, class FieldSeq, class CompAllocator, class JitHashTableBehavior>::Grow(void)
38175511  : +18.28%   : 5.31%  : +0.0215% : public: bool __cdecl GenTree::IsMultiRegNode(void) const                                                                                                                                           
28826741  : +0.67%    : 4.01%  : +0.0162% : public: void * __cdecl ArenaAllocator::allocateMemory(unsigned __int64)                                                                                                                            
24704416  : NA        : 3.43%  : +0.0139% : class CodeGenInterface * __cdecl getCodeGenerator(class Compiler *)                                                                                                                                
21789774  : +13.51%   : 3.03%  : +0.0123% : public: unsigned __int64 __cdecl GenTree::gtGetRegMask(void) const                                                                                                                                 
20644334  : +3.20%    : 2.87%  : +0.0116% : protected: void __cdecl CodeGen::genProduceReg(struct GenTree *)                                                                                                                                   
12455570  : +1.02%    : 1.73%  : +0.0070% : memset                                                                                                                                                                                             
9069883   : +12.10%   : 1.26%  : +0.0051% : private: void __cdecl LinearScan::BuildDefs(struct GenTree *, int, unsigned __int64)                                                                                                               
8769472   : +6.27%    : 1.22%  : +0.0049% : protected: void __cdecl CodeGen::genCallInstruction(struct GenTreeCall *)                                                                                                                          
3195549   : NA        : 0.44%  : +0.0018% : public: enum _regNumber_enum __cdecl NodeInternalRegisters::Extract(struct GenTree *, unsigned __int64)                                                                                            
2973029   : +1.02%    : 0.41%  : +0.0017% : private: int __cdecl LinearScan::BuildCall(struct GenTreeCall *)                                                                                                                                   
2551146   : +0.11%    : 0.35%  : +0.0014% : public: void __cdecl LinearScan::resolveRegisters<1>(void)                                                                                                                                         
2071684   : +1087.25% : 0.29%  : +0.0012% : public: unsigned int __cdecl GenTreeLclVarCommon::GetLclNum(void) const                                                                                                                            
1984988   : +2.00%    : 0.28%  : +0.0011% : private: void __cdecl emitter::emitInsLoadStoreOp(enum instruction, enum emitAttr, enum _regNumber_enum, struct GenTreeIndir *)                                                                    
1177851   : +55.13%   : 0.16%  : +0.0007% : public: unsigned int __cdecl GenTree::GetMultiRegCount(class Compiler *) const                                                                                                                     
1036110   : +0.67%    : 0.14%  : +0.0006% : private: void __cdecl Compiler::fgInvokeInlineeCompiler(struct GenTreeCall *, class InlineResult *, class InlineContext **)                                                                        
1024481   : NA        : 0.14%  : +0.0006% : public: unsigned int __cdecl NodeInternalRegisters::Count(struct GenTree *, unsigned __int64)                                                                                                      
893343    : +15.27%   : 0.12%  : +0.0005% : protected: void __cdecl CodeGen::genMultiRegStoreToLocal(struct GenTreeLclVar *)                                                                                                                   
791266    : +127.25%  : 0.11%  : +0.0004% : public: enum var_types __cdecl GenTree::GetRegTypeByIndex(int) const                                                                                                                               
-2071684  : -100.00%  : 0.29%  : -0.0012% : public: virtual unsigned int __cdecl DefaultPolicy::EstimatedTotalILSize(void) const                                                                                                               
-2965180  : -2.67%    : 0.41%  : -0.0017% : public: struct GenTree * __cdecl Compiler::gtCloneExpr(struct GenTree *)                                                                                                                           
-17124652 : -5.25%    : 2.38%  : -0.0097% : public: void __cdecl Compiler::compInit(class ArenaAllocator *, struct CORINFO_METHOD_STRUCT_*, class ICorJitInfo *, struct CORINFO_METHOD_INFO *, struct InlineInfo *)                            
-21372341 : -3.01%    : 2.97%  : -0.0120% : public: void __cdecl LinearScan::buildIntervals<1>(void)                                                                                                                                           
-22153697 : -2.99%    : 3.08%  : -0.0125% : private: class RefPosition * __cdecl LinearScan::BuildDef(struct GenTree *, unsigned __int64, int)                                                                                                 
-70576480 : -99.23%   : 9.81%  : -0.0398% : public: bool __cdecl GenTreeCall::HasMultiRegRetVal(void) const                                                                                                                                    

I'll see if we can do something.

@jakobbotsch
Copy link
Member Author

jakobbotsch commented Apr 27, 2024

I got rid of an unnecessary hash map lookup in NodeInternalRegisters::Add, but cost is still +0.22%. The node types we most often see with internal registers on arm64 are:

Node types with internal regs
CALL                 : 2953364
STORE_BLK            :  237732
LEA                  :    8792
SWITCH_TABLE         :    6838
CNS_DBL              :    4548
PUTARG_STK           :    4472
CNS_VEC              :    2838
MUL                  :     608
LCLHEAP              :     336
RETURN               :     110
STOREIND             :      78
IND                  :      58
XAND                 :       8
STORE_LCL_VAR        :       6
INDEX_ADDR           :       4
LCL_FLD              :       2

GT_CALL is because we always reserve an extra internal register for R2R calls on arm64 to load the call target:

buildInternalIntRegisterDefForNode(call, candidates);

I'm inclined to just have the backend always use LR for this purpose since that register is being trashed by the call anyway.

Edit: Avoided this internal register by using LR instead, which had large TP improvements for crossgen2 arm64 compilations.

@jakobbotsch
Copy link
Member Author

The remaining one is libraries_tests_no_tiered_compilation.run.windows.arm64.Release.mch and looks to be the same as above, i.e. primarily because of some different inlining decisions:

Base: 13520450455, Diff: 13548034810, +0.2040%

12877010  : +12.50%  : 17.65% : +0.0952% : public: bool __cdecl GenTree::gtHasReg(class Compiler *) const                                                                                                                                     
8785441   : NA       : 12.04% : +0.0650% : public: bool __cdecl GenTree::IsMultiRegCall(void) const                                                                                                                                           
5540860   : NA       : 7.59%  : +0.0410% : public: void __cdecl NodeInternalRegisters::Add(struct GenTree *, unsigned __int64)                                                                                                                
3869963   : +12.68%  : 5.30%  : +0.0286% : public: unsigned __int64 __cdecl GenTree::gtGetRegMask(void) const                                                                                                                                 
3404030   : +3.09%   : 4.66%  : +0.0252% : protected: void __cdecl CodeGen::genProduceReg(struct GenTree *)                                                                                                                                   
3342806   : +7.92%   : 4.58%  : +0.0247% : public: bool __cdecl GenTree::IsMultiRegNode(void) const                                                                                                                                           
2274456   : NA       : 3.12%  : +0.0168% : public: enum _regNumber_enum __cdecl NodeInternalRegisters::Extract(struct GenTree *, unsigned __int64)                                                                                            
1903792   : NA       : 2.61%  : +0.0141% : class CodeGenInterface * __cdecl getCodeGenerator(class Compiler *)                                                                                                                                
1542740   : +12.58%  : 2.11%  : +0.0114% : private: void __cdecl LinearScan::BuildDefs(struct GenTree *, int, unsigned __int64)                                                                                                               
1257943   : +27.27%  : 1.72%  : +0.0093% : private: void __cdecl JitHashTable<struct CORINFO_FIELD_STRUCT_*, struct JitPtrKeyFuncs<struct CORINFO_FIELD_STRUCT_>, class FieldSeq, class CompAllocator, class JitHashTableBehavior>::Grow(void)
948336    : +55.17%  : 1.30%  : +0.0070% : public: unsigned int __cdecl GenTree::GetMultiRegCount(class Compiler *) const                                                                                                                     
890189    : NA       : 1.22%  : +0.0066% : public: unsigned int __cdecl NodeInternalRegisters::Count(struct GenTree *, unsigned __int64)                                                                                                      
739107    : +1.84%   : 1.01%  : +0.0055% : private: int __cdecl LinearScan::BuildCall(struct GenTreeCall *)                                                                                                                                   
637912    : +0.17%   : 0.87%  : +0.0047% : public: void * __cdecl ArenaAllocator::allocateMemory(unsigned __int64)                                                                                                                            
612467    : +15.66%  : 0.84%  : +0.0045% : protected: void __cdecl CodeGen::genMultiRegStoreToLocal(struct GenTreeLclVar *)                                                                                                                   
553196    : +127.27% : 0.76%  : +0.0041% : public: enum var_types __cdecl GenTree::GetRegTypeByIndex(int) const                                                                                                                               
349211    : +0.39%   : 0.48%  : +0.0026% : memset                                                                                                                                                                                             
332442    : +2.69%   : 0.46%  : +0.0025% : private: void __cdecl emitter::emitInsLoadStoreOp(enum instruction, enum emitAttr, enum _regNumber_enum, struct GenTreeIndir *)                                                                    
141203    : +0.55%   : 0.19%  : +0.0010% : protected: void __cdecl CodeGen::genCallInstruction(struct GenTreeCall *)                                                                                                                          
134540    : NA       : 0.18%  : +0.0010% : public: enum _regNumber_enum __cdecl NodeInternalRegisters::GetSingle(struct GenTree *, unsigned __int64)                                                                                          
-74232    : -2.79%   : 0.10%  : -0.0005% : public: struct GenTree * __cdecl Compiler::gtCloneExpr(struct GenTree *)                                                                                                                           
-77517    : -0.01%   : 0.11%  : -0.0006% : private: enum _regNumber_enum __cdecl LinearScan::allocateRegMinimal(class Interval *, class RefPosition *)                                                                                        
-79028    : -10.00%  : 0.11%  : -0.0006% : protected: void __cdecl CodeGen::genUnspillRegIfNeeded(struct GenTree *, unsigned int)                                                                                                             
-158056   : -16.67%  : 0.22%  : -0.0012% : public: enum _regNumber_enum __cdecl GenTree::GetRegByIndex(int) const                                                                                                                             
-420583   : -10.34%  : 0.58%  : -0.0031% : protected: void __cdecl CodeGen::genCodeForCpBlkUnroll(struct GenTreeBlk *)                                                                                                                        
-1232798  : -0.44%   : 1.69%  : -0.0091% : public: void __cdecl LinearScan::resolveRegisters<0>(void)                                                                                                                                         
-1319674  : -17.38%  : 1.81%  : -0.0098% : public: void __cdecl Compiler::compInit(class ArenaAllocator *, struct CORINFO_METHOD_STRUCT_*, class ICorJitInfo *, struct CORINFO_METHOD_INFO *, struct InlineInfo *)                            
-1519495  : -1.10%   : 2.08%  : -0.0112% : private: class RefPosition * __cdecl LinearScan::BuildDef(struct GenTree *, unsigned __int64, int)                                                                                                 
-3536016  : -5.38%   : 4.85%  : -0.0262% : public: void __cdecl LinearScan::buildIntervals<0>(void)                                                                                                                                           
-14124781 : -99.98%  : 19.36% : -0.1045% : public: bool __cdecl GenTreeCall::HasMultiRegRetVal(void) const                                                                                                                                    

@jakobbotsch jakobbotsch marked this pull request as ready for review April 28, 2024 12:45
@jakobbotsch
Copy link
Member Author

jakobbotsch commented Apr 28, 2024

cc @dotnet/jit-contrib PTAL @kunalspathak

Diffs

This gets rid of GenTree::gtRsvdRegs by moving internal registers to a side table. We generally use internal registers very rarely, so I think making the lookup more costly is worth the trade off (especially to make it easier to expand regMaskTP to 16 bytes).
There was one exception where we used internal registers a lot, which was GT_CALL for R2R codegen on arm64/arm32. For those nodes we always allocate an internal register to load the target into (the target is obtained by loading the R2R indirection cell that is passed in an argument register).
For arm64 it was simple to avoid this internal register: we can simply use LR always, since that register is going to be overwritten by the call anyway. This results in -2% TP for crossgen2 arm64 just from avoiding building this extra interval. This is also the cause of the asm diffs.
For arm32 the same strategy doesn't work as well because loading into LR is a 4 byte instruction while loading into other registers is a 2 byte instruction. So for arm32 I left it with the internal register and I propose we just take the throughput hit there (arm32 crossgen2 is probably not the most important target).

For the TP hit in some of the other collections see my comments above. Most of the hit comes from different inlining decisions that we expect to be significantly altered by native PGO anyway, or MinOpts in collections with very few MinOpts contexts that thus aren't very representative.

Copy link
Member

@kunalspathak kunalspathak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

@jakobbotsch jakobbotsch merged commit 42f4c30 into dotnet:main Apr 29, 2024
103 of 111 checks passed
@jakobbotsch jakobbotsch deleted the internal-regs-side-table branch April 29, 2024 18:22
matouskozak pushed a commit to matouskozak/runtime that referenced this pull request Apr 30, 2024
This gets rid of `GenTree::gtRsvdRegs` by moving internal registers to a side
table. We generally use internal registers very rarely, so making the lookup
more costly seems worth the trade off (especially to make it easier to expand
`regMaskTP` to 16 bytes).

There was one exception where we used internal registers a lot, which was
`GT_CALL` for R2R codegen on arm64/arm32. For those nodes we always allocate an
internal register to load the target into (the target is obtained by loading the
R2R indirection cell that is passed in an argument register).

For arm64 it was simple to avoid this internal register: we can simply use LR
always, since that register is going to be overwritten by the call anyway. This
results in -2% TP for crossgen2 arm64 just from avoiding building this extra
interval. This is also the cause of the asm diffs.

For arm32 the same strategy doesn't work as well because loading into LR is a 4
byte instruction while loading into other registers is a 2 byte instruction. So
for arm32 we still use an internal register and take the small throughput hit.

This change reduces JIT memory usage by ~1.5%. The throughput cost (when
discounting some spurious inlining decision changes) seems to be around 0.1%.
michaelgsharp pushed a commit to michaelgsharp/runtime that referenced this pull request May 9, 2024
This gets rid of `GenTree::gtRsvdRegs` by moving internal registers to a side
table. We generally use internal registers very rarely, so making the lookup
more costly seems worth the trade off (especially to make it easier to expand
`regMaskTP` to 16 bytes).

There was one exception where we used internal registers a lot, which was
`GT_CALL` for R2R codegen on arm64/arm32. For those nodes we always allocate an
internal register to load the target into (the target is obtained by loading the
R2R indirection cell that is passed in an argument register).

For arm64 it was simple to avoid this internal register: we can simply use LR
always, since that register is going to be overwritten by the call anyway. This
results in -2% TP for crossgen2 arm64 just from avoiding building this extra
interval. This is also the cause of the asm diffs.

For arm32 the same strategy doesn't work as well because loading into LR is a 4
byte instruction while loading into other registers is a 2 byte instruction. So
for arm32 we still use an internal register and take the small throughput hit.

This change reduces JIT memory usage by ~1.5%. The throughput cost (when
discounting some spurious inlining decision changes) seems to be around 0.1%.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants