Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RISC-V] Use common CodeGen::genHomeRegisterParams #101288

Merged
merged 16 commits into from Apr 27, 2024

Conversation

tomeksowi
Copy link
Contributor

Follow-up to #101114, concludes #100744 for RISC-V.

Part of #84834, cc @dotnet/samsung

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Apr 19, 2024
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Apr 19, 2024
@clamp03 clamp03 added the arch-riscv Related to the RISC-V architecture label Apr 20, 2024
…SC-V

This is what RISC-V did with old ABI classifiers
{
genHomeSwiftStructParameters(/* handleStack */ true);
genHomeSplitStructParameters(/* handleStack */ true);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling genHomeSplitStructParameters only with handleStack=true feels a bit hackish but it does solve the problem and generates the same instructions as previously.

@jakobbotsch Let me know if you think 06e9d65 is the right approach for split structs in general.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling it only for the stack segments makes sense to me. We should be able to do that for Swift as well now that genHomeRegisterParams is sufficiently general (no need to include that change as part of this PR).

I'm surprised you need this very general treatment for RISC-V. This handling is needed for the Swift ABI because we cannot rebuild the struct layout simply by storing parts passed in registers below the parts passed on stack. The layout does not match and there can be intervening return address on the stack for x64.
For our other ABIs (windows arm64 with varargs, arm32, my impression is LA64 too) the stack segments of the split parameters will always be at SP on entry to the function. Spilling the register parts right below that will result in the correct struct layout. So there is no need to move the stack segments around at all. Is it not also the case for RISC-V? The diff in #101340 has some more context that may shed some light.

For example: this test has a split parameter V04 on windows-arm64:

Parameter V00 ABI info: [00..08) reg x0
Parameter V01 ABI info: 2 segments
  [0] [00..08) reg x1
  [1] [08..16) reg x2
Parameter V02 ABI info: 2 segments
  [0] [00..08) reg x3
  [1] [08..16) reg x4
Parameter V03 ABI info: 2 segments
  [0] [00..08) reg x5
  [1] [08..16) reg x6
Parameter V04 ABI info: 2 segments
  [0] [00..08) reg x7
  [1] [08..16) stack @ +00
Parameter V05 ABI info: [00..16) stack @ +08
Parameter V06 ABI info: [00..16) stack @ +24
Parameter V07 ABI info: [00..16) stack @ +40
Parameter V08 ABI info: [00..16) stack @ +56
Parameter V09 ABI info: [00..16) stack @ +72
Parameter V10 ABI info: [00..16) stack @ +88

To reassemble the struct we allocate V04 at the very top of the frame and store just the register part there:

;  V00 VarArgHandle [V00    ] (  1,  1   )    long  ->  [fp+0x20]  do-not-enreg[V] ld-addr-op
;  V01 arg0         [V01    ] (  1,  1   )  struct (16) [fp+0x28]  do-not-enreg[XSA] multireg-arg addr-exposed <NativeVarargTest.TwoLongStruct>
;  V02 arg1         [V02    ] (  1,  1   )  struct (16) [fp+0x38]  do-not-enreg[XSA] multireg-arg addr-exposed <NativeVarargTest.TwoLongStruct>
;  V03 arg2         [V03    ] (  1,  1   )  struct (16) [fp+0x48]  do-not-enreg[XSA] multireg-arg addr-exposed <NativeVarargTest.TwoLongStruct>
;  V04 arg3         [V04    ] (  1,  1   )  struct (16) [fp+0x58]  do-not-enreg[XSA] multireg-arg addr-exposed <NativeVarargTest.TwoLongStruct>
;  V05 arg4         [V05    ] (  1,  1   )  struct (16) [fp+0x68]  do-not-enreg[XS] addr-exposed <NativeVarargTest.TwoLongStruct>
;  V06 arg5         [V06    ] (  1,  1   )  struct (16) [fp+0x78]  do-not-enreg[XS] addr-exposed <NativeVarargTest.TwoLongStruct>
;  V07 arg6         [V07    ] (  1,  1   )  struct (16) [fp+0x88]  do-not-enreg[XS] addr-exposed <NativeVarargTest.TwoLongStruct>
;  V08 arg7         [V08    ] (  1,  1   )  struct (16) [fp+0x98]  do-not-enreg[XS] addr-exposed <NativeVarargTest.TwoLongStruct>
;  V09 arg8         [V09    ] (  1,  1   )  struct (16) [fp+0xA8]  do-not-enreg[XS] addr-exposed <NativeVarargTest.TwoLongStruct>
;  V10 arg9         [V10    ] (  1,  1   )  struct (16) [fp+0xB8]  do-not-enreg[XS] addr-exposed <NativeVarargTest.TwoLongStruct>
;  V11 loc0         [V11    ] (  1,  1   )    long  ->  [fp+0x18]  do-not-enreg[] must-init
;# V12 OutArgs      [V12    ] (  1,  1   )  struct ( 0) [sp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;
; Lcl frame size = 16

G_M48227_IG01:  ;; offset=0x0000
            stp     fp, lr, [sp, #-0x60]!
            stp     x0, x1, [sp, #0x20]
            stp     x2, x3, [sp, #0x30]
            stp     x4, x5, [sp, #0x40]
            stp     x6, x7, [sp, #0x50]     ; extra gunk generated for varargs that we never bothered optimizing
            mov     fp, sp
            str     xzr, [fp, #0x18]    // [V11 loc0]
            str     x0, [fp, #0x20]     // [V00 VarArgHandle]
            str     x1, [fp, #0x28]     // [V01 arg0]
            str     x2, [fp, #0x30]     // [V01 arg0+0x08]
            str     x3, [fp, #0x38]     // [V02 arg1]
            str     x4, [fp, #0x40]     // [V02 arg1+0x08]
            str     x5, [fp, #0x48]     // [V03 arg2]
            str     x6, [fp, #0x50]     // [V03 arg2+0x08]
            str     x7, [fp, #0x58]     // [V04 arg3]               ; register part of split parameter
                                                ;; size=60 bbWeight=1 PerfScore 14.50
G_M48227_IG02:  ;; offset=0x003C

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised you need this very general treatment for RISC-V. This handling is needed for the Swift ABI because we cannot rebuild the struct layout simply by storing parts passed in registers below the parts passed on stack. The layout does not match and there can be intervening return address on the stack for x64.

I may not need it that general for RISC-V, I went for sharing the Swift code because I'm not that familiar with the common parts of the codegen and it did sth similar to what we had for handling of split structs here in genHomeRegisterParams. For prolog of method_1I4S in test struct16args it used to generate a sequence like this:

G_M40687_IG01:  ;; offset=0x0000
            addi           sp, sp, -224
            sd             ra, 0(sp)
            sd             fp, 8(sp)
            addi           fp, sp, 8
            addi           t0, fp, 72
            sd             zero, 8(t0)
            sd             zero, 0(t0)
            addi           t0, t0, 16
            sd             zero, 8(t0)
            sd             zero, 0(t0)
            addi           t0, t0, 16
            sd             zero, 8(t0)
            sd             zero, 0(t0)
            addi           t0, t0, 16
            sd             zero, 8(t0)
            sd             zero, 0(t0)
            addi           t0, t0, 16
            sd             zero, 0(t0)
            sw             a0, 212(fp)
            sd             a1, 192(fp)
            sd             a2, 200(fp)
            sd             a3, 176(fp)
            sd             a4, 184(fp)
            sd             a5, 160(fp)
            sd             a6, 168(fp)
            sd             a7, 144(fp)  # first part of struct
            ld             t0, 224(sp)  # loading second part of struct
            sd             t0, 152(fp)  # and placing it after a7

The common genHomeRegisterParams doesn't generate the last load-store.

For our other ABIs (windows arm64 with varargs, arm32, my impression is LA64 too) the stack segments of the split parameters will always be at SP on entry to the function. Spilling the register parts right below that will result in the correct struct layout. So there is no need to move the stack segments around at all. Is it not also the case for RISC-V? The diff in #101340 has some more context that may shed some light.

Hm, that would be reconstructing the struct cross-frame, interesting. If I understand correctly, "Spilling the register parts right below" in the above asm example would be sd a7 216(sp) instead of sd a7, 144(fp) (and the following load-store)? If so, I need to check if it's possible to have a contiguous space for the struct that way. Where are the spilling instructions normally generated for other archs?

For example: this test has a split parameter V04 on windows-arm64:

Varargs are now unsupported on RISC-V but the above quoted test is similar.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, "Spilling the register parts right below" in the above asm example would be sd a7 216(sp) instead of sd a7, 144(fp) (and the following load-store)? If so, I need to check if it's possible to have a contiguous space for the struct that way. Where are the spilling instructions normally generated for other archs?

Right, exactly. It would be possible if the stack segment is always guaranteed to start at sp+0 when the split happens. I think the calling conventions are generally designed to allow this because it makes handling of varargs possible (the callee can spill all parameters and know that everything is laid out contiguously on stack).

arm64 handles this in genHomeRegisterParams; it falls out naturally due to how things work out. arm32 has its own mechanism calls "prespills" where it will push registers on entry to the function to make this the case. It is effectively the same thing but represented a different way in the JIT, so I would like to unify it with arm64 at some point (arm32 was the first non-x86-x64 target, so it has a lot of special behavior. We have been working to get rid of it.)

lvaAssignVirtualFrameOffsetsToArgs is responsible for assigning these offsets. It allocates offsets relative to the "virtual 0", which is the first offset of space passed on the stack by the caller. For your split parameter I think it should already be allocating it at address -8, which is what is expected to handle the split properly without moving the stack segment. I am not sure what ends up overwriting this assignment. (FWIW I am rewriting it in #101340 to be based on the new ABI info, so it may be simpler to read the diff there. That PR will definitely allocate your split parameter at -8.)

Varargs are now unsupported on RISC-V but the above quoted test is similar.

Right, I didn't mean to imply that this was a varargs thing, it was just an example (win-arm64 splits parameters, but only if there is also varargs. arm32 splits parameters always, so I could have used some cases from there as an example too).

Copy link
Contributor Author

@tomeksowi tomeksowi Apr 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lvaAssignVirtualFrameOffsetsToArgs is responsible for assigning these offsets. It allocates offsets relative to the "virtual 0", which is the first offset of space passed on the stack by the caller. For your split parameter I think it should already be allocating it at address -8, which is what is expected to handle the split properly without moving the stack segment. I am not sure what ends up overwriting this assignment.

lvaAssignVirtualFrameOffsetsToArgs looks healthy, V04 is at -8 like you expected. But then looks like lvaAllocLocalAndSetVirtualOffset called here is setting the stack offset to -0x48(-72) which ends up being 144 after delta bump.

*************** In lvaAssignFrameOffsets(FINAL_FRAME_LAYOUT)
Set V04 to offset -8
Assign V00 arg0, size=4, stkOffs=-0x4
Pad V01 arg1, size=16, stkOffs=-0x8, pad=4
Assign V01 arg1, size=16, stkOffs=-0x18
Assign V02 arg2, size=16, stkOffs=-0x28
Assign V03 arg3, size=16, stkOffs=-0x38
Assign V04 arg4, size=16, stkOffs=-0x48
Assign V05 loc0, size=4, stkOffs=-0x4c
Assign V06 loc1, size=4, stkOffs=-0x50
Assign V07 loc2, size=4, stkOffs=-0x54
Assign V08 loc3, size=4, stkOffs=-0x58
Assign V09 loc4, size=4, stkOffs=-0x5c
Assign V10 loc5, size=4, stkOffs=-0x60
Assign V11 loc6, size=4, stkOffs=-0x64
Assign V12 loc7, size=4, stkOffs=-0x68
Assign V13 loc8, size=4, stkOffs=-0x6c
Assign V14 loc9, size=4, stkOffs=-0x70
Assign V15 loc10, size=4, stkOffs=-0x74
Assign V16 loc11, size=4, stkOffs=-0x78
Assign V17 loc12, size=4, stkOffs=-0x7c
Assign V18 loc13, size=4, stkOffs=-0x80
Assign V19 loc14, size=4, stkOffs=-0x84
Assign V20 loc15, size=4, stkOffs=-0x88
Assign V21 loc16, size=4, stkOffs=-0x8c
Assign V22 loc17, size=4, stkOffs=-0x90
Assign V24 tmp1, size=4, stkOffs=-0x94
Assign V25 tmp2, size=4, stkOffs=-0x98
Assign V26 tmp3, size=4, stkOffs=-0x9c
Assign V27 tmp4, size=4, stkOffs=-0xa0
Assign V28 tmp5, size=4, stkOffs=-0xa4
Assign V29 tmp6, size=4, stkOffs=-0xa8
Assign V30 tmp7, size=4, stkOffs=-0xac
Assign V31 tmp8, size=4, stkOffs=-0xb0
Assign V32 tmp9, size=4, stkOffs=-0xb4
Assign V33 tmp10, size=4, stkOffs=-0xb8
Assign V34 tmp11, size=4, stkOffs=-0xbc
Assign V35 tmp12, size=4, stkOffs=-0xc0
Assign V36 tmp13, size=4, stkOffs=-0xc4
Assign V37 tmp14, size=4, stkOffs=-0xc8
Assign V38 tmp15, size=4, stkOffs=-0xcc
Assign V39 tmp16, size=4, stkOffs=-0xd0
--- delta bump 216 for FP frame
--- virtual stack offset to actual stack offset delta is 216
-- V00 was -4, now 212
-- V01 was -24, now 192
-- V02 was -40, now 176
-- V03 was -56, now 160
-- V04 was -72, now 144

Not sure if V04 shouldn't fall into one of the continue statements before lvaAllocLocalAndSetVirtualOffset is called...

EDIT: ... maybe here?

if (varDsc->lvIsParam)
{
#ifdef TARGET_ARM64
if (info.compIsVarArgs && varDsc->lvIsRegArg &&
(varDsc->GetArgReg() != theFixedRetBuffReg(info.compCallConv)))
{
// Stack offset to varargs (parameters) should point to home area which will be preallocated.
const unsigned regArgNum = genMapIntRegNumToRegArgNum(varDsc->GetArgReg(), info.compCallConv);
varDsc->SetStackOffset(-initialStkOffs + regArgNum * REGSIZE_BYTES);
continue;
}
#endif

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do see other stuff allocated on the frame first, before the parameters, so the frame bump can probably not be done in the loop like I suggest and instead needs handling before we go into the loop over the locals.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably need to add some handling in lvaParamHasLocalStackSpace, to return false for the split case. But you will also need to make sure it allocates space for the register segments in the caller for this case (I think you can use lvaIncrementFrameSize).

Makes sense, thank you. I'll try to do that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Putting the split argument at the top of the frame requires moving everything down the stack, including stack pointer and return address, basically what arm does for varargs (for RISC-V we could limit that only to the last argument register). This is a bit more involved than I anticipated. Since this is just an optimization, I think it's better to revisit it after #101340 is merged to have a cleaner slate to work on.

Meanwhile, could we review & merge this PR as is (temporarily using Swift code for split arg reconstruction)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's more involved than anticipated then I think it's fine to just go with this approach permanently. No need to spend more effort to make the split parameters contiguous on my behalf.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries. I like the idea but there's still a bit work to do on RISC-V before we get to optimizations like this.

@tomeksowi
Copy link
Contributor Author

CLR tests are running, I'll undraft when they pass.

Comment on lines 5558 to 5563
#ifdef TARGET_RISCV64
if (compiler->fgFirstBBisScratch() && compiler->lvaHasAnyStackParamToReassemble())
{
genHomeStackParams(/* handleStack */ true);
}
#endif
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The scratch BB is not necessary when it happens during the prolog.

I think it would be better to do this before you home register parameters (in the non-OSR case above), and passing initReg/initRegZeroed as well so there is no chance of overwriting a homed register.

@@ -4224,7 +4227,7 @@ void CodeGen::genHomeSwiftStructParameters(bool handleStack)
}
}
}
#endif
#endif // defined(SWIFT_SUPPORT) || defined(TARGET_RISCV64)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this going to result in a lot of extra stack stores for RISCV?
At this point it seems like just factoring parts of this helper out and calling it from the same place as genHomeRegisterParams on the split stack segment will end up being a simpler change overall.

Copy link
Contributor Author

@tomeksowi tomeksowi Apr 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It didn't make unnecessary stores from what I examined the diffs but you're right, a RISC-V specific routine (I tried to phrase it so LoongArch can also reuse) turned out much handier.

@tomeksowi tomeksowi marked this pull request as ready for review April 26, 2024 14:43
Comment on lines 3143 to 3146
if (isSpilled)
{
genHomeStackPartOfSplitParameter(lclNum);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move this handling outside genHomeRegisterParams, before it is called? Prolog generation becomes much more constrained after genHomeRegisterParams has run, and this handling doesn't have anything to do with moving registers to their initial locations. I would also switch to use initReg instead of REG_SCRATCH for the same reason (using REG_SCRATCH wouldn't work on other platforms at the point this runs, for example. I guess it works because RISCV64 never allocates REG_SCRATCH for anything, but on other platforms that's not a valid register to clobber after genHomeRegisterParams.).

You can factor the code in the stack case from genHomeSwiftStructParameters to avoid duplicating that logic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thinking was split parameter homing can happen can only if there are some argument registers to home (otherwise the parameter wouldn't be split), but ok, it can also be outside.

Co-authored-by: Jakob Botsch Nielsen <Jakob.botsch.nielsen@gmail.com>
Copy link
Member

@jakobbotsch jakobbotsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you! @shushanhf I imagine you'll want to enable this for LA64 too.

Comment on lines +4162 to +4163
int loadOffset =
-(isFramePointerUsed() ? genCallerSPtoFPdelta() : genCallerSPtoInitialSPdelta()) + (int)seg.GetStackOffset();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: for the future, we generally restrict our ternary use to very simple cases. I think the existing code was easier to read.
https://github.com/dotnet/runtime/blob/main/docs/coding-guidelines/clr-jit-coding-conventions.md#1519-ternary-operators

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

modified it within #101656

@jakobbotsch jakobbotsch merged commit 81c1d87 into dotnet:main Apr 27, 2024
107 checks passed
@shushanhf
Copy link
Contributor

LGTM. Thank you! @shushanhf I imagine you'll want to enable this for LA64 too.

@jakobbotsch @tomeksowi
Thanks.
I will test it on LA64.

matouskozak pushed a commit to matouskozak/runtime that referenced this pull request Apr 30, 2024
* Code review from dotnet#101114

* Use common genHomeRegisterParams on RISC-V

* Make passSlot integer-only because we know hardware floating-point calling convention passes in registers only

* Make a RISC-V specific routine for homing stack parts of split parameters.

* Move genHomeStackPartOfSplitParameter out of genHomeSwiftStructParameters, share stack segment homing with Swift code
michaelgsharp pushed a commit to michaelgsharp/runtime that referenced this pull request May 9, 2024
* Code review from dotnet#101114

* Use common genHomeRegisterParams on RISC-V

* Make passSlot integer-only because we know hardware floating-point calling convention passes in registers only

* Make a RISC-V specific routine for homing stack parts of split parameters.

* Move genHomeStackPartOfSplitParameter out of genHomeSwiftStructParameters, share stack segment homing with Swift code
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arch-riscv Related to the RISC-V architecture area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants