New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement ARM32 atomic intrinsics #97792
base: main
Are you sure you want to change the base?
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch |
Diff results for #97792Assembly diffsAssembly diffs for linux/arm ran on windows/x86Diffs are based on 2,238,105 contexts (829,328 MinOpts, 1,408,777 FullOpts). MISSED contexts: base: 71,274 (3.08%), diff: 72,559 (3.14%) Overall (-122,174 bytes)
MinOpts (+532 bytes)
FullOpts (-122,706 bytes)
Details here Throughput diffsThroughput diffs for linux/arm ran on windows/x86Overall (-0.41% to -0.04%)
MinOpts (-0.10% to -0.06%)
FullOpts (-0.43% to -0.04%)
Details here |
Diff results for #97792Assembly diffsAssembly diffs for linux/arm ran on windows/x86Diffs are based on 2,238,105 contexts (829,328 MinOpts, 1,408,777 FullOpts). MISSED contexts: base: 71,274 (3.08%), diff: 72,559 (3.14%) Overall (-141,104 bytes)
MinOpts (+532 bytes)
FullOpts (-141,636 bytes)
Details here Throughput diffsThroughput diffs for linux/arm ran on windows/x86Overall (-0.41% to -0.04%)
MinOpts (-0.10% to -0.06%)
FullOpts (-0.42% to -0.04%)
Details here |
Just a heads up. The tests fail on device. I am looking into it with Michal to see if we can find the root cause. |
1d24a02 fixed most of the failures, there's still incorrect sign extension of |
Diff results for #97792Throughput diffsThroughput diffs for linux/arm64 ran on windows/x64MinOpts (+0.00% to +0.01%)
Throughput diffs for osx/arm64 ran on windows/x64MinOpts (+0.00% to +0.01%)
Details here |
Diff results for #97792Assembly diffsAssembly diffs for linux/arm ran on windows/x86Diffs are based on 2,237,693 contexts (829,328 MinOpts, 1,408,365 FullOpts). MISSED contexts: base: 71,275 (3.08%), diff: 72,560 (3.14%) Overall (-71,018 bytes)
MinOpts (+2,774 bytes)
FullOpts (-73,792 bytes)
Details here Throughput diffsThroughput diffs for linux/arm ran on windows/x86Overall (-0.41% to -0.04%)
MinOpts (-0.10% to -0.06%)
FullOpts (-0.42% to -0.04%)
Details here |
Diff results for #97792Assembly diffsAssembly diffs for linux/arm ran on windows/x86Diffs are based on 2,237,693 contexts (829,328 MinOpts, 1,408,365 FullOpts). MISSED contexts: base: 71,275 (3.08%), diff: 72,560 (3.14%) Overall (-71,018 bytes)
MinOpts (+2,774 bytes)
FullOpts (-73,792 bytes)
Details here Throughput diffsThroughput diffs for osx/arm64 ran on windows/x64MinOpts (+0.00% to +0.01%)
Throughput diffs for windows/arm64 ran on windows/x64MinOpts (+0.00% to +0.01%)
Details here Throughput diffs for linux/arm ran on windows/x86Overall (-0.41% to -0.04%)
MinOpts (-0.10% to -0.06%)
FullOpts (-0.42% to -0.04%)
Details here |
Diff results for #97792Assembly diffsAssembly diffs for linux/arm ran on windows/x86Diffs are based on 2,238,104 contexts (829,328 MinOpts, 1,408,776 FullOpts). MISSED contexts: base: 71,275 (3.08%), diff: 72,560 (3.14%) Overall (-70,778 bytes)
MinOpts (+2,774 bytes)
FullOpts (-73,552 bytes)
Details here Throughput diffsThroughput diffs for linux/arm ran on windows/x86Overall (-0.41% to -0.04%)
MinOpts (-0.10% to -0.06%)
FullOpts (-0.42% to -0.04%)
Details here |
Diff results for #97792Assembly diffsAssembly diffs for linux/arm ran on windows/x86Diffs are based on 2,238,104 contexts (829,328 MinOpts, 1,408,776 FullOpts). MISSED contexts: base: 71,275 (3.08%), diff: 72,560 (3.14%) Overall (-70,778 bytes)
MinOpts (+2,774 bytes)
FullOpts (-73,552 bytes)
Details here Throughput diffsThroughput diffs for linux/arm ran on windows/x86Overall (-0.41% to -0.04%)
MinOpts (-0.10% to -0.06%)
FullOpts (-0.42% to -0.04%)
Details here |
Passes the tests on Raspberry Pi 5 now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's an unnecessary sign extension of the value
parameter in CompareExchange
(short
):
0004e2c0 <System.Threading.Interlocked__CompareExchange_0>:
4e2c0: b508 push {r3, lr}
4e2c2: b20b sxth r3, r1
4e2c4: b212 sxth r2, r2
4e2c6: f3bf 8f5f dmb sy
4e2ca: e8d0 ef5f ldrexh lr, [r0]
4e2ce: fa0f fe8e sxth.w lr, lr
4e2d2: 4596 cmp lr, r2
4e2d4: d103 bne.n 4e2de <System.Threading.Interlocked__CompareExchange_0+0x1e>
4e2d6: e8c0 3f51 strexh r1, r3, [r0]
4e2da: 2900 cmp r1, #0
4e2dc: d1f5 bne.n 4e2ca <System.Threading.Interlocked__CompareExchange_0+0xa>
4e2de: f3bf 8f5f dmb sy
4e2e2: 4670 mov r0, lr
4e2e4: bd08 pop {r3, pc}
It's not really a correctness issue though. Otherwise, LGTM. Thanks!
Diff results for #97792Assembly diffsAssembly diffs for linux/arm ran on windows/x86Diffs are based on 2,268,941 contexts (836,977 MinOpts, 1,431,964 FullOpts). MISSED contexts: base: 75,116 (3.20%), diff: 76,443 (3.26%) Overall (-71,712 bytes)
MinOpts (+2,798 bytes)
FullOpts (-74,510 bytes)
Details here Throughput diffsThroughput diffs for linux/arm ran on windows/x86Overall (-0.42% to -0.04%)
MinOpts (-0.10% to -0.06%)
FullOpts (-0.44% to -0.04%)
Details here |
This should be ready to be merged now with #97902 merged. |
I retested the latest version on Raspberry Pi. |
@dotnet/jit-contrib |
Related to #99011. |
Ping @kunalspathak again for a code review. |
@@ -82,20 +82,29 @@ bool Lowering::IsContainableImmed(GenTree* parentNode, GenTree* childNode) const | |||
{ | |||
case GT_ADD: | |||
case GT_SUB: | |||
#ifdef TARGET_ARM64 | |||
#ifdef TARGET_ARM |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For better readability, I would suggest to do the following:
#ifdef TARGET_ARM
case GT_CMPXCHG:
return emitter::emitIns_valid_imm_for_cmp(immVal, flags);
case GT_XADD:
return emitter::emitIns_valid_imm_for_add(immVal, flags);
#else
case GT_CMPXCHG:
case GT_XADD:
case GT_LOCKADD:
case GT_XORR:
case GT_XAND:
return comp->compOpportunisticallyDependsOn(InstructionSet_Atomics)
? false
: emitter::emitIns_valid_imm_for_add(immVal, size);
#endif // TARGET_ARM
srcCount = tree->gtGetOp2()->isContained() ? 1 : 2; | ||
|
||
buildInternalIntRegisterDefForNode(tree); | ||
if (tree->OperGet() != GT_XCHG) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make the check explicit just to make sure we do not define the internal register for nodes that don't need them in future.
if (tree->OperGet() != GT_XCHG) | |
if (tree->OperGet() == GT_XADD) |
|
||
emitAttr dataSize = emitActualTypeSize(data); | ||
|
||
regNumber tempReg = treeNode->ExtractTempReg(RBM_ALLINT); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We now use new data structure to extract the temp register in #101647. Please update it accordingly.
|
||
gcInfo.gcMarkRegPtrVal(addrReg, addr->TypeGet()); | ||
|
||
// Emit code like this: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this is done for other platforms as well, but not sure why this cannot be done in lower. The codegen should just emit the code instead of inserting the loop like code here.
// Arguments: | ||
// treeNode - the GT_XADD/XCHG node | ||
// | ||
void CodeGen::genLockedInstructions(GenTreeOp* treeNode) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The genLocked*
methods added here looks similar to the one in arm64. can the logic be shared and have just a single method in codegenarmarch.cpp
?
Implements Interlocked Exchange, Add and CompareExchange as "must expand" intrinsics on ARM32 for types equal to or smaller than pointer size.
Contributes to #86915.
Fixes #9982.