Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: cranelift: omit function prologue and epilogue on x64 if possible #8516

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

tertsdiepraam
Copy link
Contributor

@tertsdiepraam tertsdiepraam commented May 1, 2024

This makes cranelift omit the function prologue and epilogue on the x64 architecture if possible, like it already does for aarch64.

However, this is not quite right yet, because (if I understand correctly) the is_leaf computation is wrong for x64 when SIMD instructions are involved. With this change, this function gives a segfault:

function %swizzle_i8x16(i8x16, i8x16) -> i8x16 {
block0(v0: i8x16, v1: i8x16):
    v2 = swizzle v0, v1
    return v2
}

I think this happens because it generated this assembly:

pushq   %rbp
movq    %rsp, %rbp
paddusb 0x14(%rip), %xmm1
movabsq $0, %r9
callq   *%r9
movq    %rbp, %rsp
popq    %rbp
retq
addb    %al, (%rax)
jo      0x92
jo      0x94
jo      0x96
jo      0x98
jo      0x9a
jo      0x9c
jo      0x9e
jo      0xa0

Note that there's a callq in there, which means that this is not a leaf function, but according to the architecture-independent is_leaf function of FunctionStencil it looks like a leaf function, because that only considers calls in CLIF.

With ssse3 the call disappears and the test passes.

This means that I need some advice on how to proceed. I could try to make the is_leaf function architecture-dependent or conservatively treat (some?) SIMD instructions as calls in general. Or maybe the information that I need is already computed somewhere and I just need to pass that in? Or I could remove this optimization if ssse3 or later is not supported, but that doesn't seem great 😄

This replaces #5352 and is a follow up of #2960. I'm not sure why I'm getting to the conclusion that SIMD is the issue, but it wasn't mentioned before. Maybe there are other issues as well. Has the testing setup become more robust in the meantime?

@github-actions github-actions bot added cranelift Issues related to the Cranelift code generator cranelift:area:x64 Issues related to x64 codegen labels May 1, 2024
@jameysharp
Copy link
Contributor

Neat! I was thinking about working on this soon, because the work that Trevor and I have been doing on cleaning up tail calls has fixed some of the issues that would have made this even harder. But it's even cooler to see somebody else tackling this and I would love to see you continue.

One thing I notice is that cranelift/filetests/filetests/isa/x64/simd-lane-access-compile.clif turns on SSSE3, so the case you're running into isn't represented in our compile tests. That would be good to fix.

After deleting has_ssse3 has_sse41 from that file and re-running the filetests with CRANELIFT_TEST_BLESS=1, I see that this call is a lib-call to X86Pshufb.

That's not the only lib-call we may emit on x86; there are also functions implementing floating-point fused multiply-add (FmaF32/64) and various floating-point rounding modes (Ceil/Floor/Nearest/Trunc), for older CPUs without the corresponding instructions.

So this isn't specifically a SIMD issue, and it doesn't apply to most SIMD instructions either. But I'm also not sure where the segfault you saw came from. Off hand, I don't know why omitting a frame pointer would cause a function call to segfault. So more investigation is in order, I think!

@tertsdiepraam
Copy link
Contributor Author

tertsdiepraam commented May 2, 2024

After deleting has_ssse3 has_sse41 from that file and re-running the filetests with CRANELIFT_TEST_BLESS=1, I see that this call is a lib-call to X86Pshufb.

Yes, that's what I found too.

That's not the only lib-call we may emit on x86; there are also functions implementing floating-point fused multiply-add (FmaF32/64) and various floating-point rounding modes (Ceil/Floor/Nearest/Trunc), for older CPUs without the corresponding instructions.

My (unsubstantiated) guess is that the X86Pshufb function is more complex than those other lib-calls and therefore causes trouble. Replacing the lib call with a custom function in the final x86 does not seem to cause any trouble, so it's not any call that leads to a segfault.

Is this the code for the libcall?

pub extern "C" fn x86_pshufb(a: __m128i, b: __m128i) -> __m128i {

At the top of that file, it says:

They must only contain basic, raw i32/i64/f32/f64/pointer parameters that are safe to pass across the system ABI.

Maybe part of the problem is that x86pshufb requires __m128i as arguments? Or that it requires bumping some things on the stack while the others don't? If I throw that libcall in godbolt, it does show that it pushes some thing to the stack, which might then be at the wrong location. I don't really know, I'm just guessing at this point.

@afonso360
Copy link
Contributor

afonso360 commented May 2, 2024

From what I remember when I looked into this a year or so ago was that this was an issue with stack alignment. I think that the libcalls expect a 16byte aligned stack on entry (due to ABI) and will use aligned loads and stores on the stack directly, and we weren't respecting that when omitting the prologue.

Take this with a grain of salt, since it has been a while since I last looked at this.

Edit: I should also note that not all libcalls will come from wasmtime/crates/wasmtime/src/runtime/vm/libcalls.rs. They can be provded by the system's libc or whatever the linker decides, in cases where cranelift isn't used with wasmtime.

@tertsdiepraam
Copy link
Contributor Author

Interesting! @afonso360, do you know of a way that I could check that hypothesis?

I should also note that not all libcalls will come from wasmtime/crates/wasmtime/src/runtime/vm/libcalls.rs. They can be provded by the system's libc or whatever the linker decides, in cases where cranelift isn't used with wasmtime.

Good to know! Thanks!

I wonder if there's some easy way that we can enable this for everything that doesn't involve libcalls? Is there a way to check whether a function will contain libcalls?

@afonso360
Copy link
Contributor

afonso360 commented May 2, 2024

Well, I'd check if the code segfaults in a 16byte aligned load inside the libcall that is referencing the stack and that, that address is under-aligned (i.e. only aligned to 8 bytes or less).

And additionally the stack at the entrypoint of that libcall was only aligned to 8 (this by itself is an ABI breakage).

You should be able to do this by running the test under gdb/lldb and setting a breakpoint in the libcall function and checking what the stack pointer is, and where it first crashes, etc...

It would also be nice if we had a libcall that didn't use any SSE instructions or referenced the stack, that would be also another clue that our issue was with that, but we don't.


I wonder if there's some easy way that we can enable this for everything that doesn't involve libcalls? Is there a way to check whether a function will contain libcalls?

Unfortunately I don't think so. Since these libcalls only end up being placed during lowering if we don't have a direct instruction for that, it means that any instruction could end up being lowered to a libcall. In practice there are only a few situations where we do that.

One way that came to mind was to complete #7321 and with the legalizations in place, we could simply check if there is a call opcode in the function since any opcode that we couldn't implement would be legalized into a call.

Otherwise, I'm not too sure, but people might have some ideas.

@jameysharp
Copy link
Contributor

We compute the frame layout after lowering to vcode, so I think we can make the decision about whether this is a leaf function after we've already made all the decisions about whether to use any libcalls. We "just" need to thread that information through correctly somehow.

@tertsdiepraam
Copy link
Contributor Author

That sounds reasonable. I'll give that a shot!

@jameysharp
Copy link
Contributor

Oh, I also wanted to note: Good points, both of you, on stack alignment and vector arguments. I hadn't thought about either of those things and I think together they fully explain this symptom. I want to write down this conclusion for future reference although I don't think either of you need the explanation 😁

Pushing the return address leaves the stack 8-aligned, and normally pushing the frame pointer would make it 16-aligned again, but we skip that step here. Then, most x86 code can actually tolerate 8-aligned stacks, but moving between MMX registers and memory generally requires 16-alignment, and indeed it looks like Rust/LLVM are producing movdqa instructions with 16-aligned offsets from the stack pointer in this particular libcall.

So that explains why you'd only see problems with this specific call. We still should ensure that all libcalls get the ABI-required stack alignment though, so the plan to have any function with a libcall not be a leaf function is still a good one.

@tertsdiepraam
Copy link
Contributor Author

I did in fact need that conclusion 😄 Thanks!

@tertsdiepraam
Copy link
Contributor Author

I'm looking at this again and I might need some help. I'm struggling to find where this information should originate from. Is there some struct that can keep track of a boolean or should we scan over the emitted instructions? Or maybe the dfg should also keep track of the libcalls that have been made in addition to the signatures and ext_funcs?

I'm a bit overwhelmed at the moment :)

@jameysharp
Copy link
Contributor

I'm looking at this again and I might need some help. I'm struggling to find where this information should originate from. Is there some struct that can keep track of a boolean …

Yeah, totally reasonable questions!

I would start by looking at how outgoing_args_size and tail_args_size get computed and used, since those are also properties of the function calls that appear in the current function.

In cranelift/codegen/src/machinst/abi.rs, there are methods on Callee named accumulate_outgoing_args_size and accumulate_tail_args_size. You can look for where these are called. Those are places where, while lowering the body of the function, we found a function call. I don't remember exactly but libcalls should be similar.

Note that if you're using rust-analyzer to navigate, there are some nearby parts of the call-graph that it can't find because they're in macros, so if you hit a point where things seem to be missing you might also want to try just searching for the function's name. Someday I hope to reduce our usage of macros…

Those *_args_size fields are stored on the Callee until Callee::compute_frame_layout is called, and then they're forwarded to the backend's implementation of ABIMachineSpec::compute_frame_layout.

There's another field on Callee that's also forwarded, is_leaf. Maybe the thing to do here is just set that to false if there's any kind of function call, including libcalls.

@jameysharp
Copy link
Contributor

After a little more investigation: the is_leaf field is currently set by calling FunctionStencil::is_leaf, which tries to guess whether there are function calls or libcalls, and which is not used anywhere else. Let's just delete that method, and compute the true answer during lowering.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cranelift:area:x64 Issues related to x64 codegen cranelift Issues related to the Cranelift code generator
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants