Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Small difference makes suspicion performance decreasing #8573

Open
hungryzzz opened this issue May 7, 2024 · 20 comments
Open

Small difference makes suspicion performance decreasing #8573

hungryzzz opened this issue May 7, 2024 · 20 comments

Comments

@hungryzzz
Copy link
Contributor

hungryzzz commented May 7, 2024

Test Cases

cases.zip

Steps to Reproduce

Hi, I run the attached two cases(good.wasm&bad.wasm) in Wasmtime, WasmEdge(AOT) and Wamr(AOT), and collect their execution time respectively (measured by time tool).

Expected Results & Actual Results

For good.wasm, the execution time in different runtimes are as follows:

  • Wasmtime: 4.67s
  • WasmEdge: 2.81s
  • Wamr: 3.29s

For bad.wasm, the execution time in different runtimes are as follows:

  • Wasmtime: 9.34s
  • WasmEdge: 2.35s
  • Wamr: 3.29s

The difference between the attached two cases is as follow, i.e., changing the address of i32.load from local.get 0 to i32.const 0, the case with i32.const 0 has the bad performance in Wasmtime, whose execution time is 2x slower than the former one. I think the decreasing is confused and suspicious, why is an immediate operation slower than a local variable operation? Moreover, the i32.load is out of the loop, I cannot understand why the operation could affect the overall performance so vastly.

截屏2024-05-07 23 19 36

Versions and Environment

Wasmtime version or commit: d0cf46a

Operating system: Linux ringzzz-OptiPlex-7070 5.15.0-97-generic

Architecture: Intel(R) Core(TM) i5-9500T CPU @ 2.20GHz

@cfallin
Copy link
Member

cfallin commented May 7, 2024

cc @fitzgen , perhaps this is something to do with our bounds-checking cases?

@fitzgen
Copy link
Member

fitzgen commented May 7, 2024

@hungryzzz can you share how you are configuring and executing the Wasm in Wasmtime? I.e. the exact flags you are passing to the CLI or the exact Rust code that you are using when embedding Wasmtime if you aren't using the CLI.

cc @fitzgen , perhaps this is something to do with our bounds-checking cases?

With the default configuration of Wasmtime we should be eliding basically all bounds checks, I'd be surprised if this was bounds-check related unless guard pages are being disabled or something.

@hungryzzz
Copy link
Contributor Author

@fitzgen I use the CLI to run the case, the command I use is wasmtime good.wasm without extra flags.

@fitzgen
Copy link
Member

fitzgen commented May 7, 2024

@hungryzzz ah okay, that means that you are measuring compilation time as well, not just execution time.

If you want to measure execution performance specifically, you should instead do something like

$ wasmtime compile test.wasm -o test.cwasm
$ time wasmtime run --allow-precompiled test.cwasm

Wasmtime is technically a JIT in the sense that it can compile code on demand and map it executable into the same process that did the compiling right before executing the code. But it is not a JIT in the same sense that engines like V8 and SpiderMonkey are, where there are speculative optimizations, lazy tiering up based on which functions are hot, dynamic deoptimization when speculative assumptions fail, etc... Given all that, I'd highly recommend the two-phase process I sketched above, where the Wasm is first compiled and then second the compiled code is executed (and this second part is what you are timing).

@hungryzzz
Copy link
Contributor Author

hungryzzz commented May 8, 2024

@fitzgen Thank you for your advice, I recollect the execution time with the above commands and the results are as follows. I guess the JITed code would be cached after the first compilation, so the execution results are nearly the same as the above one.

case ✗ wasmtime compile good.wasm -o good.cwasm
➜  case ✗ wasmtime compile bad.wasm -o bad.cwasm
➜  casetime wasmtime run --allow-precompiled good.cwasm
~/wasmtime/target/release/wasmtime run  good.cwasm  4.67s user 0.00s system 100% cpu 4.674 total
➜  casetime wasmtime run --allow-precompiled bad.cwasm
~/wasmtime/target/release/wasmtime run  bad.cwasm  9.36s user 0.01s system 100% cpu 9.365 total

@cfallin
Copy link
Member

cfallin commented May 8, 2024

@hungryzzz a request for your posts: would you mind posting your results as plain text (or formatted as appropriate), rather than PNG screenshots of a terminal?

There are several reasons for this:

  • Plain text shows up in email clients for folks who have GitHub email notifications, whereas images may not load at all, or require an extra action to load (personally the case for me: I have to click through to see your content)
  • Plain text follows the configuration of the user's browser, including font-sizes, color and contrast settings, etc., some of which are not just cosmetic but necessary for folks with vision issues (which will be most of us eventually!)
  • Plain text can be searched (local Ctrl-F, and also indexed by search engines and GitHub's search functionality), copied and pasted, and otherwise handled as needed, whereas a PNG is just pixels
  • Plain text is visible to screen-readers and other tools that inspect the semantic page contents, whereas an image is not

It may take ever so slightly longer (copy & paste, use markdown triple-backtick quotes) but it's a great kindness to readers here for all the reasons above, and we want to set a standard here for maximally accessible and usable discussion content. Thanks!

(EDIT: thanks for your edit! I'll leave this here for posterity)

@primoly
Copy link

primoly commented May 9, 2024

Probably irrelevant, but could code alignment play a role? I noticed that both modules compile to basically the same instructions, just assigned different registers. Yet for func1, good gets more operations aligned to four bytes than bad.

bad.cwasm func1 x86-64: 60 instructions, 17 start at addresses divisible by four (28%)
good.cwasm func1 x86-64: 62 instructions, 25 start at addresses divisible by four (40%)

Unsurprisingly no difference for func2 since it generates identical instructions for better or worse... I mean good and bad.

@hungryzzz
Copy link
Contributor Author

I use Perf tool to profile the execution time and surprisingly find that the hotspot is not in func1 where the small difference happens, but in the func2. I continue to check the binary compiled by Cranelift and find that the binary of func2 is the same before and after the small difference. It is really confused to me how the difference in func1 could affect the execution of func2.

Samples: 38K of event 'cycles', Event count (approx.): 34572923013
Overhead  Command          Shared Object        Symbol
  74.91%  wasmtime         jitted-4044282-1.so  [.] 000000000000004f                                                                                             ◆
  24.63%  wasmtime         jitted-4044282-1.so  [.] 0000000000000071                                                                                             ▒
   0.32%  wasmtime         jitted-4044282-1.so  [.] 0x0000000000000067                                                                                           ▒
   0.01%  wasmtime         [kernel.kallsyms]    [k] __handle_mm_fault                                                                                            ▒
   0.01%  wasmtime         [kernel.kallsyms]    [k] sync_regs                                                                                                    ▒
   0.01%  wasmtime         [kernel.kallsyms]    [k] native_irq_return_iret                                                                                       ▒
   0.01%  wasmtime         [kernel.kallsyms]    [k] vm_unmapped_area                                                                                             ▒
Samples: 38K of event 'cycles', 4000 Hz, Event count (approx.): 34572923013
000000000000004f  /home/ringzzz/performance_debug/case1-wasmtime/wasmtime-perf/jitted-4044282-1.so [Percent: local period]
Percent│
       │
       │
       │     Disassembly of section .text:
       │
       │     0000000000000040 <wasm[0]::function[2]>:
       │     wasm[0]::function[2]():
       │       push %rbp
       │       mov  %rsp,%rbp
       │       xor  %eax,%eax
       │       mov  %rax,%r8
       │ 3a:   cmp  $0xffff,%eax
       │     ↓ jne  58
       │       mov  0x80(%rdi),%rdi
       │       movl $0x1,(%rdi)
       │       mov  %rbp,%rsp
       │       pop  %rbp
       │     ← retq
       │ 58:   lea  0x1(%r8),%edx
       │       mov  %r8,%rax
       │       mov  %rdx,%r8
       │     ↑ jmpq 3a
# the commands to profile
perf record -k mono ~/wasmtime/target/release/wasmtime run --profile jitdump --allow-precompiled bad.cwasm
perf inject --jit --input perf.data --output perf.jit.data
perf report --input perf.jit.data --no-children

@fitzgen
Copy link
Member

fitzgen commented May 14, 2024

It is really confused to me how the difference in func1 could affect the execution of func2.

Is it changing the alignment of func2? We have both preferred and required alignments inside Cranelift, and it is possible that something is getting bumped from its preferred down to just its required alignment.

@alexcrichton
Copy link
Member

Testing locally I've got the "good" function 2 starting at 0x140 and "bad" starts at 0x130. That makes sense to me because we align functions to 16-bytes and the "bad" version has more constant propagation so function is a bit smaller in the "bad" version, hence starting a bit earlier.

This program appears to be effectively a benchmark of the second function, which you've pasted the machine code above. In perf locally the jmpq and jne instructions are the hottest instructions. My suspicion is that this is CPU-specific for you (I can't reproduce the slowdown myself) where it's falling over some branch predictor heuristics.

That being said in general the "function 2" here is a pretty bad layout choice for cranelift. The loop is spread across the epilogue instead of tightly put together which might omit a jump. Cranelift can probably do better here in terms of block layout on the second function and that might help branch predictor heuristics on your CPU as well

@cfallin
Copy link
Member

cfallin commented May 14, 2024

FWIW, the usual "fetch bandwidth sin" to avoid is forcing partial fetch lines -- the CPU frontend grabs an aligned 32B or 64B chunk at a time so if a function starts halfway through a fetch line, we're wasting frontend bandwidth. I suspect a 32B function alignment would be a pretty reasonable default in general -- is there a reason we only require 16B now? Did we find it mattered a lot for object file bloat or somesuch?

@alexcrichton
Copy link
Member

Looks like the 16-byte alignment was added in #4826 as the minimum require for functions at the time to properly align constants in the constant pool. I don't think 32-bytes was specifically rejected in favor of a smaller size. Looks. like aarch64 is aligned to 32-bytes.

@cfallin
Copy link
Member

cfallin commented May 14, 2024

Cool -- @hungryzzz, I think we'd be happy to take a PR to change x86-64's function alignment to 32 bytes as well, if you'd like to create one!

@hungryzzz
Copy link
Contributor Author

hungryzzz commented May 15, 2024

Is it changing the alignment of func2?

Yes, the machine codes of func2 before and after changes are the same excpet the start address of the func2.

# machine code of `func2` in `bad.wasm` generated by Cranelift
0000000000000130 <wasm[0]::function[2]>:
     130:	55                   	push   %rbp
     131:	48 89 e5             	mov    %rsp,%rbp
     134:	31 c0                	xor    %eax,%eax
     136:	49 89 c0             	mov    %rax,%r8
     139:	81 f8 ff ff 00 00    	cmp    $0xffff,%eax
     13f:	0f 85 12 00 00 00    	jne    157 <wasm[0]::function[2]+0x27>
     145:	48 8b bf 80 00 00 00 	mov    0x80(%rdi),%rdi
     14c:	c7 07 01 00 00 00    	movl   $0x1,(%rdi)
     152:	48 89 ec             	mov    %rbp,%rsp
     155:	5d                   	pop    %rbp
     156:	c3                   	retq   
     157:	41 8d 50 01          	lea    0x1(%r8),%edx
     15b:	4c 89 c0             	mov    %r8,%rax
     15e:	49 89 d0             	mov    %rdx,%r8
     161:	e9 d3 ff ff ff       	jmpq   139 <wasm[0]::function[2]+0x9>

# machine code of `func2` in `good.wasm` generated by Cranelift
0000000000000120 <wasm[0]::function[2]>:
     120:	55                   	push   %rbp
     121:	48 89 e5             	mov    %rsp,%rbp
     124:	31 c0                	xor    %eax,%eax
     126:	49 89 c0             	mov    %rax,%r8
     129:	81 f8 ff ff 00 00    	cmp    $0xffff,%eax
     12f:	0f 85 12 00 00 00    	jne    147 <wasm[0]::function[2]+0x27>
     135:	48 8b bf 80 00 00 00 	mov    0x80(%rdi),%rdi
     13c:	c7 07 01 00 00 00    	movl   $0x1,(%rdi)
     142:	48 89 ec             	mov    %rbp,%rsp
     145:	5d                   	pop    %rbp
     146:	c3                   	retq   
     147:	41 8d 50 01          	lea    0x1(%r8),%edx
     14b:	4c 89 c0             	mov    %r8,%rax
     14e:	49 89 d0             	mov    %rdx,%r8
     151:	e9 d3 ff ff ff       	jmpq   129 <wasm[0]::function[2]+0x9>

In perf locally the jmpq and jne instructions are the hottest instructions. My suspicion is that this is CPU-specific for you (I can't reproduce the slowdown myself) where it's falling over some branch predictor heuristics.

Hi, could I ask why the function alignment could affect the CPU branch prediction? Thanks!

@cfallin
Copy link
Member

cfallin commented May 15, 2024

Hi, could I ask why the function alignment could affect the CPU branch prediction? Thanks!

I addressed this a few comments up:

FWIW, the usual "fetch bandwidth sin" to avoid is forcing partial fetch lines -- the CPU frontend grabs an aligned 32B or 64B chunk at a time so if a function starts halfway through a fetch line, we're wasting frontend bandwidth. I suspect a 32B function alignment would be a pretty reasonable default in general -- is there a reason we only require 16B now? Did we find it mattered a lot for object file bloat or somesuch?

The problem isn't actually with branch prediction, but with the way the instruction fetch works on most modern superscalar machines. (In more detail, fetches are big aligned chunks because it's fewer wires and transistors to grab only an aligned chunk from an icache line rather than any arbitrary alignment with arbitrary shifting.)

Would you be willing to send a PR to update the alignment on x86-64? Someone else can do it if not, but hopefully it's an easy-ish change!

@hungryzzz
Copy link
Contributor Author

Would you be willing to send a PR to update the alignment on x86-64? Someone else can do it if not, but hopefully it's an easy-ish change!

Hi, I am really willing to do it, but I am not familiar with Rust and code generation implementation, could you please give me some references or advices to help me start it?

@cfallin
Copy link
Member

cfallin commented May 15, 2024

I think it's just this constant that needs to be changed; probably with a comment update noting that we're aligning for better formance and a link to this issue?

@hungryzzz
Copy link
Contributor Author

The problem isn't actually with branch prediction, but with the way the instruction fetch works on most modern superscalar machines.

The func2 would be called many times, in my opinion, once the instructions of func2 have been fetched into the icache, wouldn't it be necessary to fetch them again when calling func2 again?

@hungryzzz
Copy link
Contributor Author

I think it's just this constant that needs to be changed; probably with a comment update noting that we're aligning for better formance and a link to this issue?

Thank you very much! I'll try it!

@cfallin
Copy link
Member

cfallin commented May 15, 2024

The problem isn't actually with branch prediction, but with the way the instruction fetch works on most modern superscalar machines.

The func2 would be called many times, in my opinion, once the instructions of func2 have been fetched into the icache, wouldn't it be necessary to fetch them again when calling func2 again?

Fetch is part of the CPU pipeline, and happens every time an instruction executes: the instructions are fetched from the icache into the pipeline (and then decoded, renamed, scheduled, executed, ...). So fetch bottlenecks affect even inner-loop performance. The step that puts instructions into the icache on a miss is usually called "cache fill" and happens in a separate pipeline.

(This is slightly less true on machines with uop caches and/or loop stream buffers, but those have their own limitations as well)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants