HACKING.adoc: add tip about using linscan #10591
@xavierleroy this is the file I mentioned during ICFP that takes ages to compile on RISC-V (83s just for this file). For now I'd suggest adding a note to HACKING.adoc about it.
Timings for a full ./configure --disable-ocamldoc && make -j4 on FU740:

* default: 1183s
* linscan: 997s
* flambda: 1465s

Building driver/main_args with ocamlopt.opt takes ages:

```
83.200s driver/main_args.ml
  00.127s parsing
    00.127s parser
    00.001s other
  01.398s typing
  00.143s transl
  81.531s generate
    00.058s cmm
    80.507s compile_phrases
      00.004s cmm_invariants
      00.317s selection
      00.020s comballoc
      00.886s cse
      00.399s liveness
      00.094s deadcode
      00.530s spill
      00.098s split
      77.901s regalloc
      00.028s linearize
      00.003s scheduling
      00.143s emit
      00.085s other
    00.612s assemble
    00.355s other
00.130s other
```

Using linscan is so much faster:

```
6.507s driver/main_args.ml
  0.125s parsing
    0.124s parser
  1.387s typing
  0.142s transl
  4.852s generate
    0.057s cmm
    3.796s compile_phrases
      0.004s cmm_invariants
      0.310s selection
      0.029s comballoc
      0.975s cse
      0.309s liveness
      0.084s deadcode
      0.223s spill
      0.073s split
      1.532s regalloc
      0.022s linearize
      0.003s scheduling
      0.150s emit
      0.083s other
    0.645s assemble
    0.354s other
0.126s other
```

Another way to speed up compilation of that file is to use an flambda build.
Although more time is spent in flambda, it gives less work to the register allocator:

```
6.176s driver/main_args.ml
  0.115s parsing
    0.115s parser
    0.001s other
  1.353s typing
  0.096s transl
  4.612s generate
    2.511s flambda
      2.511s middle_end
        0.256s closure_conversion
        0.134s lift_lets 1
        0.811s Lift_constants
        0.063s Share_constants
        0.024s Remove_unused_program_constructs
        0.134s Lift_let_to_initialize_symbol
        0.109s lift_lets 2
        0.039s Remove_unused_closure_vars 1
        0.488s Inline_and_simplify
        0.018s Remove_unused_closure_vars 2
        0.061s lift_lets 3
        0.292s Inline_and_simplify noinline
        0.024s Remove_unused_closure_vars 3
        0.032s Ref_to_variables
        0.002s Initialize_symbol_to_let_symbol
        0.020s Remove_unused_closure_vars
        0.004s other
    0.154s backend
    0.037s cmm
    0.895s compile_phrases
      0.003s cmm_invariants
      0.067s selection
      0.015s comballoc
      0.037s cse
      0.061s liveness
      0.015s deadcode
      0.059s spill
      0.022s split
      0.352s regalloc
      0.017s linearize
      0.051s scheduling
      0.100s emit
      0.096s other
    0.449s assemble
    0.567s other
0.156s other
```

However, overall flambda is slower than the default.

No change entry needed.

Signed-off-by: Edwin Török <edwin@etorok.net>
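To put the three profiles above side by side, here is a small sketch that computes how register allocation compares with typechecking and with the total per-file time. The figures are copied from the profiles quoted above; the record type and function names are made up for this illustration and are not part of the compiler.

```ocaml
(* Timings in seconds for driver/main_args.ml, copied from the profiles
   quoted above. The [profile] type and the helpers are hypothetical. *)
type profile = {
  name : string;
  total : float;     (* whole-file time *)
  typing : float;    (* "typing" line *)
  regalloc : float;  (* "regalloc" line *)
}

let profiles = [
  { name = "default"; total = 83.200; typing = 1.398; regalloc = 77.901 };
  { name = "linscan"; total = 6.507; typing = 1.387; regalloc = 1.532 };
  { name = "flambda"; total = 6.176; typing = 1.353; regalloc = 0.352 };
]

(* How many times longer register allocation takes than typechecking. *)
let regalloc_vs_typing p = p.regalloc /. p.typing

(* Fraction of the per-file time spent in register allocation. *)
let regalloc_share p = p.regalloc /. p.total

let () =
  List.iter
    (fun p ->
      Printf.printf "%-8s regalloc/typing = %5.1fx  share of total = %4.1f%%\n"
        p.name (regalloc_vs_typing p) (100. *. regalloc_share p))
    profiles
```

With the default allocator, register allocation dominates the file's compile time on this machine, while typechecking time is nearly identical across the three builds.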
Interesting. Maybe there's just not enough RAM and the compilation starts swapping? At any rate, here are the timings I observe on a robust Linux PC (AMD Ryzen 3700X, 32G RAM). You'll see that register allocation takes time, but only 3 times as much as typechecking, not 55 times as in your measurements.
Thinking out loud: we could have a module-level attribute to ask the compiler to use linscan for this module. (We could have a function-level attribute as well, but it is more work to implement and it does not work for the module-initialization code, which is often the culprit of bad regalloc behavior.) We would use the attribute on modules that are known to not play that well with the default allocator.
I have no problem with register allocation on amd64 either, the problem is only on RISC-V. No swap activity (there is 15GiB of mem free), and most time spent in userspace:
RISC-V doesn't have a fully working perf implementation yet (hardware perf counters lack a way to trigger interrupts), but software events can be used to measure perf:
which can then be processed with flamegraph scripts:
Here are 2 flamegraphs showing that Coloring.walk takes a long time:

My guess is that this happens because RISC-V has more allocatable registers than amd64, and the register allocator's complexity depends on the number of available registers in a non-linear way, but I haven't verified that.
I confirm that compiling
By comparison,
Since performance is rarely a problem for the module initialization code, we could also decide to use linscan by default for module initialization, and the standard register allocator for functions.
Why not, but (1) we don't know that module-initialization is causing the issue here, and (2) I'm uneasy about a default that uses two different register allocators for all programs, that sounds like doubling the complexity budget and bug surface for this part of the compiler. (Now that some people have decent benchmark infrastructures for OCaml programs, it would be interesting to get some objective performance numbers on the overhead of linscan.)
You're correct that the graph coloring register allocator has a number of non-linear behaviors... It is definitely the case that RISC-V has more allocatable registers than AMD64. However, ARM64 is very comparable to RISC-V in this department, and when I compile
So, there's a performance bug very specific to RISC-V that must be investigated and understood. Only then can we discuss changes.
I investigated a bit: the main issue seems to be that under RISC-V the liveness information for
The problem is that CSE was being applied to integer constants outside the range of immediate operands, which significantly lengthened the lifetime of the temporaries holding such constants. See #10608 for the fix.
Thanks a lot for tracking this down, it is indeed a lot faster now on native RISC-V too:
For information: the same issue occurs on ARM 32 bits, with the same consequences (register allocation takes a long time on main_args.ml).
If I understand correctly, if #10608 were to be approved as a correct fix and merged, we could close the present issue.
Never do CSE for constants that fit in 32 bits signed. Do it only for constants that do not fit, and only on some 64-bit targets where the code generated for Iconst_int can be costly (ARM64, POWER, RISC-V). Fixes: #10591
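The rule in this commit message can be sketched as a small predicate. This is only an illustration of the decision described above, not the code from #10608: the `target` type, the function names, and the choice of `Amd64` as an example of an unlisted target are all assumptions made for the sketch.

```ocaml
(* Hypothetical target enumeration; not the compiler's own types. *)
type target = Amd64 | Arm64 | Power | Riscv

(* Does the constant fit in a signed 32-bit integer? *)
let fits_in_int32 (n : int64) =
  n >= (-0x8000_0000L) && n <= 0x7FFF_FFFFL

(* 64-bit targets where materializing a large integer constant can be
   costly, per the commit message. Amd64 is included here only as an
   example of a target that is not on that list (an assumption). *)
let large_constants_are_costly = function
  | Arm64 | Power | Riscv -> true
  | Amd64 -> false

(* Should a load of integer constant [n] be considered for CSE?
   Never for constants that fit in 32 bits signed; otherwise only on
   the targets listed above. *)
let cse_candidate target n =
  (not (fits_in_int32 n)) && large_constants_are_costly target

let () =
  List.iter
    (fun (t, name) ->
      Printf.printf "%-6s small: %b  large: %b\n" name
        (cse_candidate t 42L)
        (cse_candidate t 0x1_0000_0000L))
    [ (Amd64, "amd64"); (Arm64, "arm64"); (Power, "power"); (Riscv, "riscv") ]
```

Keeping small constants out of CSE means each use can rematerialize the constant instead of keeping a temporary live across the whole function, which is what reduces the pressure on the register allocator in the scenario discussed here.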