
HACKING.adoc: add tip about using linscan #10591

Closed

Conversation

edwintorok
Contributor

Timings for a full ./configure --disable-ocamldoc && make -j4 on FU740:

  • default: 1183s
  • linscan: 997s
  • flambda: 1465s

Building driver/main_args with ocamlopt.opt takes ages:

```
83.200s driver/main_args.ml
  00.127s parsing
    00.127s parser
    00.001s other
  01.398s typing
  00.143s transl
  81.531s generate
    00.058s cmm
    80.507s compile_phrases
      00.004s cmm_invariants
      00.317s selection
      00.020s comballoc
      00.886s cse
      00.399s liveness
      00.094s deadcode
      00.530s spill
      00.098s split
      77.901s regalloc
      00.028s linearize
      00.003s scheduling
      00.143s emit
      00.085s other
    00.612s assemble
    00.355s other
00.130s other
```

Using linscan is so much faster:

```
6.507s driver/main_args.ml
  0.125s parsing
    0.124s parser
  1.387s typing
  0.142s transl
  4.852s generate
    0.057s cmm
    3.796s compile_phrases
      0.004s cmm_invariants
      0.310s selection
      0.029s comballoc
      0.975s cse
      0.309s liveness
      0.084s deadcode
      0.223s spill
      0.073s split
      1.532s regalloc
      0.022s linearize
      0.003s scheduling
      0.150s emit
      0.083s other
    0.645s assemble
    0.354s other
0.126s other
```

Another way to speed up compilation of that file is to use a flambda build.
Although more time is spent in flambda, it gives the register allocator less
work:

```
6.176s driver/main_args.ml
  0.115s parsing
    0.115s parser
    0.001s other
  1.353s typing
  0.096s transl
  4.612s generate
    2.511s flambda
      2.511s middle_end
        0.256s closure_conversion
        0.134s lift_lets 1
        0.811s Lift_constants
        0.063s Share_constants
        0.024s Remove_unused_program_constructs
        0.134s Lift_let_to_initialize_symbol
        0.109s lift_lets 2
        0.039s Remove_unused_closure_vars 1
        0.488s Inline_and_simplify
        0.018s Remove_unused_closure_vars 2
        0.061s lift_lets 3
        0.292s Inline_and_simplify noinline
        0.024s Remove_unused_closure_vars 3
        0.032s Ref_to_variables
        0.002s Initialize_symbol_to_let_symbol
        0.020s Remove_unused_closure_vars
        0.004s other
    0.154s backend
    0.037s cmm
    0.895s compile_phrases
      0.003s cmm_invariants
      0.067s selection
      0.015s comballoc
      0.037s cse
      0.061s liveness
      0.015s deadcode
      0.059s spill
      0.022s split
      0.352s regalloc
      0.017s linearize
      0.051s scheduling
      0.100s emit
      0.096s other
    0.449s assemble
    0.567s other
0.156s other
```

However, overall, flambda is slower than the default build.
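For readers who want to reproduce this, linear scan can be requested per file with ocamlopt's `-linscan` flag, or injected into a make-based build through the `OCAMLPARAM` environment variable. The exact invocations below are a sketch, assuming a build that honours `OCAMLPARAM`:

```sh
# Per file: ask ocamlopt for linear scan register allocation.
ocamlopt -linscan -c driver/main_args.ml

# Whole build: OCAMLPARAM injects the flag into every compiler invocation;
# the trailing "_" marks where the regular command-line arguments go.
./configure --disable-ocamldoc
OCAMLPARAM="linscan=1,_" make -j4
```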

@edwintorok
Contributor Author

@xavierleroy this is the file I mentioned during ICFP that takes ages to compile on RISC-V (83s just for this file). For now I'd suggest adding a note to HACKING.adoc about it.


No change entry needed.

Signed-off-by: Edwin Török <edwin@etorok.net>
@xavierleroy
Contributor

Interesting. Maybe there's just not enough RAM and the compilation starts swapping? At any rate, here are the timings I observe on a fast Linux PC (AMD Ryzen 3700X, 32 GB RAM). You'll see that register allocation takes time, but only 3 times as much as typechecking, not 55 times as in your measurements.

```
1.913s driver/main_args.ml
  0.043s parsing
    0.043s parser
  0.291s typing
  0.043s transl
  1.537s generate
    0.006s cmm
    1.394s compile_phrases
      0.001s cmm_invariants
      0.092s selection
      0.003s polling
      0.002s comballoc
      0.165s cse
      0.035s liveness
      0.007s deadcode
      0.070s spill
      0.011s split
      0.976s regalloc
      0.002s linearize
      0.001s scheduling
      0.020s emit
      0.010s other
    0.040s assemble
    0.096s other
0.019s other
```

@gasche
Member

gasche commented Aug 30, 2021

Thinking out loud: we could have a module-level attribute to ask the compiler to use linscan for this module. (We could have a function-level attribute as well, but it is more work to implement and it does not work for the module-initialization code, which is often the culprit of bad regalloc behavior.) We would use the attribute on modules that are known to not play that well with the default allocator.
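No such attribute exists today; purely as an illustration (the attribute name `linscan` is invented here), a module opting in might look like the sketch below. Unknown attributes outside the reserved `ocaml.` namespace are currently ignored, so the file still compiles:

```ocaml
(* Hypothetical module-level attribute requesting linear scan register
   allocation for this compilation unit; the name is invented. *)
[@@@linscan]

(* A unit whose initialization code builds large literal structures is
   the kind of module that might opt in. *)
let table = Array.init 256 (fun i -> i * i)
let () = assert (table.(16) = 256)
```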

@edwintorok
Contributor Author

I have no problem with register allocation on amd64 either; the problem only appears on RISC-V. There is no swap activity (15 GiB of memory are free), and most of the time is spent in userspace:

```
         82046.90 msec task-clock                #    1.000 CPUs utilized
               176      context-switches          #    2.145 /sec
                 1      cpu-migrations            #    0.012 /sec
             44131      page-faults               #  537.875 /sec
       98122897529      cycles                    #    1.196 GHz
       31127094096      instructions              #    0.32  insn per cycle
   <not supported>      branches
   <not supported>      branch-misses

      82.068373321 seconds time elapsed

      81.169846000 seconds user
       0.875834000 seconds sys
```

RISC-V doesn't have a fully working perf implementation yet (the hardware performance counters lack a way to trigger interrupts), but software events can be used for profiling:

```
perf record -e cpu-clock -F 47 --call-graph=dwarf,65528 -v ./ocamlopt.opt -dtimings -g -nostdlib -I stdlib -I otherlibs/dynlink -strict-sequence -principal -absname -w +a-4-9-40-41-42-44-45-48-66-70 -warn-error +a -bin-annot -safe-string -strict-formats -I utils -I parsing -I typing -I bytecomp -I file_formats -I lambda -I middle_end -I middle_end/closure -I middle_end/flambda -I middle_end/flambda/base_types -I asmcomp -I driver -I toplevel -c driver/main_args.ml -I driver -S
perf script >profile.txt
```

which can then be processed with flamegraph scripts:

```
./stackcollapse-perf.pl <../profile.txt | ./stackcollapse-recursive.pl >|prof.txt
./flamegraph.pl prof.txt >prof.svg
./flamegraph.pl --reverse --inverted prof.txt >profi.svg
```

Here are two flamegraphs showing that Coloring.walk takes a long time:
https://gist.githubusercontent.com/edwintorok/08d9745819b44cc55ae8fe58755956a0/raw/eac4b7d359bc4ae5998defe1109597346ce3fe6c/prof.svg
https://gist.githubusercontent.com/edwintorok/08d9745819b44cc55ae8fe58755956a0/raw/eac4b7d359bc4ae5998defe1109597346ce3fe6c/profi.svg

My guess is that this happens because RISC-V has more allocatable registers than amd64, and the register allocator's complexity depends on the number of available registers in a non-linear way, but I haven't verified that.
I'll try artificially reducing the number of available registers to the same count as amd64 and see how that behaves.

@nojb
Contributor

nojb commented Aug 30, 2021

I confirm that compiling main_args.ml on RISC-V indeed looks a bit fishy. In the VM that we use for CI it takes ~250s (using bytecode ocamlopt):

```
[riscv@fedora-riscv ocaml]$ ./boot/ocamlrun ./ocamlopt -dprofile -g -nostdlib -I stdlib -I otherlibs/dynlink -strict-sequence -principal -absname -w +a-4-9-40-41-42-44-45-48-66-70 -warn-error +a -bin-annot -safe-string -strict-formats -I utils -I parsing -I typing -I bytecomp -I file_formats -I lambda -I middle_end -I middle_end/closure -I middle_end/flambda -I middle_end/flambda/base_types -I asmcomp -I driver -I toplevel  -c driver/main_args.ml -I driver
251.507s 5.67GB  139MB  143MB driver/main_args.ml
  000.446s 0.01GB 1.05MB 4.33MB parsing
    000.446s 0.01GB 1.05MB 4.33MB parser
    000.001s ------ ------ ------ other
  002.536s 0.06GB 10.9MB 15.2MB typing
  000.341s 0.01GB ------ ------ transl
  248.183s 5.58GB  127MB  143MB generate
    000.080s ------ 2.63MB 17.9MB cmm
    246.840s 5.56GB  122MB  140MB compile_phrases
      000.024s ------ ------ ------ cmm_invariants
      000.823s 0.04GB 6.52MB 24.4MB selection
      000.049s ------ ------ ------ polling
      000.044s ------ ------ ------ comballoc
      004.829s 0.03GB 4.01MB 28.4MB cse
      000.718s 0.01GB 5.30MB 33.7MB liveness
      000.160s ------ ------ ------ deadcode
      002.511s 0.02GB 4.61MB 38.3MB spill
      000.238s ------ ------ ------ split
      236.638s 5.42GB  102MB  140MB regalloc
      000.047s ------ ------ ------ linearize
      000.023s ------ ------ ------ scheduling
      000.324s ------ ------ ------ emit
      000.412s ------ ------ ------ other
    000.763s ------ ------ ------ assemble
    000.501s 0.01GB 2.29MB  143MB other
000.398s ------ 3.27MB ------ other
```

By comparison, typecore.ml takes ~30s

```
[riscv@fedora-riscv ocaml]$ ./boot/ocamlrun ./ocamlopt -dprofile -g -nostdlib -I stdlib -I otherlibs/dynlink -strict-sequence -principal -absname -w +a-4-9-40-41-42-44-45-48-66-70 -warn-error +a -bin-annot -safe-string -strict-formats -I utils -I parsing -I typing -I bytecomp -I file_formats -I lambda -I middle_end -I middle_end/closure -I middle_end/flambda -I middle_end/flambda/base_types -I asmcomp -I driver -I toplevel  -c typing/typecore.ml -I typing
30.635s  729MB 58.5MB 61.8MB typing/typecore.ml
  01.645s 44.0MB 5.45MB 8.73MB parsing
    01.644s 44.0MB 5.45MB 8.73MB parser
    00.001s ------ ------ ------ other
  11.959s  274MB 31.9MB 40.6MB typing
  01.698s 55.8MB ------ ------ transl
  15.333s  354MB 21.1MB 61.8MB generate
    00.516s 23.4MB ------ ------ cmm
    12.125s  292MB 15.0MB 55.7MB compile_phrases
      00.034s 0.21MB ------ ------ cmm_invariants
      00.647s 24.7MB ------ ------ selection
      00.082s 2.59MB ------ ------ polling
      00.070s 3.00MB ------ ------ comballoc
      00.396s 15.3MB ------ ------ cse
      00.992s 24.7MB ------ ------ liveness
      00.206s 7.19MB ------ ------ deadcode
      01.102s 31.5MB ------ ------ spill
      00.329s 6.00MB ------ ------ split
      06.924s  161MB 15.0MB 55.7MB regalloc
      00.074s 2.90MB ------ ------ linearize
      00.033s 0.19MB ------ ------ scheduling
      00.680s 7.20MB ------ ------ emit
      00.556s 4.88MB ------ ------ other
    01.332s ------ ------ ------ assemble
    01.361s 38.7MB 6.10MB 61.8MB other
00.409s 6.02MB 3.27MB ------ other
```

@jhjourdan
Contributor

> Thinking out loud: we could have a module-level attribute to ask the compiler to use linscan for this module. (We could have a function-level attribute as well, but it is more work to implement and it does not work for the module-initialization code, which is often the culprit of bad regalloc behavior.) We would use the attribute on modules that are known to not play that well with the default allocator.

Since performance is rarely a problem for the module initialization code, we could also decide to use linscan by default for module initialization, and the standard register allocator for functions.
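That policy could be sketched as follows (all names invented; the real compiler would make this choice wherever it selects the register allocation pass):

```ocaml
type allocator = Graph_coloring | Linscan

(* Hypothetical policy: module-initialization code gets linear scan,
   since its runtime performance rarely matters, while ordinary
   functions keep the default graph coloring allocator. *)
let choose_allocator ~is_module_init =
  if is_module_init then Linscan else Graph_coloring

let () =
  assert (choose_allocator ~is_module_init:true = Linscan);
  assert (choose_allocator ~is_module_init:false = Graph_coloring)
```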

@gasche
Member

gasche commented Aug 31, 2021

> Since performance is rarely a problem for the module initialization code, we could also decide to use linscan by default for module initialization, and the standard register allocator for functions.

Why not, but (1) we don't know that module-initialization is causing the issue here, and (2) I'm uneasy about a default that uses two different register allocators for all programs, that sounds like doubling the complexity budget and bug surface for this part of the compiler.

(Now that some people have decent benchmark infrastructures for OCaml programs, it would be interesting to get some objective performance numbers on the overhead of linscan.)

@xavierleroy
Contributor

> My guess is that this happens because RISC-V has more allocatable registers than amd64, and the register allocator's complexity depends on the number of available registers in a non-linear way, but I haven't verified that.

You're correct that the graph coloring register allocator has a number of non-linear behaviors... It is definitely the case that RISC-V has more allocatable registers than AMD64. However, ARM64 is very comparable to RISC-V in this department, and when I compile main_args.ml on a Mac M1, regalloc takes 6 times longer than typing, still far from 55 times as in your RISC-V measurements.

```
2.126s driver/main_args.ml
  0.033s parsing
    0.033s parser
  0.216s typing
  0.029s transl
  1.848s generate
    0.004s cmm
    1.682s compile_phrases
      0.063s selection
      0.003s polling
      0.001s comballoc
      0.161s cse
      0.029s liveness
      0.007s deadcode
      0.054s spill
      0.010s split
      1.318s regalloc
      0.001s linearize
      0.024s emit
      0.009s other
    0.131s assemble
    0.030s other
0.051s other
```

So, there's a performance bug very specific to RISC-V that must be investigated and understood. Only then can we discuss changes.

@nojb
Contributor

nojb commented Sep 5, 2021

> So, there's a performance bug very specific to RISC-V that must be investigated and understood. Only then can we discuss changes.

I investigated a bit: the main issue seems to be that under RISC-V the liveness information for camlMain_args__entry is much bigger (6x) than for the other architectures. This causes the register allocator to work much harder than expected. Still no idea why there is this difference.

@nojb
Contributor

nojb commented Sep 5, 2021


The problem is that CSE was being applied to integer constants outside the range of immediate operands which was significantly lengthening the life of temporaries holding such constants.

See #10608 for the fix.
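The principle behind the fix can be sketched as a predicate (an illustration only, not the actual compiler code; the 32-bit-signed bound is the one the eventual commit uses, and the literals assume a 64-bit host):

```ocaml
(* Constants that fit in 32 bits signed are cheap to rematerialize, so
   CSE should not extend their live ranges by sharing them; only
   expensive multi-instruction constants are worth sharing. *)
let fits_in_32_bits_signed n = n >= -0x8000_0000 && n <= 0x7FFF_FFFF

let worth_sharing n = not (fits_in_32_bits_signed n)

let () =
  assert (not (worth_sharing 42));        (* small: load immediate again *)
  assert (worth_sharing 0x1_0000_0000)    (* big: keep in a register *)
```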

@edwintorok
Contributor Author

Thanks a lot for tracking this down, it is indeed a lot faster now on native RISC-V too:

```
16.310s driver/main_args.ml
  00.126s parsing
    00.125s parser
  01.421s typing
  00.149s transl
  14.615s generate
    00.059s cmm
    13.573s compile_phrases
      00.004s cmm_invariants
      00.353s selection
      00.017s comballoc
      00.900s cse
      00.239s liveness
      00.091s deadcode
      00.378s spill
      00.104s split
      11.107s regalloc
      00.037s linearize
      00.074s scheduling
      00.186s emit
      00.084s other
    00.612s assemble
    00.370s other
00.129s other
```

@xavierleroy
Contributor

> The problem is that CSE was being applied to integer constants outside the range of immediate operands which was significantly lengthening the life of temporaries holding such constants.

For information: the same issue occurs on 32-bit ARM, with the same consequences (register allocation takes a long time on main_args.ml).

@gasche
Member

gasche commented Sep 6, 2021

If I understand correctly, if #10608 were to be approved as a correct fix and merged, we could close the present issue.

xavierleroy added a commit that referenced this pull request Sep 9, 2021
Never do CSE for constants that fit in 32 bits signed.

Do it only for constants that do not fit, and only on some 64-bit targets
where the code generated for Iconst_int can be costly (ARM64, POWER, RISC-V).

Fixes: #10591