Modifications to the native compilation of exception handlers #1482

lthls · 2017-11-15T17:58:39Z

Overview

This pull request is extracted from the ongoing flambda branch, and contains two main features:

A change to the representation of try-with constructs from Clambda onwards
A change to the compilation of catch handlers (including exception handlers) that aims at improving
performance in the exception case

Removal of Trywith constructs

This pull request replaces the Utrywith construct from Clambda, and replaces it with a more generalized Ucatch construct, and Upushtrap and Upoptrap to denote entering and exiting a try-with body.

The pushtrap/poptrap approach was already used late in the compilation chain, the reason for trying to push it as early as Clambda is that the ongoing work on flambda has shown the necessity to break the trywith blocks to do some optimizations, and reconstructing trywith blocks at the end of the middle-end is not always possible.

It would have been feasible to retain both the trywith and traps based constructs, but we feel this would only lead to more maintaining load (as some paths in the backend are only taken when flambda is enabled or disabled), and translation of a trywith construct to catch and traps is fairly straightforward (with a special case for exit/staticfail from within a trywith block to outside it; the logic that was present in the Iexit case in linearize.ml has been moved earlier to compensate).

To address the safety problems that such a change implies, three invariants checking modules have been added:

Cmm_invariants checks a few continuations-related invariants (on both catch/exit continuations and trap continuations) on the cmm representation. This was already proposed in a separate pull request (Cmm invariants #1400), but this pull request extends the invariants with trap-related ones and makes the invariant checks mandatory.
Trap_analysis annotates mach instructions with trap stack information. Since no pushtrap/poptrap instructions will be added by the backend, it also checks consistency on trap stacks at exit points. This should address the problems noted in an earlier discussion of the feature about the exception edges of the control flow graph not being explicit enough.
Linear_invariants checks again consistency of trap depths on the linear representation, in particular that extra exit instructions added during the transformation from mach are consistent with the rest of the program.

Changes to the compilation of catch constructs

When compiling a catch handler to linear instructions, the previous choice was to put the handler(s) first, then the actual body, with an extra jump statement to the body code added before the handlers.
This pull request inverts the order, so writes the body first, then the handlers, and of course adds an extra jump at the end of the body to the following code.

This was motivated by a suspicion that the use of call and return x86 instructions to respectively branch into the body and go back to the handler interfered with the prediction of returns that was done in the processor, causing branch misses. With this patch, going to the body is a simple fallthrough, going to the handler in the raise case is a pop followed by a jmp, and additional work is done at the beginning of the body to compute and store the address of the handler on the stack.

To give an idea of how much time this could save, we ran the following micro-benchmark:

let test_gen () =
  let count = ref 0 in
  fun () ->
    incr count;
    !count land 1 = 0

let test1 = test_gen ()

let test2 = test_gen ()

let do_it n =
  let f () =
    try if test1 () then raise Not_found else ()
    with Not_found when test2 () -> ()
  in
  for i = 1 to n do
    try f () with Not_found -> ()
  done

let _ =
  let default = 100_000 in
  let n =
    if Array.length Sys.argv > 1
    then
      try int_of_string Sys.argv.(1)
      with _ -> default
    else default
  in
  do_it n

through the following commands:

$ ./ocamlopt.opt -o microbench microbench.ml
$ perf stat ./microbench 1000000000

and we got the following results:
On trunk:


 Performance counter stats for './microbench 1000000000':

       9936.470484      task-clock (msec)         #    1.000 CPUs utilized          
                35      context-switches          #    0.004 K/sec                  
                 3      cpu-migrations            #    0.000 K/sec                  
                98      page-faults               #    0.010 K/sec                  
    34,612,388,085      cycles                    #    3.483 GHz                    
    46,758,631,959      instructions              #    1.35  insn per cycle         
     7,751,627,648      branches                  #  780.119 M/sec                  
       750,012,504      branch-misses             #    9.68% of all branches        

       9.937585757 seconds time elapsed

With this patch:


 Performance counter stats for './microbench 1000000000':

       4475.919299      task-clock (msec)         #    1.000 CPUs utilized          
                12      context-switches          #    0.003 K/sec                  
                 2      cpu-migrations            #    0.000 K/sec                  
                97      page-faults               #    0.022 K/sec                  
    15,382,062,427      cycles                    #    3.437 GHz                    
    49,754,449,689      instructions              #    3.23  insn per cycle         
     6,500,840,769      branches                  # 1452.403 M/sec                  
            21,738      branch-misses             #    0.00% of all branches        

       4.477019523 seconds time elapsed

The huge gains in execution time can be linked to the big difference in branch-misses, with roughly one branch-miss for every raise in the program.

Tests done on more complex programs are of course much less impressive but still show improvement, and more serious benchmarking is being done using the flambda benchmarks infrastructure. Results will be linked to in this thread once they are available.

In addition to the benchmarking, testing of the patched emitters has been done on all of the supported architectures. More precisely, amd64, arm, arm64, and power were tested on real hardware (by running the whole testsuite), i386 was tested on amd64 by configuring the compiler to produce 32-bits code, and s390x was tested through qemu (some of the lib-threads tests did not finish, likely due to the qemu scheduler, but after disabling them other tests passed fine).

About separating the patches

While the two changes were described separately, they are submitted in a single pull request because they correspond to the exceptions-related changes in the flambda branch that are not dependent on the middle-end. We would definitely prefer to merge them together, rather than separately, and we have not tested the impact of one part without the other.

cgaebel · 2017-11-16T15:33:07Z

asmcomp/amd64/emit.mlp

@@ -171,6 +171,13 @@ let emit_Llabel fallthrough lbl =
  if not fallthrough && !fastcode_flag then D.align 4;
  def_label lbl

+let load_label_addr s arg =
+  (* CR mshinwell: this needs more testing *)


Is this patch supposed to still have CRs in it?

This still needs to be checked on the various targets (and target variants) to make sure that the correct instruction sequence is being generated.

xavierleroy · 2017-11-18T17:53:57Z

I'm in favor of changing the linearization of try...with constructs so that no call instruction is used and the body is in fallthrough-unconditional branch position.

I'm absolutely not convinced that the current, structured representation of try...with should be replaced by an unstructured representation based on push-handler and pop-handler. You're losing structure, which is rarely if ever a good thing, and you're not gaining anything as far as I can see.

lthls · 2017-11-21T18:04:11Z

The restrictions of the try-with structure can be shown of the following example:

let x =
  if (condition) then
    begin (before1); 1 end
  else begin (before2); 2 end
in
try
  if x = 1 then (after1) else (after2)
with exn -> (handler)

The version of flambda that is being worked on can propagate the information on x and rewrite this into:

catch
  if (condition) then
    (before1; pushtrap handler_trap; after1)
  else (before2; pushtrap handler_trap; after2);
  poptrap handler_trap
with handler_trap exn -> (handler)

(assuming x is not used anywhere else).
If any of the before pieces of code may raise, there is no way to represent this using a single try-with block.

However, we acknowledge that the use of explicit pushtrap and poptrap instructions is not ideal in otherwise structured intermediate representations.
So one solution we would be willing to try would be to scrap the explicit pushtrap and poptrap instructions that were added by this PR, but allow jump instructions (staticfail and exit) to be annotated with extra trap actions (push or pop). The try-with block would still need to disappear, since implicit and explicit trap manipulation do not mix well, but we could have each catch handler annotated with the exception stack it expects on entry, and the invariant that exception context cannot change except on entering a handler via an annotated jump. Knowing which context you are in becomes trivial: if you are a continuation handler, then your context is the one you are annotated with, otherwise it is the same as your parent sub-expression. Knowing where exception-raising primitives may flow is slightly less trivial, since it involves a look at the top of your exception stack and then a lookup of the corresponding continuation in all englobing handlers (compared to only the nearest one with try-with), but still reasonable.
Translation of try-with into this form is straight-forward:

try (body) with exn -> (handler)

becomes (with {} brackets denoting the exception stack on handlers and the trap action on exits, and ctx denotes the exception stack at the level of the try-with block):

catch_exn
  catch
    catch (exit {push T} 0 [])
    with 0 {T::ctx} [] -> exit {pop T} 1 [(body)]
  with 1 {ctx} [x] -> x
with T {ctx} exn -> (handler)

With the patch that puts body before handler in linearized code, the extra jumps are actually fallthrough and can be removed, so no overhead remains in the assembly.

Would such a solution be more acceptable to you ?

lthls · 2017-12-05T14:43:55Z

I've pushed an updated version that roughly implements the proposal in my previous comment.
The main differences:

I only annotate catch handlers starting from the Mach internal representation (via Trap_analysis).
I annotate Lstaticraise constructors in Lambda, which means this change now also affects bytecode compilation. Unlike its backend counterparts, Lstaticraise is not allowed to push traps (it would interfere with the try-with blocks).

The first choice is a simple matter of convenience: it means less modifications to the types, and doesn't prevent checking consistency earlier if one wishes so.
The second one is more important: the main point of adding annotations (besides giving more freedom to flambda) is to ensure that the backend knows exactly which jumps are local and which require extra instructions to adapt the exception stack. Since Lambda can contain non-local jumps, it makes sense to enforce annotations earlier and transmit them, rather than infer them from the context. This, combined with the consistency checker in Trap_analysis, allowed us to discover the bug reported in MPR 7680 and fixed in #1497.

A few comments on the commits themselves:

be13036 contains the main commit.
abe8908 adds an invariants checking pass on Cmm that was used to catch a few bugs; it was proposed independently in a simpler form in Cmm invariants #1400.
c1395f3 adds a test involving exceptions and dynlink, mostly to ensure that the patch on the power assembly emitter was tested a bit more in depth.
3845d52 shows a few improvements that could be made to bytegen.ml using the annotations.
99efa61 shows how the annotations allow for a cleaner fix for MPR 7680.

We're still not planning to submit the two initial changes independently, but otherwise every commit except the main one (and Changes, I guess) can be dropped if needed.

lthls · 2017-12-13T14:30:14Z

Benchmarks are finally up on the flambda benchmarks site here. You can find a specific comparison of this pull request with the trunk commit it's based on here.
The results show a very small but systematic increase in code size, very little impact on the memory footprint, and on average slightly slower execution times (with a lot of variations between benches).

gretay-js

The changes in amd64,i386,arm64,arm look fine to me. I don't know enough about the other targets to be confident.
I wonder about the choice of temporary registers for use in the code emitted for Lraise in amd64 (r11) and i386 (ebx). Is there any particular reason for choosing these registers specifically, and if so - perhaps it should be documented.

gasche · 2021-04-28T12:27:39Z

A few years later, what is the status of this PR? Are we still interested in going in this direction? (From a distance this sounds interesting, but I can't tell if the new design addresses the criticisms of Xavier on the first proposal.)

Personally I find the discussion hard to read as there was, if I understand correctly, a big change to the design mid-fly. I would be in favor of closing this PR, and asking @lthls to reopen a new PR that presents the final design directly.

lthls · 2021-04-28T12:33:54Z

The final design is in use on the Flambda 2 branch. I also have a branch that should contain only the patches related to the compilation of exceptions, but unless somebody outside the Flambda team is interested in reviewing it now it will get re-submitted when we start upstreaming the relevant Flambda 2 patches.

cgaebel reviewed Nov 16, 2017

View reviewed changes

alainfrisch mentioned this pull request Nov 30, 2017

Fix bad interaction between Simplif and optimized compilation of tuple binding #1497

Merged

lthls added 5 commits December 4, 2017 16:38

Annotate jumps with explicit trap actions

be13036

Add cmm_invariants module to check and document invariants

abe8908

Add native dynlink test with exceptions flowing both ways

c1395f3

Changes

6058906

.merlin: add asmcomp/debug directory

9a580b5

lthls force-pushed the exceptions_backend branch from 2e26e84 to 61b366b Compare December 4, 2017 17:29

lthls added 2 commits December 5, 2017 15:01

Extra: Adapt (mostly refactor) bytecode generation

3845d52

Extra: Rework fix for MPR#7680

99efa61

lthls force-pushed the exceptions_backend branch from 61b366b to 99efa61 Compare December 5, 2017 14:05

Fix arm emitter

ca98803

lthls mentioned this pull request Feb 6, 2019

Linearize: for Trywith, remove the jump/call to the handler #2237

Merged

gretay-js reviewed Feb 11, 2019

View reviewed changes

gretay-js mentioned this pull request Mar 13, 2019

Add pseudo-instruction Ladjust_trap_depth #2322

Merged

This was referenced Mar 14, 2019

Incorrect interaction between Matching.for_let and Simplif.simplify_exits #7680

Closed

cmm miscompile: cross-module inlining causes catch ID collision #7702

Closed

lthls closed this Apr 28, 2021

EmileTrotignon pushed a commit to EmileTrotignon/ocaml that referenced this pull request Jan 12, 2024

[create-pull-request] automated change (ocaml#1482)

76c9f60

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Modifications to the native compilation of exception handlers #1482

Modifications to the native compilation of exception handlers #1482

lthls commented Nov 15, 2017

cgaebel Nov 16, 2017

mshinwell Nov 16, 2017

xavierleroy commented Nov 18, 2017

lthls commented Nov 21, 2017

lthls commented Dec 5, 2017 •

edited

lthls commented Dec 13, 2017

gretay-js left a comment

gasche commented Apr 28, 2021

lthls commented Apr 28, 2021

Modifications to the native compilation of exception handlers #1482

Modifications to the native compilation of exception handlers #1482

Conversation

lthls commented Nov 15, 2017

Overview

Removal of Trywith constructs

Changes to the compilation of catch constructs

About separating the patches

cgaebel Nov 16, 2017

Choose a reason for hiding this comment

mshinwell Nov 16, 2017

Choose a reason for hiding this comment

xavierleroy commented Nov 18, 2017

lthls commented Nov 21, 2017

lthls commented Dec 5, 2017 • edited

lthls commented Dec 13, 2017

gretay-js left a comment

Choose a reason for hiding this comment

gasche commented Apr 28, 2021

lthls commented Apr 28, 2021

lthls commented Dec 5, 2017 •

edited