Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modifications to the native compilation of exception handlers #1482

Closed
wants to merge 8 commits into from

Conversation

lthls
Copy link
Contributor

@lthls lthls commented Nov 15, 2017

Overview

This pull request is extracted from the ongoing flambda branch, and contains two main features:

  • A change to the representation of try-with constructs from Clambda onwards
  • A change to the compilation of catch handlers (including exception handlers) that aims at improving
    performance in the exception case

Removal of Trywith constructs

This pull request replaces the Utrywith construct from Clambda, and replaces it with a more generalized Ucatch construct, and Upushtrap and Upoptrap to denote entering and exiting a try-with body.

The pushtrap/poptrap approach was already used late in the compilation chain, the reason for trying to push it as early as Clambda is that the ongoing work on flambda has shown the necessity to break the trywith blocks to do some optimizations, and reconstructing trywith blocks at the end of the middle-end is not always possible.

It would have been feasible to retain both the trywith and traps based constructs, but we feel this would only lead to more maintaining load (as some paths in the backend are only taken when flambda is enabled or disabled), and translation of a trywith construct to catch and traps is fairly straightforward (with a special case for exit/staticfail from within a trywith block to outside it; the logic that was present in the Iexit case in linearize.ml has been moved earlier to compensate).

To address the safety problems that such a change implies, three invariants checking modules have been added:

  • Cmm_invariants checks a few continuations-related invariants (on both catch/exit continuations and trap continuations) on the cmm representation. This was already proposed in a separate pull request (Cmm invariants #1400), but this pull request extends the invariants with trap-related ones and makes the invariant checks mandatory.
  • Trap_analysis annotates mach instructions with trap stack information. Since no pushtrap/poptrap instructions will be added by the backend, it also checks consistency on trap stacks at exit points. This should address the problems noted in an earlier discussion of the feature about the exception edges of the control flow graph not being explicit enough.
  • Linear_invariants checks again consistency of trap depths on the linear representation, in particular that extra exit instructions added during the transformation from mach are consistent with the rest of the program.

Changes to the compilation of catch constructs

When compiling a catch handler to linear instructions, the previous choice was to put the handler(s) first, then the actual body, with an extra jump statement to the body code added before the handlers.
This pull request inverts the order, so writes the body first, then the handlers, and of course adds an extra jump at the end of the body to the following code.

This was motivated by a suspicion that the use of call and return x86 instructions to respectively branch into the body and go back to the handler interfered with the prediction of returns that was done in the processor, causing branch misses. With this patch, going to the body is a simple fallthrough, going to the handler in the raise case is a pop followed by a jmp, and additional work is done at the beginning of the body to compute and store the address of the handler on the stack.

To give an idea of how much time this could save, we ran the following micro-benchmark:

let test_gen () =
  let count = ref 0 in
  fun () ->
    incr count;
    !count land 1 = 0

let test1 = test_gen ()

let test2 = test_gen ()

let do_it n =
  let f () =
    try if test1 () then raise Not_found else ()
    with Not_found when test2 () -> ()
  in
  for i = 1 to n do
    try f () with Not_found -> ()
  done

let _ =
  let default = 100_000 in
  let n =
    if Array.length Sys.argv > 1
    then
      try int_of_string Sys.argv.(1)
      with _ -> default
    else default
  in
  do_it n

through the following commands:

$ ./ocamlopt.opt -o microbench microbench.ml
$ perf stat ./microbench 1000000000

and we got the following results:
On trunk:


 Performance counter stats for './microbench 1000000000':

       9936.470484      task-clock (msec)         #    1.000 CPUs utilized          
                35      context-switches          #    0.004 K/sec                  
                 3      cpu-migrations            #    0.000 K/sec                  
                98      page-faults               #    0.010 K/sec                  
    34,612,388,085      cycles                    #    3.483 GHz                    
    46,758,631,959      instructions              #    1.35  insn per cycle         
     7,751,627,648      branches                  #  780.119 M/sec                  
       750,012,504      branch-misses             #    9.68% of all branches        

       9.937585757 seconds time elapsed

With this patch:


 Performance counter stats for './microbench 1000000000':

       4475.919299      task-clock (msec)         #    1.000 CPUs utilized          
                12      context-switches          #    0.003 K/sec                  
                 2      cpu-migrations            #    0.000 K/sec                  
                97      page-faults               #    0.022 K/sec                  
    15,382,062,427      cycles                    #    3.437 GHz                    
    49,754,449,689      instructions              #    3.23  insn per cycle         
     6,500,840,769      branches                  # 1452.403 M/sec                  
            21,738      branch-misses             #    0.00% of all branches        

       4.477019523 seconds time elapsed

The huge gains in execution time can be linked to the big difference in branch-misses, with roughly one branch-miss for every raise in the program.

Tests done on more complex programs are of course much less impressive but still show improvement, and more serious benchmarking is being done using the flambda benchmarks infrastructure. Results will be linked to in this thread once they are available.

In addition to the benchmarking, testing of the patched emitters has been done on all of the supported architectures. More precisely, amd64, arm, arm64, and power were tested on real hardware (by running the whole testsuite), i386 was tested on amd64 by configuring the compiler to produce 32-bits code, and s390x was tested through qemu (some of the lib-threads tests did not finish, likely due to the qemu scheduler, but after disabling them other tests passed fine).

About separating the patches

While the two changes were described separately, they are submitted in a single pull request because they correspond to the exceptions-related changes in the flambda branch that are not dependent on the middle-end. We would definitely prefer to merge them together, rather than separately, and we have not tested the impact of one part without the other.

@@ -171,6 +171,13 @@ let emit_Llabel fallthrough lbl =
if not fallthrough && !fastcode_flag then D.align 4;
def_label lbl

let load_label_addr s arg =
(* CR mshinwell: this needs more testing *)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this patch supposed to still have CRs in it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still needs to be checked on the various targets (and target variants) to make sure that the correct instruction sequence is being generated.

@xavierleroy
Copy link
Contributor

I'm in favor of changing the linearization of try...with constructs so that no call instruction is used and the body is in fallthrough-unconditional branch position.

I'm absolutely not convinced that the current, structured representation of try...with should be replaced by an unstructured representation based on push-handler and pop-handler. You're losing structure, which is rarely if ever a good thing, and you're not gaining anything as far as I can see.

@lthls
Copy link
Contributor Author

lthls commented Nov 21, 2017

The restrictions of the try-with structure can be shown of the following example:

let x =
  if (condition) then
    begin (before1); 1 end
  else begin (before2); 2 end
in
try
  if x = 1 then (after1) else (after2)
with exn -> (handler)

The version of flambda that is being worked on can propagate the information on x and rewrite this into:

catch
  if (condition) then
    (before1; pushtrap handler_trap; after1)
  else (before2; pushtrap handler_trap; after2);
  poptrap handler_trap
with handler_trap exn -> (handler)

(assuming x is not used anywhere else).
If any of the before pieces of code may raise, there is no way to represent this using a single try-with block.

However, we acknowledge that the use of explicit pushtrap and poptrap instructions is not ideal in otherwise structured intermediate representations.
So one solution we would be willing to try would be to scrap the explicit pushtrap and poptrap instructions that were added by this PR, but allow jump instructions (staticfail and exit) to be annotated with extra trap actions (push or pop). The try-with block would still need to disappear, since implicit and explicit trap manipulation do not mix well, but we could have each catch handler annotated with the exception stack it expects on entry, and the invariant that exception context cannot change except on entering a handler via an annotated jump. Knowing which context you are in becomes trivial: if you are a continuation handler, then your context is the one you are annotated with, otherwise it is the same as your parent sub-expression. Knowing where exception-raising primitives may flow is slightly less trivial, since it involves a look at the top of your exception stack and then a lookup of the corresponding continuation in all englobing handlers (compared to only the nearest one with try-with), but still reasonable.
Translation of try-with into this form is straight-forward:

try (body) with exn -> (handler)

becomes (with {} brackets denoting the exception stack on handlers and the trap action on exits, and ctx denotes the exception stack at the level of the try-with block):

catch_exn
  catch
    catch (exit {push T} 0 [])
    with 0 {T::ctx} [] -> exit {pop T} 1 [(body)]
  with 1 {ctx} [x] -> x
with T {ctx} exn -> (handler)

With the patch that puts body before handler in linearized code, the extra jumps are actually fallthrough and can be removed, so no overhead remains in the assembly.

Would such a solution be more acceptable to you ?

@lthls
Copy link
Contributor Author

lthls commented Dec 5, 2017

I've pushed an updated version that roughly implements the proposal in my previous comment.
The main differences:

  • I only annotate catch handlers starting from the Mach internal representation (via Trap_analysis).
  • I annotate Lstaticraise constructors in Lambda, which means this change now also affects bytecode compilation. Unlike its backend counterparts, Lstaticraise is not allowed to push traps (it would interfere with the try-with blocks).

The first choice is a simple matter of convenience: it means less modifications to the types, and doesn't prevent checking consistency earlier if one wishes so.
The second one is more important: the main point of adding annotations (besides giving more freedom to flambda) is to ensure that the backend knows exactly which jumps are local and which require extra instructions to adapt the exception stack. Since Lambda can contain non-local jumps, it makes sense to enforce annotations earlier and transmit them, rather than infer them from the context. This, combined with the consistency checker in Trap_analysis, allowed us to discover the bug reported in MPR 7680 and fixed in #1497.

A few comments on the commits themselves:

  • be13036 contains the main commit.
  • abe8908 adds an invariants checking pass on Cmm that was used to catch a few bugs; it was proposed independently in a simpler form in Cmm invariants #1400.
  • c1395f3 adds a test involving exceptions and dynlink, mostly to ensure that the patch on the power assembly emitter was tested a bit more in depth.
  • 3845d52 shows a few improvements that could be made to bytegen.ml using the annotations.
  • 99efa61 shows how the annotations allow for a cleaner fix for MPR 7680.

We're still not planning to submit the two initial changes independently, but otherwise every commit except the main one (and Changes, I guess) can be dropped if needed.

@lthls
Copy link
Contributor Author

lthls commented Dec 13, 2017

Benchmarks are finally up on the flambda benchmarks site here. You can find a specific comparison of this pull request with the trunk commit it's based on here.
The results show a very small but systematic increase in code size, very little impact on the memory footprint, and on average slightly slower execution times (with a lot of variations between benches).

Copy link
Contributor

@gretay-js gretay-js left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes in amd64,i386,arm64,arm look fine to me. I don't know enough about the other targets to be confident.
I wonder about the choice of temporary registers for use in the code emitted for Lraise in amd64 (r11) and i386 (ebx). Is there any particular reason for choosing these registers specifically, and if so - perhaps it should be documented.

@gasche
Copy link
Member

gasche commented Apr 28, 2021

A few years later, what is the status of this PR? Are we still interested in going in this direction? (From a distance this sounds interesting, but I can't tell if the new design addresses the criticisms of Xavier on the first proposal.)

Personally I find the discussion hard to read as there was, if I understand correctly, a big change to the design mid-fly. I would be in favor of closing this PR, and asking @lthls to reopen a new PR that presents the final design directly.

@lthls
Copy link
Contributor Author

lthls commented Apr 28, 2021

The final design is in use on the Flambda 2 branch. I also have a branch that should contain only the patches related to the compilation of exceptions, but unless somebody outside the Flambda team is interested in reviewing it now it will get re-submitted when we start upstreaming the relevant Flambda 2 patches.

@lthls lthls closed this Apr 28, 2021
EmileTrotignon pushed a commit to EmileTrotignon/ocaml that referenced this pull request Jan 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants