
Safepoints #3

Closed
wants to merge 3 commits

Conversation

damiendoligez

This is a first draft of code to compute when to insert polls. It needs to be optimized for two things (and I'm not sure how to do both at once):

  • this code goes over the same instr repeatedly in case of nested loops (because instr_body calls polls_unconditionally each time)
  • path_approx is not tail-recursive, so it can overflow the stack on long sequences

@sadiqj
Owner

sadiqj commented Mar 30, 2021

Thanks so much for taking a pass at this @damiendoligez. I've got one last PR to close on multicore and then by midweek I should be back to focusing on safepoints.

Have you had any thoughts about how we'd actually test that the implementation is correct? I started pondering an analysis at the end of the Mach passes (before Linear) that would check that there is a poll on every path through the function, but I end up with something that looks awfully similar to the original placement analysis, and I'm not convinced I wouldn't have the same bug in each.

@sadiqj
Owner

sadiqj commented Apr 6, 2021

In path_approx could we cache the approximation for each handler? We have a unique id for each one and since we go through loops outwards-in that should ensure we only process each handler once.

I'm not sure how to approach making path_approx tail recursive though.
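The per-handler caching idea could be sketched roughly as follows. This is a hypothetical shape, not the actual patch: `analyze` stands in for the real `path_approx` work on a handler body, and `bool` stands in for the real approximation type.

```ocaml
(* Hypothetical sketch: memoize per-handler approximations keyed on the
   handler's unique id, so each handler body is analyzed only once.
   [analyze] and the [bool] result type are stand-ins for the real
   path_approx machinery. *)
let handler_cache : (int, bool) Hashtbl.t = Hashtbl.create 17

let approx_handler ~analyze id body =
  match Hashtbl.find_opt handler_cache id with
  | Some a -> a                        (* already computed: reuse it *)
  | None ->
      let a = analyze body in          (* first visit: compute once *)
      Hashtbl.add handler_cache id a;
      a
```

Since loops are visited outwards-in, the first visit to a handler fills the cache and any nested revisit becomes a constant-time lookup.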

@damiendoligez
Author

In path_approx could we cache the approximation for each handler? We have a unique id for each one and since we go through loops outwards-in that should ensure we only process each handler once.

Yes, we need to do that.

I'm not sure how to approach making path_approx tail recursive though.

The most important thing is to make it tail-recursive for sequences. My plan is to make seq_approx tail-call into path_approx by adding an accumulator argument (to be joined with the result) to path_approx.
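As a rough illustration of that plan, here is the accumulator transformation on a toy version of the analysis. Both the `instr` type and the `join` function are hypothetical stand-ins for the real Mach instructions and lattice join.

```ocaml
(* Toy sketch of the accumulator transformation. [instr] and [join] are
   stand-ins for the real Mach instructions and approximation join. *)
type instr = Poll | Other

let join a b = a || b

(* Before: the join happens after the recursive call returns, so the
   stack grows with the length of the sequence. *)
let rec path_approx = function
  | [] -> false
  | Poll :: rest -> join true (path_approx rest)
  | Other :: rest -> path_approx rest

(* After: the partial join is carried in [acc], and the recursive call
   is in tail position, so the stack stays constant. *)
let rec path_approx_acc acc = function
  | [] -> acc
  | Poll :: rest -> path_approx_acc (join acc true) rest
  | Other :: rest -> path_approx_acc acc rest
```

The accumulated value is joined as the traversal goes, so the two versions compute the same result while only the second can handle arbitrarily long sequences.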

@damiendoligez
Author

Have you had any thoughts about how we'd actually test the implementation is correct? I started pondering an analysis at the end of the Mach passes (before Linear) that would test there was a poll on every path through the function but I end up with something that looks awfully similar to the original placement analysis and I'm not convinced I wouldn't have the same bug in each.

If we find a way to instrument an executable to count the time (or number of cycles or instructions) between successive calls to alloc/poll, we could just run that on any number of real programs as well as some synthetic tests, then look for outliers in the distribution. Maybe there's a way to do that with perf? We'd have to find a way to ignore the calls to C code though.

@sadiqj
Owner

sadiqj commented Apr 14, 2021

If we find a way to instrument an executable to count the time (or number of cycles or instructions) between successive calls to alloc/poll, we could just run that on any number of real programs as well as some synthetic tests, then look for outliers in the distribution. Maybe there's a way to do that with perf? We'd have to find a way to ignore the calls to C code though.

We can read the retired instructions perf counter from inside an OCaml signal handler that occurs very frequently (maybe via SIGALRM) - those should be serviced much more regularly with safepoints. I'm not quite sure how to ignore calls to C though.

There are two problems with this approach though. First, the polling strategy we're using has some known (but unlikely) situations where it can leave long periods between polls, trading that for removing most polls. Consider a long sequence of non-allocating functions that each non-tail-recursively call other non-allocating functions, right up to the call stack limit. So a correct analysis could still lead to situations where there is a large number of retired instructions between polls.

Second, and the issue I had on the previous analysis, is that we're only seeing the poll placement on IR that the compiler currently generates. I'd really like a way to test that the analysis is correct for the semantics of the Mach IR rather than what the compiler is currently generating.

@damiendoligez
Author

Rebased on latest version of ocaml#10039

@damiendoligez
Author

@xavierleroy You might want to review this.


@xavierleroy left a comment


I will need more time (and a private discussion with the author) to fully review this code, but what I've read and understood so far looks nice.

I was surprised to see no fixpoint iteration in the recursive handler case. These handlers can encode loops, and dataflow analyses like the one here generally need fixpoint iteration to handle loops.

Speaking of dataflow analyses, I am tempted to write a generic dataflow engine for Mach, and instantiate it here and in other parts of the backend. But maybe not right now.

@xavierleroy

xavierleroy commented May 2, 2021

So, I put on my dataflow analysis glasses, and it led me to guess the following.

We have a 3-point lattice Alloc < Return < PRTC. (I don't think the Exit values are necessary if we do a classic dataflow analysis.)

We have a backward analysis: the behavior of a basic instruction is a function of the instruction and of the behavior "after" the instruction (= the lub of the behaviors of the successors)

IN(i) = TRANSF(i, OUT(i))
OUT(i) = lub { IN(s) | s successor of i }

The transfer function is

TRANSF(Ialloc | Ipoll, after) = Alloc
TRANSF(Ireturn | Iraise, after) = Return
TRANSF(Itailcall_ind, after) = PRTC
TRANSF(Itailcall_imm f, after) = PRTC or Return depending on function f
TRANSF(anything else, after) = after
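For concreteness, this lattice and transfer function could be transcribed as the following sketch, where the instruction constructors are simplified stand-ins for the real Mach ones (the `may_be_recursive` field is a hypothetical encoding of "depending on function f"):

```ocaml
(* Sketch of the 3-point lattice Alloc < Return < PRTC and the backward
   transfer function above. Constructors are simplified stand-ins for
   the real Mach instructions. *)
type behaviour = Alloc | Return | PRTC

(* Least upper bound in the lattice Alloc < Return < PRTC. *)
let lub a b =
  match a, b with
  | PRTC, _ | _, PRTC -> PRTC
  | Return, _ | _, Return -> Return
  | Alloc, Alloc -> Alloc

type instr =
  | Ialloc | Ipoll
  | Ireturn | Iraise
  | Itailcall_ind
  | Itailcall_imm of { may_be_recursive : bool }  (* depends on f *)
  | Iother

(* IN(i) = TRANSF(i, OUT(i)), where [after] is OUT(i). *)
let transf i after =
  match i with
  | Ialloc | Ipoll -> Alloc
  | Ireturn | Iraise -> Return
  | Itailcall_ind -> PRTC
  | Itailcall_imm { may_be_recursive } ->
      if may_be_recursive then PRTC else Return
  | Iother -> after
```

OUT of an instruction would then be the `lub` of the IN values of its successors, computed by the usual backward traversal.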

Now we solve these equations by a traversal of the Mach code and fixpoint iteration for the recursive handlers, obtaining

  • the behavior of the function entry point
  • the behavior of the entry point of every handler.

Then we insert polling points

  • at function entry point if its behavior is PRTC
  • at every Iexit n that may be a back edge if the behavior of handler n is not Alloc.

Is this the intended analysis and transformation, or something else?

@xavierleroy

Let's not forget Icheckbound and other operations that can raise:

TRANSF(operation that can raise, after) = lub Return after

@sadiqj
Owner

sadiqj commented May 3, 2021

I was surprised to see no fixpoint iteration in the recursive handler case. These handlers can encode loops, and dataflow analyses like the one here generally need fixpoint iteration to handle loops.

I've been pondering where an iterative solution would differ from the current analysis. For a recursive handler exiting to itself I don't think there is a difference, I'm less sure about mutually recursive handlers though.

@sadiqj
Owner

sadiqj commented May 3, 2021

Is this the intended analysis and transformation, or something else?

I think this covers it, yes.

@xavierleroy

@damiendoligez and I re-discussed at the whiteboard. My dataflow vision was wrong in that it didn't correctly flag infinite loops without polling points, something that @damiendoligez's code tried to represent as Exit Intset.empty. Also, we agreed that the Return abstract value is not needed. In the end, we obtained the following alternate specification.

To every program point P we associate two Booleans b1, b2.

b1 is true if every path starting from P goes through an Ialloc, Ipoll, Ireturn, Iraise, Itailcall_ind or Itailcall_imm instruction.

b1 is false, therefore, if starting from P we can loop infinitely without crossing an Ialloc or Ipoll instruction.

b2 is true if there exists a path from P to a Potentially Recursive Tail Call (an Itailcall_ind or Itailcall_imm to a "forward function") that does not go through an Ialloc or Ipoll instruction.

b2 is false, therefore, if starting from P we always poll (via Ialloc or Ipoll) before doing a PRTC.

We use b1 to insert a poll on every back-edge Iexit n to a recursive handler n that has b1 = false.

We use b2 to insert a poll in the function prologue if the function body has b2 = true.

To compute b1, we do a backward dataflow analysis starting from false, using && (Boolean "and") as the join operator, and with the following transfer function:

TRANSF1(Ialloc | Ipoll | Itailcall_ind | Itailcall_imm _ | Ireturn | Iraise, x) = true
TRANSF1(all other operations, x) = x

The analysis of loops (recursive handlers) can stop at any time, computing an approximate but semantically correct result.

To compute b2, we do a backward dataflow analysis starting from false, using || (Boolean "or") as the join operator, and with the following transfer function:

TRANSF2(Ialloc | Ipoll, x) = false
TRANSF2(Itailcall_ind, x) = true
TRANSF2(Itailcall_imm f, x) = f is a forward function
TRANSF2(all other operations, x) = x

The analysis of loops must reach a fixed point, otherwise the result is not semantically correct.

The two Booleans b1, b2 can be computed simultaneously or via separate static analyses.
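The two transfer functions above can be transcribed as the following sketch, folded backwards over a straight-line sequence. The instruction type is a simplified, hypothetical stand-in for Mach; `forward` marks an Itailcall_imm to a forward function.

```ocaml
(* Sketch of the two Boolean analyses described above. Constructors are
   simplified stand-ins for the real Mach instructions. *)
type instr =
  | Ialloc | Ipoll
  | Ireturn | Iraise
  | Itailcall_ind
  | Itailcall_imm of { forward : bool }
  | Iother

(* TRANSF1: b1 holds when every path reaches an alloc, poll, return,
   raise or tail call. Join is (&&); the starting value is false. *)
let transf1 i x =
  match i with
  | Ialloc | Ipoll | Itailcall_ind | Itailcall_imm _ | Ireturn | Iraise -> true
  | Iother -> x

(* TRANSF2: b2 holds when some path reaches a PRTC without crossing an
   alloc or poll. Join is (||); the starting value is false. *)
let transf2 i x =
  match i with
  | Ialloc | Ipoll -> false
  | Itailcall_ind -> true
  | Itailcall_imm { forward } -> forward
  | Ireturn | Iraise | Iother -> x

(* Both Booleans computed simultaneously, backwards over the sequence. *)
let analyze instrs =
  List.fold_right
    (fun i (b1, b2) -> (transf1 i b1, transf2 i b2))
    instrs (false, false)
```

On a loop-free sequence a single backward pass suffices; for recursive handlers, as noted above, b1 may stop early with a safe approximation while b2 must be iterated to a fixed point.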

@xavierleroy

I know this is getting ridiculous, but I submitted a PR against the PR which itself is a PR: damiendoligez#6

@damiendoligez
Author

@xavierleroy I've updated this PR with your code and some changes. There is still a segfault in the test that I need to figure out.

@sadiqj
Owner

sadiqj commented May 6, 2021

I've also rebased the main PR against trunk. There's been some code motion in the emitters since the last rebase.

@xavierleroy

@xavierleroy I've updated this PR with your code and some changes.

Thanks! Now I can close my ridiculous PR.

We still need to resurrect your clever poll insertion strategy for recursive handlers, which I broke without really realizing.

@damiendoligez
Author

We still need to resurrect your clever poll insertion strategy for recursive handlers, which I broke without really realizing.

I don't think it's needed anymore: with the dataflow analysis and insertion at the top, a poll is only inserted inside a nonallocating loop, so it can't be dominated by an allocation.

@xavierleroy

Is there no performance penalty with polling at the beginning of a recursive handler rather than on the back-edge? ocaml#10039 assumes polling on back edges, and adds complications to the code emitters so that these polls produce good code. If we poll at beginning, we could perhaps simplify the code emitters.

@damiendoligez
Author

damiendoligez commented May 7, 2021

There can't be a difference in performance between this and your version, because one is polling just before the jump and the other is polling just after the jump.

I think the question boils down to: is there a performance penalty for polling at the top of the loop body rather than the bottom? Maybe the difference can be noticed if your loops exit after a small number of rounds. In this case, indeed we should take care of annotating back edges (and not forward jumps to "unsafe" handlers).

Concerning the segfault, maybe you can help me out: I have determined that it comes from a stack misalignment (it gets to 8 modulo 16, breaking a movapd instruction later within the GC) when calling caml_call_gc from a polling point. I'll try to understand what's going on, but it'll probably be obvious to you.
(edit: see my latest findings here: ocaml#10039 (comment))

@xavierleroy

I think the question boils down to: is there a performance penalty for polling at the top of the loop body rather than the bottom? Maybe the difference can be noticed if your loops exit after a small number of rounds. In this case, indeed we should take care of annotating back edges (and not forward jumps to "unsafe" handlers).

@stedolan pointed out some odd behaviors of branches in recent Intel processors, so I thought that maybe poll-at-top and poll-at-bottom could make a difference here. But if there is no such difference, poll-at-top could simplify the code emitters (no more special case for poll followed by jump), so let's go for poll-at-top.

I'm not worried about loops that exit after a small number of rounds. The general wisdom is that it's not worth optimizing for this case.

@sadiqj
Owner

sadiqj commented May 7, 2021

I've removed the original comment, I see it was discussed earlier on. I'm losing track of the PR conversations.

The only cost I see in moving from polling-at-bottom to polling-at-top is an additional not-taken branch, which is folded in ocaml#10039 (at the cost of a bit more complexity in the emitters).

Copy link

@xavierleroy left a comment


Hold the press, we're optimizing the static analyses too much...

let rec before i end_ exn =
  match i.desc with
  | Iend -> end_
  | Iop (Ialloc _ | Ipoll _ | Itailcall_ind | Itailcall_imm _) -> true


I'm afraid this "shortcut" is not right. By not computing before i.next end_ exn, we omit to analyze loops that occur after the Ialloc instruction. These loops will have the default "false" abstract value, hence a polling point will always be inserted in these loops.

Using a pedestrian dataflow engine that always goes over all the instructions (ocaml#10404) may be a good idea, after all.

Comment on lines +180 to +187
| Icatch (Cmm.Recursive, handlers, body) ->
    let join = before i.next end_ exn in
    let update changed (n, h) =
      let b0 = get_handler n in
      let b1 = before h join exn in
      if b1 = b0 then changed else (set_handler n b1; true) in
    while List.fold_left update false handlers do () done;
    before body join exn


I wonder if we really need a fixpoint iteration here. We're looking for paths from entry to a PTRC that don't go through an Ialloc or Ipoll. If one such path exists, a loop-free path (one that does not go through a loop back-edge) exists too?

@damiendoligez
Author

This PR is superseded by #4
