
Safepoints #3

Closed
wants to merge 3 commits

Conversation

damiendoligez

This is a first draft of code to compute when to insert polls. It needs to be optimized for two things (and I'm not sure how to do both at once):

  • this code goes over the same instr repeatedly in case of nested loops (because instr_body calls polls_unconditionally each time)
  • path_approx is not tail-recursive, so it can overflow the stack on long sequences

@sadiqj
Owner

sadiqj commented Mar 30, 2021

Thanks so much for taking a pass at this @damiendoligez. I've got one last PR to close on multicore and then by midweek I should be back to focusing on safepoints.

Have you had any thoughts about how we'd actually test that the implementation is correct? I started pondering an analysis at the end of the Mach passes (before Linear) that would check that there is a poll on every path through the function, but I end up with something that looks awfully similar to the original placement analysis, and I'm not convinced I wouldn't have the same bug in each.

@sadiqj
Owner

sadiqj commented Apr 6, 2021

In path_approx could we cache the approximation for each handler? We have a unique id for each one and since we go through loops outwards-in that should ensure we only process each handler once.

I'm not sure how to approach making path_approx tail recursive though.
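The per-handler caching idea could be sketched roughly as follows. This is a hypothetical shape, not the actual patch: `analyze` stands in for the real `path_approx` work on a handler body, and `bool` stands in for the real approximation type.

```ocaml
(* Hypothetical sketch: memoize per-handler approximations keyed on the
   handler's unique id, so each handler body is analyzed only once.
   [analyze] and the [bool] result type are stand-ins for the real
   path_approx machinery. *)
let handler_cache : (int, bool) Hashtbl.t = Hashtbl.create 17

let approx_handler ~analyze id body =
  match Hashtbl.find_opt handler_cache id with
  | Some a -> a                        (* already computed: reuse it *)
  | None ->
      let a = analyze body in          (* first visit: compute once *)
      Hashtbl.add handler_cache id a;
      a
```

Since loops are visited outwards-in, the first visit to a handler fills the cache and any nested revisit becomes a constant-time lookup.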

@damiendoligez
Author

In path_approx could we cache the approximation for each handler? We have a unique id for each one and since we go through loops outwards-in that should ensure we only process each handler once.

Yes, we need to do that.

I'm not sure how to approach making path_approx tail recursive though.

The most important thing is to make it tail-recursive for sequences. My plan is to make seq_approx tail-call into path_approx by adding an accumulator argument (to be joined with the result) to path_approx.
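As a rough illustration of that plan, here is the accumulator transformation on a toy version of the analysis. Both the `instr` type and the `join` function are hypothetical stand-ins for the real Mach instructions and lattice join.

```ocaml
(* Toy sketch of the accumulator transformation. [instr] and [join] are
   stand-ins for the real Mach instructions and approximation join. *)
type instr = Poll | Other

let join a b = a || b

(* Before: the join happens after the recursive call returns, so the
   stack grows with the length of the sequence. *)
let rec path_approx = function
  | [] -> false
  | Poll :: rest -> join true (path_approx rest)
  | Other :: rest -> path_approx rest

(* After: the partial join is carried in [acc], and the recursive call
   is in tail position, so the stack stays constant. *)
let rec path_approx_acc acc = function
  | [] -> acc
  | Poll :: rest -> path_approx_acc (join acc true) rest
  | Other :: rest -> path_approx_acc acc rest
```

The accumulated value is joined as the traversal goes, so the two versions compute the same result while only the second can handle arbitrarily long sequences.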

@damiendoligez
Author

Have you had any thoughts about how we'd actually test the implementation is correct? I started pondering an analysis at the end of the Mach passes (before Linear) that would test there was a poll on every path through the function but I end up with something that looks awfully similar to the original placement analysis and I'm not convinced I wouldn't have the same bug in each.

If we find a way to instrument an executable to count the time (or number of cycles or instructions) between successive calls to alloc/poll, we could just run that on any number of real programs as well as some synthetic tests, then look for outliers in the distribution. Maybe there's a way to do that with perf? We'd have to find a way to ignore the calls to C code though.

@sadiqj
Owner

sadiqj commented Apr 14, 2021

If we find a way to instrument an executable to count the time (or number of cycles or instructions) between successive calls to alloc/poll, we could just run that on any number of real programs as well as some synthetic tests, then look for outliers in the distribution. Maybe there's a way to do that with perf? We'd have to find a way to ignore the calls to C code though.

We can read the retired instructions perf counter from inside an OCaml signal handler that occurs very frequently (maybe via SIGALRM) - those should be serviced much more regularly with safepoints. I'm not quite sure how to ignore calls to C though.

There are two problems with this approach though. First, the polling strategy we're using has some known (but unlikely) situations where it can leave long periods between polls, trading that for removing most polls. Consider a long sequence of non-allocating functions that each non-tail-recursively call other non-allocating functions, right up to the call stack limit. So a correct analysis could still lead to situations where there is a large number of retired instructions between polls.

Second, and the issue I had on the previous analysis, is that we're only seeing the poll placement on IR that the compiler currently generates. I'd really like a way to test that the analysis is correct for the semantics of the Mach IR rather than what the compiler is currently generating.

@damiendoligez
Author

Rebased on latest version of ocaml#10039

@damiendoligez
Author

@xavierleroy You might want to review this.


@xavierleroy left a comment


I will need more time (and a private discussion with the author) to fully review this code, but what I've read and understood so far looks nice.

I was surprised to see no fixpoint iteration in the recursive handler case. These handlers can encode loops, and dataflow analyses like the one here generally need fixpoint iteration to handle loops.

Speaking of dataflow analyses, I am tempted to write a generic dataflow engine for Mach, and instantiate it here and in other parts of the backend. But maybe not right now.

@xavierleroy

xavierleroy commented May 2, 2021

So, I put on my dataflow analysis glasses, and it led me to guess the following.

We have a 3-point lattice Alloc < Return < PRTC. (I don't think the Exit values are necessary if we do a classic dataflow analysis.)

We have a backward analysis: the behavior of a basic instruction is a function of the instruction and of the behavior "after" the instruction (= the lub of the behaviors of the successors)

IN(i) = TRANSF(i, OUT(i))
OUT(i) = lub { IN(s) | s successor of i }

The transfer function is

TRANSF(Ialloc | Ipoll, after) = Alloc
TRANSF(Ireturn | Iraise, after) = Return
TRANSF(Itailcall_ind, after) = PRTC
TRANSF(Itailcall_imm f, after) = PRTC or Return depending on function f
TRANSF(anything else, after) = after
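For concreteness, this lattice and transfer function could be transcribed as the following sketch, where the instruction constructors are simplified stand-ins for the real Mach ones (the `may_be_recursive` field is a hypothetical encoding of "depending on function f"):

```ocaml
(* Sketch of the 3-point lattice Alloc < Return < PRTC and the backward
   transfer function above. Constructors are simplified stand-ins for
   the real Mach instructions. *)
type behaviour = Alloc | Return | PRTC

(* Least upper bound in the lattice Alloc < Return < PRTC. *)
let lub a b =
  match a, b with
  | PRTC, _ | _, PRTC -> PRTC
  | Return, _ | _, Return -> Return
  | Alloc, Alloc -> Alloc

type instr =
  | Ialloc | Ipoll
  | Ireturn | Iraise
  | Itailcall_ind
  | Itailcall_imm of { may_be_recursive : bool }  (* depends on f *)
  | Iother

(* IN(i) = TRANSF(i, OUT(i)), where [after] is OUT(i). *)
let transf i after =
  match i with
  | Ialloc | Ipoll -> Alloc
  | Ireturn | Iraise -> Return
  | Itailcall_ind -> PRTC
  | Itailcall_imm { may_be_recursive } ->
      if may_be_recursive then PRTC else Return
  | Iother -> after
```

OUT of an instruction would then be the `lub` of the IN values of its successors, computed by the usual backward traversal.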

Now we solve these equations by a traversal of the Mach code and fixpoint iteration for the recursive handlers, obtaining

  • the behavior of the function entry point
  • the behavior of the entry point of every handler.

Then we insert polling points

  • at function entry point if its behavior is PRTC
  • at every Iexit n that may be a back edge if the behavior of handler n is not Alloc.

Is this the intended analysis and transformation, or something else?

@xavierleroy

Let's not forget Icheckbound and other operations that can raise:

TRANSF(operation that can raise, after) = lub Return after

@sadiqj
Owner

sadiqj commented May 3, 2021

I was surprised to see no fixpoint iteration in the recursive handler case. These handlers can encode loops, and dataflow analyses like the one here generally need fixpoint iteration to handle loops.

I've been pondering where an iterative solution would differ from the current analysis. For a recursive handler exiting to itself I don't think there is a difference, I'm less sure about mutually recursive handlers though.

@sadiqj
Owner

sadiqj commented May 3, 2021

Is this the intended analysis and transformation, or something else?

I think this covers it, yes.

@xavierleroy

@damiendoligez and I re-discussed at the whiteboard. My dataflow vision was wrong in that it didn't correctly flag infinite loops without polling points, something that @damiendoligez's code tried to represent as Exit Intset.empty. Also, we agreed that the Return abstract value is not needed. In the end, we obtained the following alternate specification.

To every program point P we associate two Booleans b1, b2.

b1 is true if every path starting from P goes through an Ialloc, Ipoll, Ireturn, Iraise, Itailcall_ind or Itailcall_imm instruction.

b1 is false, therefore, if starting from P we can loop infinitely without crossing an Ialloc or Ipoll instruction.

b2 is true if there exists a path from P to a Potentially Recursive Tail Call (an Itailcall_ind or Itailcall_imm to a "forward function") that does not go through an Ialloc or Ipoll instruction.

b2 is false, therefore, if starting from P we always poll (via Ialloc or Ipoll) before doing a PRTC.

We use b1 to insert a poll on every back-edge Iexit n to a recursive handler n that has b1 = false.

We use b2 to insert a poll in the function prologue if the function body has b2 = true.

To compute b1, we do a backward dataflow analysis starting from false, using && (Boolean "and") as the join operator, and with the following transfer function:

TRANSF1(Ialloc | Ipoll | Itailcall_ind | Itailcall_imm _ | Ireturn | Iraise, x) = true
TRANSF1(all other operations, x) = x

The analysis of loops (recursive handlers) can stop at any time, computing an approximate but semantically correct result.

To compute b2, we do a backward dataflow analysis starting from false, using || (Boolean "or") as the join operator, and with the following transfer function:

TRANSF2(Ialloc | Ipoll, x) = false
TRANSF2(Itailcall_ind, x) = true
TRANSF2(Itailcall_imm f, x) = f is a forward function
TRANSF2(all other operations, x) = x

The analysis of loops must reach a fixed point, otherwise the result is not semantically correct.

The two Booleans b1, b2 can be computed simultaneously or via separate static analyses.
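The two transfer functions above can be transcribed as the following sketch, folded backwards over a straight-line sequence. The instruction type is a simplified, hypothetical stand-in for Mach; `forward` marks an Itailcall_imm to a forward function.

```ocaml
(* Sketch of the two Boolean analyses described above. Constructors are
   simplified stand-ins for the real Mach instructions. *)
type instr =
  | Ialloc | Ipoll
  | Ireturn | Iraise
  | Itailcall_ind
  | Itailcall_imm of { forward : bool }
  | Iother

(* TRANSF1: b1 holds when every path reaches an alloc, poll, return,
   raise or tail call. Join is (&&); the starting value is false. *)
let transf1 i x =
  match i with
  | Ialloc | Ipoll | Itailcall_ind | Itailcall_imm _ | Ireturn | Iraise -> true
  | Iother -> x

(* TRANSF2: b2 holds when some path reaches a PRTC without crossing an
   alloc or poll. Join is (||); the starting value is false. *)
let transf2 i x =
  match i with
  | Ialloc | Ipoll -> false
  | Itailcall_ind -> true
  | Itailcall_imm { forward } -> forward
  | Ireturn | Iraise | Iother -> x

(* Both Booleans computed simultaneously, backwards over the sequence. *)
let analyze instrs =
  List.fold_right
    (fun i (b1, b2) -> (transf1 i b1, transf2 i b2))
    instrs (false, false)
```

On a loop-free sequence a single backward pass suffices; for recursive handlers, as noted above, b1 may stop early with a safe approximation while b2 must be iterated to a fixed point.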

@xavierleroy

I know this is getting ridiculous, but I submitted a PR against the PR which itself is a PR: damiendoligez#6

@damiendoligez
Author

@xavierleroy I've updated this PR with your code and some changes. There is still a segfault in the test that I need to figure out.

@sadiqj
Owner

sadiqj commented May 6, 2021

I've also rebased the main PR against trunk. There's been some code motion in the emitters since the last rebase.

@xavierleroy

@xavierleroy I've updated this PR with your code and some changes.

Thanks! Now I can close my ridiculous PR.

We still need to resurrect your clever poll insertion strategy for recursive handlers, which I broke without really realizing.

@damiendoligez
Author

We still need to resurrect your clever poll insertion strategy for recursive handlers, which I broke without really realizing.

I don't think it's needed anymore: with the dataflow analysis and insertion at the top, a poll is only inserted inside a nonallocating loop, so it can't be dominated by an allocation.

@xavierleroy

Is there no performance penalty with polling at the beginning of a recursive handler rather than on the back-edge? ocaml#10039 assumes polling on back edges, and adds complications to the code emitters so that these polls produce good code. If we poll at beginning, we could perhaps simplify the code emitters.

@damiendoligez
Author

damiendoligez commented May 7, 2021

There can't be a difference in performance between this and your version, because one is polling just before the jump and the other is polling just after the jump.

I think the question boils down to: is there a performance penalty for polling at the top of the loop body rather than the bottom? Maybe the difference can be noticed if your loops exit after a small number of rounds. In this case, indeed we should take care of annotating back edges (and not forward jumps to "unsafe" handlers).

Concerning the segfault, maybe you can help me out: I have determined that it comes from a stack misalignment (it gets to 8 modulo 16, breaking a movapd instruction later within the GC) when calling caml_call_gc from a polling point. I'll try to understand what's going on, but it'll probably be obvious to you.
(edit: see my latest findings here: ocaml#10039 (comment))

@xavierleroy

I think the question boils down to: is there a performance penalty for polling at the top of the loop body rather than the bottom? Maybe the difference can be noticed if your loops exit after a small number of rounds. In this case, indeed we should take care of annotating back edges (and not forward jumps to "unsafe" handlers).

@stedolan pointed out some odd behaviors of branches in recent Intel processors, so I thought that maybe poll-at-top and poll-at-bottom could make a difference here. But if there is no such difference, poll-at-top could simplify the code emitters (no more special case for poll followed by jump), so let's go for poll-at-top.

I'm not worried about loops that exit after a small number of rounds. The general wisdom is that it's not worth optimizing for this case.

@sadiqj
Owner

sadiqj commented May 7, 2021

I've removed the original comment, I see it was discussed earlier on. I'm losing track of the PR conversations.

The only cost I see in moving from polling-at-bottom to polling-at-top is an additional not-taken branch, which is folded in ocaml#10039 (at the cost of a bit more complexity in the emitters).

Copy link

@xavierleroy left a comment


Hold the press, we're optimizing the static analyses too much...

let rec before i end_ exn =
  match i.desc with
  | Iend -> end_
  | Iop (Ialloc _ | Ipoll _ | Itailcall_ind | Itailcall_imm _) -> true


I'm afraid this "shortcut" is not right. By not computing before i.next end_ exn, we omit to analyze loops that occur after the Ialloc instruction. These loops will have the default "false" abstract value, hence a polling point will always be inserted in these loops.

Using a pedestrian dataflow engine that always goes over all the instructions (ocaml#10404) may be a good idea, after all.

Comment on lines +180 to +187
| Icatch (Cmm.Recursive, handlers, body) ->
    let join = before i.next end_ exn in
    let update changed (n, h) =
      let b0 = get_handler n in
      let b1 = before h join exn in
      if b1 = b0 then changed else (set_handler n b1; true) in
    while List.fold_left update false handlers do () done;
    before body join exn


I wonder if we really need a fixpoint iteration here. We're looking for paths from entry to a PTRC that don't go through an Ialloc or Ipoll. If one such path exists, a loop-free path (one that does not go through a loop back-edge) exists too?

@damiendoligez
Author

This PR is superseded by #4
