There are not enough polls during non-tail recursive calls #11580

Open
gadmm opened this issue Sep 28, 2022 · 20 comments
@gadmm
Contributor

gadmm commented Sep 28, 2022

I disassembled the following function recently, looking at polling behaviour:

let rec f n = if n = 0 then [] else () :: f (n - 1)

Here is the result in OCaml 5 (OCaml 4.14 is similar).

Dump of assembler code for function camlRec__f_268:
   0x000055555556ae90 <+0>:	lea    r10,[rsp-0x148]
   0x000055555556ae98 <+8>:	cmp    r10,QWORD PTR [r14+0x20]
   0x000055555556ae9c <+12>:	jb     0x55555556aeec <camlRec__f_268+92>   ; check stack capacity
   0x000055555556ae9e <+14>:	sub    rsp,0x8
   0x000055555556aea2 <+18>:	cmp    rax,0x1
   0x000055555556aea6 <+22>:	jne    0x55555556aeb4 <camlRec__f_268+36>   ; if n = 0 ...
   0x000055555556aea8 <+24>:	mov    eax,0x1
   0x000055555556aead <+29>:	add    rsp,0x8
   0x000055555556aeb1 <+33>:	ret    
   0x000055555556aeb2 <+34>:	xchg   ax,ax
   0x000055555556aeb4 <+36>:	add    rax,0xfffffffffffffffe
   0x000055555556aeb8 <+40>:	call   0x55555556ae90 <camlRec__f_268>      ; f (n-1)
   0x000055555556aebd <+45>:	sub    r15,0x18
   0x000055555556aec1 <+49>:	cmp    r15,QWORD PTR [r14]
   0x000055555556aec4 <+52>:	jb     0x55555556aee5 <camlRec__f_268+85>
   0x000055555556aec6 <+54>:	lea    rbx,[r15+0x8]
   0x000055555556aeca <+58>:	mov    QWORD PTR [rbx-0x8],0x800
   0x000055555556aed2 <+66>:	mov    QWORD PTR [rbx],0x1
   0x000055555556aed9 <+73>:	mov    QWORD PTR [rbx+0x8],rax
   0x000055555556aedd <+77>:	mov    rax,rbx
   0x000055555556aee0 <+80>:	add    rsp,0x8
   0x000055555556aee4 <+84>:	ret    
   0x000055555556aee5 <+85>:	call   0x55555558d84c <caml_call_gc>
   0x000055555556aeea <+90>:	jmp    0x55555556aec6 <camlRec__f_268+54>
   0x000055555556aeec <+92>:	push   0x22
   0x000055555556aeee <+94>:	call   0x55555558d5b0 <caml_call_realloc_stack>
   0x000055555556aef3 <+99>:	pop    r10
   0x000055555556aef5 <+101>:	jmp    0x55555556ae9e <camlRec__f_268+14>

As you can see, there is no polling during the recursive calls.

One explanation can be found in @stedolan's description of his polling-insertion strategy (#10039 (comment)):

[...] Also, in such an infinite sequence of function calls, at most finitely many of them can be non-tail calls. (If there are infinitely many non-tail calls, then the program soon terminates with a stack overflow).

In other words, it is expected that this function either terminates at some point or blows up the stack.

This is not good enough for two reasons:

  • There is no bound on the latency of reacting to inter-domain interrupts (e.g. minor GC). (Compare to polling at every iteration of a for loop, for instance.)
  • OCaml multicore makes it easy and realistic to run without a stack limit or with very large stacks, leading to non-interruptible allocating loops (which also deadlock the other domains).

Possible solutions:

  1. Add more polling points (likely expensive in terms of code size)
  2. Poll inside caml_call_realloc_stack (still unbounded: stack grows exponentially)
  3. Idem 2. but in addition arrange that one calls caml_call_realloc_stack at regular intervals using the strategy described here: [discussion] Avoiding quadratic stack-scanning behavior on very large stacks? #10947 (comment)
  4. Idem 3., but also do checks more often when going "down" the stack. @gasche alludes to the possibility of such a strategy here: "[...] we can add a limit on contiguous stack size: when trying to grow the stack above that limit, we add a new stack instead." (But the details are less clear to me than 3.)

Related issue: with enough polling points, and once Memprof is back (and correctly tracks fiber allocation), this gives a way to prevent unbounded stack growth using memory limits, i.e. treat stack consumption and heap consumption the same way.

@gasche
Member

gasche commented Sep 28, 2022

I'm confused. Your function allocates in the "recursive" case; isn't this just as good as having a poll point?

@gasche
Member

gasche commented Sep 28, 2022

Ah, the problem is that the first allocation only occurs after n non-tail-calls. Sorry for the naive question above. Indeed the first idea is to tweak the stack check to ensure regular polls.

@gadmm
Contributor Author

gadmm commented Sep 28, 2022

It only allocates after reaching the last call of the recursion. (I chose this example on purpose!) With TRMC it is different, of course. Also, for an example that does not poll even when returning from the call chain, replace "() ::" with "1 +".
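For concreteness, both variants mentioned here can be sketched as follows (these are editorial illustrations, not part of the original report; the `[@tail_mod_cons]` attribute requires OCaml >= 4.14):

```ocaml
(* With TRMC, the cons cell is allocated *before* each (now tail) recursive
   call, so every iteration passes an allocation point, which doubles as a
   poll. *)
let[@tail_mod_cons] rec f n =
  if n = 0 then [] else () :: f (n - 1)

(* The "1 +" variant allocates nothing at all: neither the calls going down
   nor the additions on the way back up ever reach an allocation or poll
   point. *)
let rec g n =
  if n = 0 then 0 else 1 + g (n - 1)
```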

@gadmm
Contributor Author

gadmm commented Sep 28, 2022

Note that caml_call_realloc_stack does not even poll currently. This is the first obvious step.

@gasche
Member

gasche commented Sep 28, 2022

In fact with

let rec loop n =
  if n = 0 then ()
  else loop (n - 1); () (* not a tail call *)

you have the latency issue both on repeated function calls (solved by a clever stack check) and on repeated function returns. If you want to also solve this latter issue, it may help to push fake frames on the call stack indeed.

@xavierleroy
Contributor

All this is known already. If you look at the bytecode interpreter, you'll see that it polls for pending asynchronous actions at every call and at every return. For ocamlopt-generated code, this would incur prohibitive overhead.

As to polling in caml_call_realloc_stack, it's not clear to me that the shape of the stack at that point can be adequately described by a frame descriptor. We may need to extend frame descriptors so that they can describe roots residing in the "extra params area" of the domain state, for example.

Finally, the current poll insertion strategy was designed to prevent infinite loops that don't poll, or in other words to ensure that polling always eventually takes place. Bounding the latency of polls was not an objective, and I'm not sure we want to make it into an objective.

@alainfrisch
Contributor

Finally, the current poll insertion strategy was designed to prevent infinite loops that don't poll, or in other words to ensure that polling always eventually takes place. Bounding the latency of polls was not an objective, and I'm not sure we want to make it into an objective.

I really didn't follow anything about the poll point story, so sorry for the naive question, but: if bounding the latency is a non-goal, why are poll points inserted in for-loops? I assume that "eventually" means a bit more than just "eventually", and more something like "reasonably soon, but with no strict guarantee".

@gadmm
Contributor Author

gadmm commented Sep 28, 2022

All this is known already. If you look at the bytecode interpreter, you'll see that it polls for pending asynchronous actions at every call and at every return. For ocamlopt-generated code, this would incur prohibitive overhead.

What is new here is the parallel GC whose latency depends on the latency of polling. Otherwise this was even known before the insertion of new polling points indeed.

Specifically, do you believe that some of the strategies I proposed might incur prohibitive overhead?

As to polling in caml_call_realloc_stack, it's not clear to me that the shape of the stack at that point can be adequately described by a frame descriptor. We may need to extend frame descriptors so that they can describe roots residing in the "extra params area" of the domain state, for example.

Since Caml_state is directly accessible from C, it might be simpler to let caml_try_realloc_stack deal with those (e.g. making temporary copies as local roots whenever there is a pending action). The only information missing currently is how many of them are live. It might be cheap enough to update this information in the few callers and callees that use these extra params (either using an additional "length" field reset to 0 by the callee, or by treating the array as NULL-terminated, which the callee truncates).

@xavierleroy
Contributor

If bounding the latency is a non-goal, why are poll points inserted in for-loops?

The compiler is imperfect but not malicious. for i = min_int to max_int takes forever on a 64-bit machine, but a chain of function calls will stop or overflow the stack in one second or so.

@gasche
Member

gasche commented Sep 29, 2022

Note (for context): with the stop-the-world design the pauses affect more than latency; they also hurt GC throughput if other domains are waiting idle.

The vibe I get from Xavier's answers is that this is a low-priority concern: we don't know of an actual program that is affected by this issue, in part because all OCaml 4 code was written with a relatively low stack limit in mind. So we should maybe not rush to implement something complex when the problem has not manifested itself in practice.

Personally I agree with Guillaume that this is a bug: I understand that poll points were designed to guarantee timely service of cross-domain interruptions, and currently call/return work breaks this guarantee. That does not mean that it's a high-priority bug of course, but I think it is a guarantee worth preserving.

(On the reasoning line "in practice this is not an issue", do we know of real-world Multicore programs where insertion of poll points is necessary to avoid hurtful pauses? How far could we have gone without adding poll points at all?)

@kayceesrk
Contributor

(On the reasoning line "in practice this is not an issue", do we know of real-world Multicore programs where insertion of poll points is necessary to avoid hurtful pauses? How far could we have gone without adding poll points at all?)

A non-blocking work-stealing scheduler might do something morally similar to this:

let work_available = Atomic.make false
...
let rec spin_wait_for_work () = 
  if Atomic.get work_available then ()
  else (Domain.cpu_relax (); spin_wait_for_work ())

spin_wait_for_work does not perform any allocations. Without poll points, this code would cause a deadlock if the domain that may produce the work triggers a stop-the-world section for a minor GC.
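As an editorial aside, a workaround sometimes used until such loops poll reliably is to force a heap allocation (and hence an effective poll point) on each iteration. This is a hedged sketch, not an endorsed API: `Sys.opaque_identity` is used here only to keep the allocation from being optimized away, and `Domain.cpu_relax` requires OCaml 5.

```ocaml
let work_available = Atomic.make false

(* Spin loop kept interruptible by performing a minor-heap allocation
   (which acts as a poll point) on every iteration. *)
let rec spin_wait_for_work () =
  if Atomic.get work_available then ()
  else begin
    ignore (Sys.opaque_identity (ref ()));  (* allocation => poll *)
    Domain.cpu_relax ();
    spin_wait_for_work ()
  end
```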

@gadmm
Contributor Author

gadmm commented Oct 27, 2022

At https://discuss.ocaml.org/t/understanding-performance-of-a-multi-core-program/10703/10 there is another example:

let rec fib n = if n <= 2 then 1 else fib (n - 1) + fib (n - 2)

As @lthls explains:

And your case shows that it’s dangerous, because with a fixed stack length you can still fit an arbitrary number of recursive calls, as long as you have a non-linear call graph.

(I tried to build such an example, but apparently I lacked imagination.)

@gadmm
Contributor Author

gadmm commented Nov 3, 2022

To avoid adding polling points and to preserve code size, the proposed solution was to force a check every N words of allocated stack for N large (e.g. >1KB). However this is not enough, since the example suggests that even with small amounts of stack, a recursive function can take a long time without polling.

There is another solution that avoids inserting new polling points with the current polling mechanism. Currently the function prologue compares the stack pointer to current_stack, but this could be changed into a comparison against a new variable stack_limit modelled after young_limit. Interrupting a domain would consist in setting both young_limit and stack_limit atomically to a very high value, and there would be polling inside caml_call_realloc_stack. After polling, the current caml_reset_young_limit() would be replaced with a function caml_reset_limits() that resets both young_limit and stack_limit in a race-free way, which mostly avoids double polling at every interrupt. There remains the case of a race during an interrupt, with the trace:

young_limit₁ = UINTNAT_MAX ∥
                           ∥ caml_reset_limits()
stack_limit₁ = UINTNAT_MAX ∥

The consequence is a spurious poll, but those are always harmless.
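This interrupt/reset protocol can be modelled in OCaml with Atomic. The names below are illustrative stand-ins for the runtime's C state, not real identifiers; the point is only that the interrupting domain's two writes are separate, so caml_reset_limits can interleave between them:

```ocaml
(* Toy model of the two limits. 0 stands in for a "real" limit value. *)
let young_limit = Atomic.make 0
let stack_limit = Atomic.make 0

(* The interrupting domain raises both limits with two separate writes;
   a concurrent reset landing between them leaves stack_limit raised,
   causing one spurious (harmless) poll. *)
let interrupt () =
  Atomic.set young_limit max_int;   (* young_limit := UINTNAT_MAX *)
  Atomic.set stack_limit max_int    (* stack_limit := UINTNAT_MAX *)

(* Model of caml_reset_limits: restore both limits after polling. *)
let reset_limits ~young ~stack =
  Atomic.set young_limit young;
  Atomic.set stack_limit stack
```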

This does not solve the issue of the absence of polling when returning, but this is not a problem in the case of small call chains.

I mostly see how to implement a fix along these lines, but I would need some guidance for making caml_call_realloc_stack a correct polling location, if possible. I would like to know if there is anything to be mindful about in addition to the extra_params. I also would like to know if either of these strategies is realistic (both involve adding suitable frame descriptors):

  • Saving the extra_params inside caml_call_realloc_stack locally as roots. In order to know how many need to be saved, record their number in the frame descriptors of those polling locations. (Note that the extra params need to be saved inside caml_call_realloc_stack, not merely recorded as roots, in order to run async callbacks, which can overwrite the extra params.)
  • Or do not poll inside caml_call_realloc_stack when there are live values in the extra_params. Do poll inside caml_call_realloc_stack when extra_params are not used, and emit a regular polling location at the start of the function when extra_params are in use.

@gasche
Member

gasche commented Nov 3, 2022

This sounds like a very nice solution to me.

This does not solve the issue of the absence of polling when returning, but this is not a problem in the case of small call chains.

I think the key idea is that if the call chains are assumed bounded, then there must be an approximate balance of calls and returns in long sequences of poll-free instructions, so in particular checking only on calls (or only on returns) is enough.

(There may still be something we can do about the unbounded case, but this can be discussed later.)

Regarding extra_params: these questions are not for me to answer, but at least I dug up the code that handles these things, which was non-obvious to me. It may be of interest to others, so here it is. There are three different places that "add stuff at the beginning of a function's code", in order of compilation passes:

  1. first Selectgen.emit_fundecl, which contains in particular the logic to move extra_params into normal stack slots in

self#insert_moves env loc_arg rarg;

it is also in charge of inserting a "prologue poll" if necessary, as decided by the poll-insertion logic:

ocaml/asmcomp/selectgen.ml

Lines 1186 to 1193 in 9d5c55c

let polled_body =
  if Polling.requires_prologue_poll ~future_funcnames
       ~fun_name:f.Cmm.fun_name body
  then
    instr_cons_debug
      (Iop (Ipoll { return_label = None })) [||] [||] f.Cmm.fun_dbg body
  else
    body

(so it would probably be easy to implement @gadmm's second proposed strategy at this level, by forcing the poll inserting if the extra_params are not empty)

  2. then Linearize.fundecl will add a prologue (the Lprologue instruction), which is related to setting up the stack frame for the function. (Leaf functions without spilled temporaries typically don't need an Lprologue instruction.)

  3. finally Emit.fundecl is in charge of emitting the call to caml_call_realloc_stack under certain conditions (decided in Emitaux.preproc_stack_check).

So: currently we call caml_call_realloc_stack before the stack frame is set up, which itself comes before the extra_params are handled. This makes sense: we check that we have enough stack space before we start using it.

(I wonder if another strategy may be to do things the other way around, by having the caml_realloc_stack call in the function prologue placed after the frame setup and after the extra parameters are moved into stack slots. I'm not sure what that means in terms of stack threshold / stack safety margin (see #11076), but it should do the right thing with the GC, and so it may be simpler to implement than @gadmm's first strategy.)

@gadmm
Contributor Author

gadmm commented Nov 4, 2022

Thanks, this helps; you dug deeper than I managed. In the meantime I have been running Ackermann(4,2) for about four hours (in a version tweaked so that it does not poll) and it is not yet close to triggering a stack overflow (it is currently at about 65MB of stack usage).

@gadmm
Contributor Author

gadmm commented Nov 4, 2022

  • Could polling via stack_limit be used to avoid polling at the start of the function in some situations (and reduce code size)? Unfortunately the logic for adding a poll is in Selectgen but the logic for calling caml_call_realloc_stack is later in Emit, so this looks like some work to implement.
  • If inserting a poll in Selectgen is necessary for correctness for the GC, would this also be the case for polling inside caml_call_realloc_stack, and thus should some of the stack check logic be moved to Selectgen?

So far this looks like it will require me to learn the details of several compiler passes before I see how to go about it. I could try if someone who knows the code sees a solution and guides me through it, or I can leave the implementation of this idea to someone else.

@gasche gasche self-assigned this Mar 3, 2023
@gasche
Member

gasche commented Mar 3, 2023

I looked at this issue again. Here is a summary of what I understand so far:

  • There is a real issue of polling latency for functions like fib.
  • @gadmm proposed a fix, the stack_limit trick, which is rather elegant; it's a technique to have caml_realloc_stack check for external notifications, in addition to allocation points and poll points. (It does not work for latency coming from long return chains, but those currently look possibly more academic than the fib example.)
  • Implementing this is made more tricky by independent aspects of the runtime (extra_params in particular).

Finally, reading bits of the Polling source, I wonder if there wouldn't be an easier approach to try first.

ocaml/asmcomp/polling.ml

Lines 292 to 297 in acbffb5

let requires_prologue_poll ~future_funcnames ~fun_name i =
  if function_is_assumed_to_never_poll fun_name then false
  else
    match potentially_recursive_tailcall ~future_funcnames i with
    | Might_not_poll -> true
    | Always_polls -> false

ocaml/asmcomp/polling.ml

Lines 97 to 104 in acbffb5

"Might_not_poll" means there exists a path from the function entry to a Potentially Recursive Tail Call (an Itailcall_ind or Itailcall_imm to a forward function) that does not go through an Ialloc or Ipoll instruction. "Always_polls", therefore, means the function always polls (via Ialloc or Ipoll) before doing a PRTC. This includes the case where it does not perform any PRTC.

Simple proposal: could we weaken the analysis here and decide to insert a prologue poll not just if there exists "a poll-free path to a forward tail-recursive call", but "a poll-free path to a forward call"? I believe that this would solve the latency issues due to "long call chains", and it is a fairly simple change to implement (unlike the caml_realloc_stack idea). We could start by checking that it indeed works, and then evaluate the cost in terms of code-size increase compared to the current strategy.

(Then the idea of caml_realloc_stack could be conceived as a general trick to hide the prologue poll inside caml_realloc_stack, as suggested by @gadmm, whenever it is sound/easy to do so (no extra_params lying around).)
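The weakened predicate can be sketched on a toy, linearized instruction stream. This is an editorial simplification: the real analysis works over the compiler's Mach representation with a full control-flow graph, and the type and function below are illustrative, not compiler identifiers.

```ocaml
(* Toy instruction stream: Alloc and Poll both count as polling points. *)
type instr =
  | Alloc
  | Poll
  | Forward_call   (* any call to a not-yet-compiled function, tail or not *)
  | Other

(* Insert a prologue poll iff some prefix of the stream reaches a forward
   call without first passing an allocation or poll. *)
let requires_prologue_poll instrs =
  let rec scan = function
    | [] -> false
    | (Alloc | Poll) :: _ -> false   (* we poll before any forward call *)
    | Forward_call :: _ -> true      (* poll-free path to a forward call *)
    | Other :: rest -> scan rest
  in
  scan instrs
```

The current strategy only triggers on forward *tail* calls; widening the trigger to all forward calls is what would catch the fib-style chains, at the cost of more prologue polls.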

@xavierleroy
Contributor

could we weaken the analysis here and decide to insert a prologue poll not just if there exists "a poll-free path to a forward tail-recursive call", but just "a poll-free path to a forward call"?

We could, but I'm afraid the extra polls thus inserted would be expensive. Once again, please don't kill the performance of my programs to satisfy your idea of "enough" polls.

@bluddy
Contributor

bluddy commented Mar 5, 2023

Perhaps this is a naïve suggestion, but could we use a simple per-domain mutable integer counter to aid in polling?
We could decide that a recursive function always polls when the counter hits 0. There is no need to check that it is even the same recursive function - the mechanism works statistically. Every recursive call decrements the counter and tests its value, and once it hits 0, we poll and reset the counter to its starting value.
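In OCaml terms the proposal amounts to something like the following sketch. Everything here is hypothetical: do_poll stands in for the runtime's actual poll, and in reality the decrement-and-test would be emitted as machine code at call sites, not written by the user.

```ocaml
(* Per-domain countdown: poll once every poll_interval recursive calls. *)
let poll_interval = 1024
let poll_budget = ref poll_interval
let polls = ref 0                  (* counts polls, for illustration only *)

let do_poll () = incr polls        (* placeholder for a real runtime poll *)

(* What each recursive call site would execute: one decrement, one
   comparison, and a rare call when the budget is exhausted. *)
let maybe_poll () =
  decr poll_budget;
  if !poll_budget <= 0 then begin
    poll_budget := poll_interval;
    do_poll ()
  end
```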

@gasche
Member

gasche commented Mar 5, 2023

This is already what a "poll" is: a dynamic check that sometimes does more work. The problem with inserting polls (in your proposal, those "check that the counter is 0" tests count as polls) is that they increase code size and hurt performance slightly. The compiler tries to insert as many polls as necessary, but not more, to avoid this code-size impact. Currently it does not insert enough, and we are discussing acceptable ways to poll more without inserting too many extra checks.
