There are not enough polls during non-tail recursive calls #11580
I'm confused. Your function allocates in the "recursive" case, isn't this just as good as having a poll point?
Ah, the problem is that the first allocation only occurs after …
It allocates after reaching the last call of the recursion. (I chose this example on purpose!) With TRMC it is different of course. Also, for an example that does not poll when returning from the call chain, replace `() ::` with `1 +`.
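A sketch of the two variants being discussed (my reconstruction; the exact function from the thread is not shown in this excerpt): the `() ::` version allocates a cons cell only while *returning* from the call chain, so its first allocation-point poll happens only after the deepest call has been reached, while the `1 +` version never allocates and so never polls on this path at all.

```ocaml
(* Hypothetical reconstruction of the example under discussion. *)

(* Allocates a cons cell at each return from the recursion: the first
   allocation (hence the first poll at an allocation point) happens
   only after reaching the last call of the recursion. *)
let rec f_alloc n =
  if n = 0 then [] else () :: f_alloc (n - 1)

(* Same shape with "1 +" instead of "() ::": no allocation at all, so
   no allocation-point polling either going down or coming back up. *)
let rec f_noalloc n =
  if n = 0 then 0 else 1 + f_noalloc (n - 1)
```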
Note that …
In fact with

```ocaml
let rec loop n =
  if n = 0 then ()
  else loop (n - 1); ()  (* not a tail call *)
```

you have the latency issue both on repeated function calls (solved by a clever stack check) and on repeated function returns. If you want to also solve this latter issue, it may help to push fake frames on the call stack indeed.
All this is known already. If you look at the bytecode interpreter, you'll see that it polls for pending asynchronous actions at every call and at every return. For ocamlopt-generated code, this would incur prohibitive overhead. As to polling in …

Finally, the current poll insertion strategy was designed to prevent infinite loops that don't poll, or in other words to ensure that polling always eventually takes place. Bounding the latency of polls was not an objective, and I'm not sure we want to make it into an objective.
I really didn't follow anything about the poll point story, so sorry for the naive question, but: if bounding the latency is a non-goal, why are poll points inserted in for-loops? I assume that "eventually" means a bit more than just "eventually", and more something like "reasonably soon, but with no strict guarantee".
What is new here is the parallel GC, whose latency depends on the latency of polling. Otherwise, this was indeed known even before the insertion of the new polling points.

Specifically, do you believe that some of the strategies I proposed might incur prohibitive overhead?
Since …
The compiler is imperfect but not malicious. |
Note (for context): with the Stop-the-World design the pauses affect more than latency; they also hurt GC throughput if other domains are waiting idle.

The vibe I get from Xavier's answers is that this is a low-priority concern: we don't know of an actual program that is affected by this issue, in part because all OCaml 4 code was written with a relatively low stack limit in mind. So we should maybe not rush to implement something complex when the problem has not manifested itself in practice.

Personally I agree with Guillaume that this is a bug: I understand that poll points were designed to guarantee timely service of cross-domain interruptions, and currently call/return work breaks this guarantee. That does not mean that it's a high-priority bug of course, but I think it is a guarantee worth preserving.

(On the reasoning line "in practice this is not an issue", do we know of real-world Multicore programs where insertion of poll points is necessary to avoid hurtful pauses? How far could we have gone without adding poll points at all?)
A non-blocking work-stealing scheduler might do something morally similar to this:

```ocaml
let work_available = Atomic.make false

...

let rec spin_wait_for_work () =
  if Atomic.get work_available then ()
  else (Domain.cpu_relax (); spin_wait_for_work ())
```
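To make the spin-wait example self-contained and runnable, here is an elaboration (my addition: the producer domain and the `main` wiring are not from the thread, they only exist so the snippet can execute; requires OCaml 5 for `Domain` and `Atomic`):

```ocaml
(* A waiting domain spins without allocating; without enough poll
   points, such a loop can delay a stop-the-world request. *)
let work_available = Atomic.make false

let rec spin_wait_for_work () =
  if Atomic.get work_available then ()
  else (Domain.cpu_relax (); spin_wait_for_work ())

let () =
  (* Another domain eventually publishes work, ending the spin. *)
  let producer = Domain.spawn (fun () -> Atomic.set work_available true) in
  spin_wait_for_work ();
  Domain.join producer
```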
At https://discuss.ocaml.org/t/understanding-performance-of-a-multi-core-program/10703/10 there is another example:

```ocaml
let rec fib n = if n <= 2 then 1 else fib (n - 1) + fib (n - 2)
```

As @lthls explains:
(I tried to build such an example, but apparently I lacked imagination.)
To avoid adding polling points and to preserve code size, the proposed solution was to force a check every N words of allocated stack, for N large (e.g. > 1KB). However, this is not enough, since the example suggests that even with small amounts of stack a recursive function can run for a long time without polling.

There is another solution that avoids inserting new polling points with the current polling mechanism. Currently the function prologue compares the stack pointer to …

The consequence is spurious polls, but those are always harmless. This does not solve the issue of the absence of polling when returning, but this is not a problem in the case of small call chains. I mostly see how to implement a fix along these lines, but I would need some guidance for making …
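The proposed trick can be modeled in plain OCaml (a simulation only: in the real runtime the comparison is emitted by the compiler in the function prologue and the limit lives in the domain state; all names below are made up for illustration):

```ocaml
(* Simulation of the "clobber the stack limit" polling trick.
   All field names are hypothetical; this is not runtime code. *)
type domain_state = {
  mutable stack_limit : int;  (* what function prologues compare against *)
  real_limit : int;           (* the actual end of the stack *)
  mutable pending : bool;     (* a pending asynchronous action *)
}

(* To interrupt a domain, overwrite its limit so the very next
   prologue check fails, whatever the real stack pointer is. *)
let request_interrupt st =
  st.pending <- true;
  st.stack_limit <- max_int

(* The prologue check (stack grows downward here).  A spurious trip,
   where there is in fact enough stack, is harmless: it is simply
   treated as a poll, and the real limit is restored. *)
let prologue st ~sp ~frame_size =
  if sp - frame_size < st.stack_limit then begin
    if st.pending then st.pending <- false;  (* run pending actions *)
    st.stack_limit <- st.real_limit          (* restore the real limit *)
  end
```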
This sounds like a very nice solution to me.
I think the key idea is that if the call chains are assumed bounded, then there must be an approximate balance of calls and returns in long sequences of poll-free instructions, so in particular checking only on calls (or only on returns) is enough. (There may still be something we can do about the unbounded case, but this can be discussed later.)

Regarding `extra_params`: these questions are not for me to answer, but at least I dug up the code that handles these things, which was non-obvious to me. It may be of interest to others, so here it is. There are three different places that "add stuff at the beginning of a function's code", in order of compilation passes:
Line 1185 in 9d5c55c
it is also in charge of inserting a "prologue poll" if necessary (as decided by the poll-insertion logic: Lines 1186 to 1193 in 9d5c55c
(so it would probably be easy to implement @gadmm's second proposed strategy at this level, by forcing the poll insertion if the `extra_params` are not empty)
So: currently we do call … (I wonder if another strategy may be to do things the other way around, by having the …)
Thanks, this helps; you dug deeper than I managed. In the meantime I have been running Ackermann(4,2) for about four hours (in a version tweaked so that it does not poll) and it is not yet close to triggering a stack overflow (it is currently at about 65MB of stack usage).
So far this looks like it will require me to learn the details of several compiler passes before I see how to go about it. I could try it if someone who knows the code sees a solution and guides me through it, or I can leave the implementation of this idea to someone else.
I looked at this issue again. Here is a summary of what I understand so far:
Finally, reading bits of the Polling source, I wonder if there wouldn't be an easier approach to try first. Lines 292 to 297 in acbffb5
Lines 97 to 104 in acbffb5
Simple proposal: could we weaken the analysis here and decide to insert a prologue poll not just if there exists "a poll-free path to a forward tail-recursive call", but "a poll-free path to a forward call"? I believe that this would solve the latency issues due to "long call chains", and it is a fairly simple change to implement (unlike the …).

(Then the idea of …)
We could, but I'm afraid the extra polls thus inserted would be expensive. Once again, please don't kill the performance of my programs to satisfy your idea of "enough" polls. |
Perhaps this is a naïve suggestion, but could we use a simple global per-domain mutable integer value to aid in polling? |
This is already what a "poll" is: a dynamic check that sometimes does more work. The problem with inserting polls (in your proposal, the "checks that the counter is 0" count as polls) is that they increase code size and hurt performance slightly. The compiler tries to insert as many polls as necessary, but no more, to avoid this code-size impact. Currently it does not insert enough, and we are discussing acceptable ways to poll more without inserting too many extra checks.
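A small OCaml model of what a poll point amounts to (conceptual only; in ocamlopt-generated code a poll is a single compare-and-branch against the domain's limit, not a function call, and the names below are made up): the common case costs one cheap flag check, and only when an action is pending does the slow path run.

```ocaml
(* Conceptual model of a poll point: a cheap dynamic check that
   occasionally dispatches to a slow path (GC work, stop-the-world
   handshakes, signal handling, ...). *)
let action_pending = Atomic.make false
let polls_taken = ref 0

let process_pending_actions () =
  incr polls_taken;              (* stand-in for the real slow path *)
  Atomic.set action_pending false

let[@inline] poll () =
  if Atomic.get action_pending then process_pending_actions ()

(* Every inserted poll costs one check and a bit of code size, which
   is why the compiler tries to insert as few as possible while still
   ensuring that polling eventually takes place. *)
let rec count_down n =
  poll ();
  if n = 0 then () else count_down (n - 1)
```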
I disassembled the following function recently, looking at polling behaviour:
Here is the result in OCaml 5 (OCaml 4.14 is similar).
As you can see, there is no polling during the recursive calls.
One explanation can be found in @stedolan's explanation of his polling insertion strategy (#10039 (comment))
In other words, it is expected that this function either terminates at some point or blows up the stack.
This is not good enough for two reasons:
Possible solutions:

- Poll in `caml_call_realloc_stack` (still unbounded: the stack grows exponentially)
- Call `caml_call_realloc_stack` at regular intervals, using the strategy described here: [discussion] Avoiding quadratic stack-scanning behavior on very large stacks? #10947 (comment)

Related issue: with enough polling points, and once Memprof is back (and correctly tracks fiber allocation), this gives a way to prevent unbounded stack growth using memory limits, i.e. treat stack consumption and heap consumption the same way.