Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for interrupting another domain #11411

Open
ejgallego opened this issue Jul 7, 2022 · 16 comments
Open

Support for interrupting another domain #11411

ejgallego opened this issue Jul 7, 2022 · 16 comments

Comments

@ejgallego
Copy link

ejgallego commented Jul 7, 2022

Dear OCaml developers,

in our experiments to port Coq and coq-lsp to OCaml 5, it was suprising to me to learn that "true" domain cancellation (which I guess it'd map to pthread_cancel) is not planned for 5.x

See discussion at https://discuss.ocaml.org/t/a-tutorial-on-parallel-programming-in-ocaml-5/9896/9?u=ejgallego

I cannot stress enough how critical this functionality is for a large variety of application domains, most remarkably concurrent proof search, speculative document processing (used for research in ML), and user interfaces such as the upcoming coq-lsp incremental type-checker.

The suggested alternative, which is to instrument user code to check explicitly for an interrupt, is just not feasible (and indeed, we have tried that already).

I understand that getting this feature right is tricky, however I see other systems such as Lean or Isabelle (in PolyML) do manage spawn and kill thread happily, so as a user I'd be willing to have access to such API even if it would come with a burden in terms of using it carefully.

I'd also be happy if the solution is not super portable at first, or would have us writing platform-specific code.

@ejgallego ejgallego changed the title Support for pthread_kill / Domains.kill in OCaml 5.0 Support for pthread_cancel / Domains.cancel in OCaml 5.0 Jul 7, 2022
@gasche
Copy link
Member

gasche commented Jul 7, 2022

In the Discuss thread, @gadmm suggests using asynchronous exceptions, for example his work on Memprof-limits. I agree that this sounds like a better approach than sending signals between threads (which does not fit the OCaml semantics model very well), or even worse terminating a thread uncooperatively.

Supporting asynchronous exceptions in this way is not in scope for 5.0 (for example we won't have Memprof yet), but may definitely happen in 5.x. @gadmm has the interest and expertise, and you have a use-case; it's a match!

@xavierleroy
Copy link
Contributor

xavierleroy commented Jul 8, 2022

Even pthread_cancel will not (by default) stop an uncooperative thread: the thread must reach a cancellation point (e.g. a blocking system call) or poll explicitly using pthread_testcancel... There's an optional "asynchronous cancellation" mode that can be selected, but is so dangerous that nobody uses it.

Bottom line: Pthread-style cancellation is not what you're looking for, nor an appropriate mechanism to build what you're looking for. Asynchronous exceptions might one day do what you need, although many design points remain unclear to me.

@gadmm
Copy link
Contributor

gadmm commented Jul 8, 2022

First, thanks for being vocal about these use-cases. I know about them because we have talked about it together but it is good to have this knowledge and experience in public writing.

If the title is "Support for pthread_cancel" it is not going to work for the reasons explained here and on discuss. If the title is changed back to "pthread_kill", then there is some sense to the question.

First, for cancellation you can still use signals (raising from the OCaml signal handler) and use the sigmask to target the chosen threads in multicore. If you have several threads that you want to cancel at the same time, then for this kind of application you can send your signal repeatedly with some delay until all the targeted threads are cancelled. One advantage of signals over other approaches is that they also interrupt blocking IO operations.

In this context, what pthread_kill brings is the ability to target a specific thread. This will not work with the current implementation because even if the C signal handler runs in the correct thread you have no guarantee that the OCaml signal handler does so (unless you are able to use sigmask, but in this case you do not need pthread_kill). It is a bit of work to bring support for it so a compelling use-case is needed (that goes more into the specific technical details of the pthread_kill advantages compared to other solutions).

If the behaviour regarding blocking IO is not needed, then it would be easier to support a Domain.interrupt/Thread.interrupt function that would run a callback asynchronously in a chosen domain/thread by implementing it inside the runtime without going through signals (it would even be more portable since I know you have issues with signals on Windows). At this point you are close to what memprof-limits proposes.

By the way, memprof-limits was written with the Coq proof assistant implementation in mind, so I am confident that its uses are well-covered once Memprof is back. We have not started experimenting with bringing it to the Coq proof assistant yet because we still need to find someone time to do it, and also because it is useless until support for memprof is widespread enough (this is something that the local Coq devs said would be happy to have, and to my surprise they are even more impatient for the allocation limits than the memory limits).

@ejgallego
Copy link
Author

Thank you very much for the answers, indeed it seems that the way to go is with the solution discussed by @gadmm using async exceptions; I'll try to reword the issue later on (or @gadmm , please feel free to do so yourself).

My reading of what we want here then is "better runtime support for certain kind of async exceptions" ; is that correct?

I propose we remove references to pthread_cancel/kill as they are indeed misleading.

Actually, maybe there is already a way using the C OCaml API where we could implement Domain.signal that would set the particular domain to be interrupted in the next allocation. Is that the case?

Is it written down somewhere in what cases does a domain check for async exceptions ? (I assume in allocation, right?)

I'll be happy to test this method using the coq-lsp project, where we often interrupt Coq (currently sending a signal), actually @gadmm send me over the years a lot of helpful information about async exceptions, but only now I have code (finally) that requires that knowledge, thanks!

Out of curiosity I checked what polyML uses:

    // Set the stack limit pointer to the top of the stack to cause
    // a trap when we next check for stack overflow.

@ejgallego
Copy link
Author

@gadmm we wrote the messages at the same time hehe, please indeed update the title of the issue.

@xavierleroy xavierleroy changed the title Support for pthread_cancel / Domains.cancel in OCaml 5.0 Support for interrupting another domain Jul 8, 2022
@xavierleroy
Copy link
Contributor

I changed the title of this feature wish to something more descriptive and less focused on a particular implementation.

Is it written down somewhere in what cases does a domain check for async exceptions ? (I assume in allocation, right?)

At allocations and at explicit "polling" points introduced by the compiler when it sees code that might loop forever without allocating. The long story is at #10039

Actually, maybe there is already a way using the C OCaml API where we could implement Domain.signal that would set the particular domain to be interrupted in the next allocation. Is that the case?

There's a more general mechanism already in place to support stop-the-world GC and signal handling. What you outline is probably doable, but it won't interrupt domains that are blocked on I/O. For this, actual OS signals are needed, with the caveat that it will not be portable to Windows.

@jonahbeckford
Copy link

There's a more general mechanism already in place to support stop-the-world GC and signal handling. What you outline is probably doable, but it won't interrupt domains that are blocked on I/O. For this, actual OS signals are needed, with the caveat that it will not be portable to Windows.

APC may work on Windows instead of signals: https://docs.microsoft.com/en-us/windows/win32/sync/asynchronous-procedure-calls . I believe APC is already used in

/* Pool of worker threads.
*
* These functions help to manage a pool of worker threads and submit task to
* the pool. It helps to reduce the number of thread creation.
*
* Each worker are started in alertable wait state and jobs are submitted as
* APC (asynchronous procedure call).
*/

for the caml_unix_select() implementation.

(I haven't been following OCaml 5 deeply, but hopefully we don't end up where Coq et al don't work on Windows.)

@gasche
Copy link
Member

gasche commented Jul 8, 2022

Actually, maybe there is already a way using the C OCaml API where we could implement Domain.signal that would set the particular domain to be interrupted in the next allocation. Is that the case?

My non-expert understanding:

  1. No, that is not currently the case
  2. It would be possible and not too hard to prototype something like this in the runtime (or the capacities to define this as a library, for example hooks that run on each domain in each minor collection and are allowed to raise exceptions)
  3. But there is a good risk that the resutling system would be brittle in unexpected ways, for reasons (4) and (5).
  4. The runtime needs to be carefully reviewed and tweaked to ensure that raising exceptions from allocation sites will not blow things up. (This was a good share of the work on Memprof, which has to be re-done for the new Multicore runtime, and has partly been done by @gadmm already.)
  5. It is hard to write correct code without being able to reason about where exceptions may or may not interrupt the computation. For example idioms like opening a file handler then installing a close continuation using Fun.protect are incorrect if the call to Fun.protect can blow in your face with an exception. This is why @gadmm is proposing to implement masking.

@gasche
Copy link
Member

gasche commented Jul 8, 2022

Note: my understanding that "interrupting I/O" is not in scope of @ejgallego's question on how to terminate (concurrent) search processes in the application domain he mentions, which are CPU- and memory-heavy but not IO-heavy. And even when we do IO there, it may suffice to make I/O asynchronous (as done in, say, Lwt) rather than actually cancellable: it's mostly okay if an I/O computation keeps going on for longer than necessary if it does not block computation.

@ejgallego
Copy link
Author

@gasche , I agree on all fronts, indeed this is not easy, but I don't see any other way for systems like Coq who would like to start checking proofs, then interrupt when the user for example does an edit, right?

Indeed the workloads I'm thinking here are not IO bound at all, in fact, I'm happy if (explicit) IO is forbidden withing the solution. There are already pretty good ways to handle heavy I/O tasks concurrently I understand.

@ejgallego
Copy link
Author

At allocations and at explicit "polling" points introduced by the compiler when it sees code that might loop forever without allocating. The long story is at #10039

Thanks, that was a fascinating read !

@kayceesrk
Copy link
Contributor

One possibility is to not implement cancellation at the domain level but at the user-level thread (fiber) level in domainslib. Eio already has cancellation of fibers. Such a cancellation may perhaps be implemented on top of @gadmm's memprof-limits features.

@gadmm
Copy link
Contributor

gadmm commented Jul 9, 2022

@gasche I do not think the points 4. and 5. have to be considered in this issue; they are general issues which concerns the asynchronous exceptions that are already available in OCaml. For 4. there was ongoing work to further fix the runtime, because in many places it was/is not ready to receive asynchronous exceptions (with many parts of the runtime left unaudited). For 5. there has been a PR open for a while to implement masking (in the meanwhile memprof-limits shows that it is already possible to implement a form of masking in current OCaml).

@kayceesrk I agree the question is best viewed as not specific to domains, but I am not sure fibers are the right abstraction either. I think you want to be parametric in a notion of thread id. For instance if you want to use memprof-limits to interrupt domainslib tasks, it would need to track the cancellation state on a per-task basis. Therefore one would change memprof-limits to accept as parameter a Thread_id module as follows:

module type Thread_id = sig
  type t
  val self : unit -> t
  val compare : t -> t -> int
end

where the provided self would return a unique id of the currently-running domainslib task. IIUC, since you can have several effects in a program, the notions of fiber and task do not necessarily coincide.

The biggest challenge would be rewriting domainslib to protect its shared state from asynchronous exceptions. Memprof-limits provides a bracket combinator; it seems to be enough for the use-case of the Coq proof assistant (which does use shared state in its interruptible parts but not too much), but it is probably more limited for libraries which I suspect state-intensive such as domainslib.

Domainslib seems pleasingly simple, which makes this an interesting experiment. Maybe it would show that you need a more general notion of resource first.

@ejgallego

Is it written down somewhere in what cases does a domain check for async exceptions ? (I assume in allocation, right?)

I want to add:

  • entering blocking sections
  • explicit polling from C (caml_process_pending_actions*)
  • not the allocations done from C

@cemerick
Copy link

Just thought I'd chime in here to say that I also have a use case for this, in a context adjacent to the OP's coq work (the evaluation of a constraint satisfaction logic). As it stands, I've arranged things so that evaluations are run in a separate process, so that reasonable limits can be placed on runtime duration, memory utilization, and so on. It may be that planned instrumentation of the logic interpreter will allow for an "interrupt" as necessary, but I'd much prefer to use a Domain-level mechanism for forcible termination of a task so that the interop friction of process parallelism can be eliminated entirely.

So, insofar as movement on this topic occurs, I would be very glad to run any implementation through its paces.

@ejgallego
Copy link
Author

Hi folks, finally I got some time to port coq-lsp to memprof-limits (in OCaml 4.x) and I'm very happy to report that the token-based interruption works fantastically in all cases where Coq's polling didn't.

[Recall that polling in critical paths like conversion is not very viable in a system like Coq, due to the perf hit]

@gadmm
Copy link
Contributor

gadmm commented Jun 22, 2023

@ejgallego refers to "token limits" https://guillaume.munch.name/software/ocaml/memprof-limits/#token-limits, whereby a thread can ask another thread (preferably CPU-bound rather than IO-bound) to stop by setting a cancellation token, with polling done automatically and regularly from a memprof callback. Support requires memprof for OCaml 5.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants