Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cancellation may not always immediately happen at checkpoints #2973

Open
arthur-tacca opened this issue Mar 14, 2024 · 8 comments
Open

Cancellation may not always immediately happen at checkpoints #2973

arthur-tacca opened this issue Mar 14, 2024 · 8 comments

Comments

@arthur-tacca
Copy link
Contributor

arthur-tacca commented Mar 14, 2024

Updated issue text (after thinking about it a bit)

Summary

Any awaitable function in Trio (i.e., await trio.foo(...)) is guaranteed to be a checkpoint. That means, aside from being a scheduling point (which is not relevant to this issue), that it is a cancellation point: i.e., if the function is called from within a cancellation scope that is cancelled (subject to shielding rules) then it will raise trio.Cancelled. Of course, a function that is waiting (in a sleep, for I/O data, etc.) can also be interrupted by a cancellation (with a few well-documented exceptions).

But what if the cancellation scope becomes cancelled just as the function is finishing? Then, surprisingly (to me), it might not raise trio.Cancelled. Here's a simple example:

import trio

async def wait_for_event(ev: trio.Event):
    await ev.wait()
    print("If this prints then wait() did not raise")

async def main():
    async with trio.open_nursery() as n:
        ev = trio.Event()
        n.start_soon(wait_for_event, ev)
        await trio.sleep(0.1)
        ev.set()
        n.cancel_scope.cancel()

trio.run(main)

This will print the message. That's because await ev.wait() simply returns, despite being in a cancelled scope at that point.

What's more, if you swap the last two lines of main() (ev.set() and n.cancel_scope.cancel()) then the message will not be printed, because await ev.wait() raises a trio.Cancelled exception instead. I would expect that the behaviour depends on the overall state at that moment (the event is in a set state and in a cancelled context, so one or the other deterministically wins). Instead, it depends on the order things happened, even between checkpoints. I can see why that is likely to have happened internally, but it still seems against the spirit of Trio.

Why I care

Here's an almost-concrete example of why this is a problem, using my little aioresult library:

async def use_result(rc: aioresult.ResultCapture):
    # ...
    await rc.wait_done()
    val = rc.result()
    # ... use val ...

async def main():
    async with trio.open_nursery() as n:
        rc = aioresult.ResultCapture(n, produce_result)
        n.start_soon(use_result, rc)
        # ... some time later ...
        if some_condition:
            n.cancel_scope.cancel()

Here's the risk: produce_result() and use_result() are both cancelled but, await rc.wait_done() returns without exception (because when produce_result() finished with trio.Cancelled that set the event inside aioresult.ResultCapture). That means that rc.result() raises an aioresult.TaskFailedException (because the routine it wraps raised an exception). Which means that the simple cancellation get converted into a full-fat exception propagated out of the nursery!

Now, I don't think this can happen because the event gets cancelled just before it's set in this case. But it seems like it's got a fragile dependency on undocumented behaviour here.

The request

I think there should be changes relating to this:

  • Trio's docs about checkpoints (the two main places listed in "documentation" in the original issue text) make it really clear that cancellation is not guaranteed to be checked as the call is finishing. In other words, if surrounding cancel scope is cancelled while await trio.foo() is in progress then it is possible for the routine to return without raising trio.Cancelled.
  • BUT, for some particular functions where it's possible, the functions are changed to add that guarantee and that is documented. That would include, IMO, at least Event.wait() and await trio.MemoryReceiveChannel.receive(), and probably also trio.lowlevel.ParkingLot. I'm not so interested in the lower-level synchronization primitives (CapacityLimiter, Lock, Semaphore and Condition) but I suppose it applies to those too.

I've not put my money where my mouth is by submitting a pull request!

Implementation

One problem with this idea is that I'm not sure how it could be efficiently implemented. Ideally, late in the body of these functions (after waiting for their condition to be true) we would re-check for cancellation but without doing a schedule point because we have just been scheduled. That means that we want a sync-coloured cancellation check, but that was suggested 5 years ago in #961 and is not done yet. - I now think this would be best implemented by optionally allowing trio.lowlevel.wait_task_reschedule() to still call the abort function, and therefore end up raising trio.Cancelled, even if trio.lowlevel.reschedule() has been called. I plan to add a comment with further details but that's the key idea. In the case of trio.Event, aside from passing a parameter to enable that behaviour, no other code changes whatsoever would be needed to change ("fix") its behaviour. For other synchronisation primitives (e.g., memory channels), it's a different matter. (To be continued...)

Original issue text

Show original issue text

Short summary

My reading of the documentation is that a await ... should always end in a Cancelled being raised if it's in an enclosing cancel scope that is cancelled, but that is not actually always the case in practice. This is perhaps a design decision – should that rule always be enforced, or is it allowed for a cancellation request to take a bit of scheduler delay to implement. If the decision has already been made (or we make that decision now) then this may just end up being a documentation issue (or maybe I'm just reading too much into it), but even if so then it could do with clarifying.

Actual behaviour

In reality, it is possible for a task to wake from an await without exception even if it is in a scope that has been explicitly cancelled.

This gist gives a complete example.

It involves running two tasks in a nursery:

  • One sends a few values to a memory channel (with send_nowait()) then immediately cancels the cancel scope for the nursery.
  • The other awaits receive() on the channel in a loop.

When I run it, the receive task gets one value from the memory channel before getting the cancellation exception.

Documentation

The documentation says that, at each checkpoint, Trio checks whether the task is cancelled:

Neither is really explicit about when within the checkpoint the cancellation state is checked. However, my subjective reading of them (especially the first one) is that is that they strongly imply that it is impossible for await trio.foo() to finish with anything other than a trio.Cancelled if it is in a cancelled scope (barring usual shielding rules) when it completes.

Most trio operations are conveniently in two parts: one that waits for the operation to become possible (e.g., socket read available, memory channel non-empty) and then one that synchronously performs that operation. So it certainly seems possible for the cancellation check to happen at the end of the wait step, in addition to before and (effectively) during.

The scheduling point would also happen alongside the wait, even if the condition is already true (and remains true). So, even if there is no actual wait, then it's possible that a cancellation would come into effect during the course of the await call, and the position of the cancellation check(s) still matters.

Side note about asyncio

Just for curiosity, I tried analogous code with asyncio, using an asyncio.TaskGroup in place of a nursery and simulating a cancel by raising an exception from one of the tasks. It appears that asyncio.Queue.get() is not a scheduling point, because a receive loop will happily pop 100s of values off of it in a loop after the task that send them has long finished with an exception (but a small perturbation to the code will stop cause it to be cancelled before getting any).

@A5rocks
Copy link
Contributor

A5rocks commented Mar 14, 2024

For the record, not all await calls are checkpoints (though every trio function does guarantee a checkpoint call, so that's irrelevant here).

I think the waiting itself will be cancellable, so it's not a problem of putting a cancel point right after waiting either. Maybe my impression is wrong, but I think this is a case where things seem fine.

Specifically, I think send_nowait reschedules the task, which since it happens first means that there is technically no "waiting" to be cancelled at that moment. Moving the cancel scope first should (I think?) cause what you want and it's worth noting that cancelling will happen the moment you try to get the next message (literally first call inside receive() is to throw cancelled) regardless of whether there is one or not, or call any other async function.

It might be worth clarifying in the docs that a "checkpoint" is specifically something that yields back into trio (AFAIK), so like, just making an async function doesn't necessarily mean it has one.

@arthur-tacca
Copy link
Contributor Author

For the record, not all await calls are checkpoints (though every trio function does guarantee a checkpoint call, so that's irrelevant here). ... It might be worth clarifying in the docs that a "checkpoint" is specifically something that yields back into trio (AFAIK), so like, just making an async function doesn't necessarily mean it has one.

Oops, I did only mean trio calls, that was just a slip sorry (and when I repeated the claim later I did say trio.foo()). Sorry for the distraction.

Moving the cancel scope first should (I think?) cause what you want

I don't have an application with this behaviour that I want to fix particularly. It's just I think that the docs should clarify that this could happen, or we should consider that we never want it to happen and then carefully check/fix all Trio functions (with a cancellation check after the wait/schedule/whatever is relevant).

@richardsheridan
Copy link
Contributor

richardsheridan commented Mar 15, 2024

However, my subjective reading of them (especially the first one) is that is that they strongly imply that it is impossible for await trio.foo() to finish with anything other than a trio.Cancelled if it is in a cancelled scope (barring usual shielding rules) when it completes.

The reality is that only trio.lowlevel.checkpoint and trio.lowlevel.checkpoint_if_cancelled necessarily have the behavior you describe. Other apis are a composite behavior that exists on a spectrum from always raising Cancelled immediately to under no circumstances raising Cancelled. (and I have used await wait_task_rescheduled(lambda _: Abort.FAILED) before...) For example, trio.to_thread.run_sync has a complex behavior with subtle edge cases.

This is why the docs are not explicit about why the docs are not explicit about when it checks... it varies! And it should be kept an internal detail unless a specific need arises so that we retain flexibility for refactoring.

Now what I would like to see when documenting cancellation is some distinction between cases where seeing Cancelled means the operation was unwound "as if it never happened" (the normal and "ideal" case) and cases where there's some kind of side effect left over (abandoned threads, temporary files, partially written data). One case I find irksome is various "aclose" methods that raise Cancelled after they've successfully closed the underlying thing. Some cases it makes sense and means "I didn't get the expected ACK from the other end yet" but on a MemoryReceiveChannel it's only there to "be a cancel point" which doesn't seem so valuable.

@arthur-tacca
Copy link
Contributor Author

@richardsheridan Let me get a few things out of the way before getting to the main point:

  • "Other apis ... under no circumstances raising Cancelled." No, this is explicitly not allowed by Trio's checkpoint rules. OK, if you pass in totally invalid arguments, e.g. passing a non-awaitable to Nursery.start_soon(), then it may raise another exception before doing a cancellation check, but that's hardly worth mentioning. That aside, at the very least, any await trio.foo() call must check for cancellation state that already exists before entering the call.
  • "For example, trio.to_thread.run_sync has a complex behavior with subtle edge cases." Your link is to trio.from_thread.run (note from_thread not to_thread), which is a sync function (it is not await trio.foo()) so doesn't count under these rules. It's also running an async function in a sync context so is bound to have some odd rules about cancellation.
  • "One case I find irksome is various "aclose" methods ... but on a MemoryReceiveChannel it's only there to "be a cancel point" which doesn't seem so valuable." This is off-topic but I think the main reason is so that MemoryReceiveChannel satisfies the ABC trio.ReceiveChannel so that you could potentially use it via that interface and, in future, switch to an alternative channel type that is not in-process (not that Trio has any implementation of those at the moment).

OK, to the main point:

The reality is that only trio.lowlevel.checkpoint and trio.lowlevel.checkpoint_if_cancelled necessarily have the behavior you describe.

My original example already makes it clear that not all Trio APIs have the behaviour I describe. The question is actually: could all Trio APIs have that behaviour? Would we want it? I don't think that is so clear.

Operations that have (potential) side effects throughout their duration could have a rule: always raise Cancelled if there's cancellation (even right at the end), but some or all of the side effects could happen. Examples include trio.to_thread.run_sync (it would be a hybrid of the two cancellation options it already has) and also trio.run_process and socket sendall (which already work this way except I don't know if they have a last-moment check for cancellation). These are actually the easiest APIs to add this rule to, because the caller already has to be prepared for there to be side effects if there's cancellation.

It would also be very easy to add this rule for operations that wait for a condition and then atomically mutate at the last moment, like I discussed in my earlier comment. This includes reading from a memory channel, as in my original example. I believe that reading from a socket also falls into this category in both select and IOCP implementations, but I don't believe it is guaranteed in principle (the original IOCP implementation did an async read).

The only API where I think this rule would be troublesome is where you wait to be able to perform a mutation, and then do it, but the mutation is not quite atomic, but perhaps involves a callback that will happen in just a moment (never waits indefinitely). Currently, you could give the strong cancellation guarantee (never has side effects if Cancelled thrown), but if you were required to add a last-moment cancellation check you would lose that. But do any such APIs actually exist in Trio?

This is why the docs are not explicit about why the docs are not explicit about when it checks... it varies! And it should be kept an internal detail unless a specific need arises so that we retain flexibility for refactoring.

This sounds like a very general statement, but it's actually not: the only aspect currently not documented is the last-moment check we're talking about now.

As I've already said, a check at the start is always required (although it's allowed to have side-effects, as you mentioned for aclose()). Then the check during the duration of a wait is implicit – if it's not possible (e.g. trio.to_thread) you better believe that it should be documented! So that only leaves the check at the end that is currently not documented.

At the very least, I think the general docs should be updated to note that checkpoints may return without raising Cancelled if the cancellation state takes effect during the call to the function.

Now what I would like to see when documenting cancellation is some distinction between cases where seeing Cancelled means the operation was unwound "as if it never happened" (the normal and "ideal" case) and cases where there's some kind of side effect left over (abandoned threads, temporary files, partially written data).

Stealing from C++'s terminology of "strong exception guarantee" vs "weak exception guarantee", I would call these cases strong cancellation vs weak cancellation. Agreed they should definitely be documented. In fact, I think they probably normally are, but no doubt there are gaps. Certainly, I don't see anything about this for SocketStream.receive_some(), but as I noted above I'm not sure that it's guaranteed to have strong cancellation, it just happens all current implementations do. I wonder if it's for a similar reason to MemoryReceiveChannel.aclose() – the idea might be to make us only rely on a fairly abstract general interface (e.g. ReceiveStream instead of SocketStream specifically) so that it's easier to switch to an alternative implementation later (e.g., SSLStream).

@Zac-HD
Copy link
Member

Zac-HD commented Mar 28, 2024

Here's the current state of affairs:

import trio

async def no_internal_checkpoints():
    pass

@trio.run
async def loops_forever():
    with trio.fail_after(0.1):  # does not fire, due to lack of checkpoints!
        while True:
            await no_internal_checkpoints()

...which is why https://github.com/python-trio/flake8-async provides the ASYNC910 rule (exit from async function with no guaranteed checkpoint or exception), as well as ASYNC100 (cancel scope with no await). However the former is optional and the latter doesn't warn about this code.


We can't introduce an implicit checkpoint on the way in to an await, because that would make the 'clean up even if cancelled' semantics of trio.abc.AsyncResource.aclose() impossible to implement.

Implicitly checkpointing on return seems semantically acceptable, but is infeasible to implement - you'd have to hijack the function-return logic in the interpreter to inject our library code. So here we are!

I'd be happy to see a PR for some more docs on this topic, but I don't think there's anything else we can do.

@arthur-tacca
Copy link
Contributor Author

That is unrelated to this issue.

Your point is that an arbitrary async function might not include a checkpoint. That is true.

But this issue is exclusively about async functions in the trio namespace. Those actually are always checkpoints. That is guaranteed, not because of some language level feature (as you say, it is possible to write an async function that is not a checkpoint), but because that is the contract they provide and they are carefully implemented to satisfy that contract.

However, they are only guaranteed to be checkpoints on the "way in". That means they may complete successfully even if they are in a scope that is cancelled when they finish.

My point is that:

  • In some cases, they could be adjusted to check for cancellation on the "way out" without affecting any other aspect of their semantics. But is it possible for all async functions in the trio namespace? I'm not sure but haven't seen any solid counter example.
  • If it is always possible, should trio add this guarantee and change all the relevant functions?
  • If not possible, or just not desired, I think the cancellation docs should be updated to clarify this corner case. In my opinion, the current docs make it sounds like this is currently guaranteed, so this is more than just a gap.

@Zac-HD
Copy link
Member

Zac-HD commented Mar 29, 2024

I don't think we can guarantee checkpoint on exit - only make it somewhat more likely:

async def some_fn(url):
    response = await do_some_io()
    # what we if get cancelled some time after this point?
    x = process_result(response)  
    return x

async def some_fn_with_checkpoint_on_exit(url):
    response = await do_some_io()
    x = process_result(response)
    await trio.lowlevel.checkpoint()
    # adding a checkpoint just moves that problem to here.
    return x

and I really don't think that's worth the code complexity when it's still not guaranteed.

If you have a suggested clarifying change to the docs I would be very happy to get that merged though!


But this issue is exclusively about async functions in the trio namespace. Those actually are always checkpoints. That is guaranteed, not because of some language level feature (as you say, it is possible to write an async function that is not a checkpoint), but because that is the contract they provide and they are carefully implemented to satisfy that contract.

Strictly speaking the invariant is that they either checkpoint, or raise an exception. You can therefore livelock with e.g.:

import contextlib, trio

@trio.run
async def livelock():
    # can't cancel because the loop body raises (and catches) an exception without checkpointing
    with trio.move_on_after(0.1):  
        while True:
            with contextlib.suppress(TypeError):
                await trio.sleep("1")  # raises an exception before checkpointing

and while I don't like this code, it doesn't violate any of Trio's invariants.

@arthur-tacca
Copy link
Contributor Author

@Zac-HD Are you saying that having await trio.lowlevel.checkpoint() after all other awaits in a function still doesn't guarantee that it will see any cancellations that happen in its lifetime? My impression is that it would, because the implementation appears to do the cancellation check after the schedule point, although you'd be better off using checkpoint_if_cancelled() to avoid a schedule point if there is no cancellation. (I'm not counting a cancellation scope with an impending deadline, which of course might become cancelled between the function returning and the caller checking it. That doesn't affect the race I describe in the issue) And it would even better if the cancellation check was sync coloured (#961) to avoid a schedule point if it does trigger.

But your snippet isn't what I was asking for anyway: it would change the semantics, e.g. for a memory channel it would already have popped the item off and then silently discard it. Maybe I'm taking you too literally. But my original point is that it seemed like you could avoid changing semantics on more cases than you might expect (e.g. socket reads are actually wait for data to become available + do sync read, so you could put your final cancel check in between).

Anyway, I've come to accept that there probably places where you really couldn't preserve semantics with my request (perhaps IOCP or SSL as I said earlier). So I've updated the original issue to request this behaviour only on particular cases where it can be guaranteed, and to update the documentation on the other cases.

(I get you point about cancellation not being checked if there's an exception; I mentioned that in an earlier comment and I don't think it affects my point.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants