100% CPU load after cancel #695

Open
2 tasks done
Ratio2 opened this issue Mar 7, 2024 · 11 comments
Labels
bug Something isn't working

Comments

@Ratio2

Ratio2 commented Mar 7, 2024

Things to check first

  • I have searched the existing issues and didn't find my bug already reported there

  • I have checked that my bug is still present in the latest release

AnyIO version

4.3.0

Python version

3.9.2, 3.12.1

What happened?

After Ctrl+C, the program uses 100% CPU.
It looks like the problem is in this call (plain call_soon works without problems):

self._loop.call_soon_threadsafe(lambda: None)

https://github.com/python/cpython/blob/72dbea28cd3fce6fc457aaec2107a8e453073297/Lib/asyncio/base_events.py#L871

How can we reproduce the bug?

#!/usr/bin/env python3
from anyio import CancelScope, create_task_group, run, sleep


async def shield_task() -> None:
    with CancelScope(shield=True):
        await sleep(60)


async def task() -> None:
    async with create_task_group() as tg:
        tg.start_soon(shield_task)


async def main() -> None:
    async with create_task_group() as tg:
        tg.start_soon(task)


if __name__ == '__main__':
    run(main)

Ctrl+C

@Ratio2 added the bug label Mar 7, 2024
@Ratio2
Author

Ratio2 commented Mar 7, 2024

Infinite call recursion in

def _deliver_cancellation(self, origin: CancelScope) -> bool:

@Ratio2
Author

Ratio2 commented Mar 11, 2024

Without SIGINT:

#!/usr/bin/env python3
from anyio import CancelScope, create_task_group, run, sleep


async def shield_task() -> None:
    with CancelScope(shield=True):
        await sleep(60)


async def task() -> None:
    async with create_task_group() as tg:
        tg.start_soon(shield_task)


async def main() -> None:
    async with create_task_group() as tg:
        tg.start_soon(task)
        tg.cancel_scope.cancel()


run(main)

@Ratio2 changed the title from "100% CPU load after SIGINT" to "100% CPU load after cancel" Mar 11, 2024
@agronholm
Owner

I can repro this too. But infinite call recursion? How do you figure that?

@agronholm
Owner

Yeah, not infinite call recursion. The top-level cancel scope continuously retries cancellation because it only sees that its immediate child task (task()) has an unshielded cancel scope.
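
To spell out what I mean (a simplified sketch only, not AnyIO's actual _deliver_cancellation): each delivery pass re-cancels every task in the scope that hasn't exited yet and, as long as anything is still pending, schedules another pass on the next loop iteration. A task that is stuck waiting on a shielded child never exits, so the passes never stop:

import asyncio


def deliver_cancellation(loop: asyncio.AbstractEventLoop, tasks: list[asyncio.Task]) -> None:
    # Simplified illustration, not AnyIO's real code.
    pending = [task for task in tasks if not task.done()]
    for task in pending:
        task.cancel()  # re-delivered on every pass until the task actually exits
    if pending:
        # A task waiting on a shielded child never exits, so this reschedules
        # itself on every loop iteration, which is what pins the CPU at 100%.
        loop.call_soon(deliver_cancellation, loop, tasks)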

@agronholm
Owner

You can see the same behavior by modifying the main() function:

from anyio import wait_all_tasks_blocked


async def main() -> None:
    async with create_task_group() as tg:
        tg.start_soon(task)
        await wait_all_tasks_blocked()
        tg.cancel_scope.cancel()

@agronholm
Owner

The trick is, I suppose, how to make it figure out that it shouldn't try to cancel the middle task, which is waiting on a task that sits in a shielded scope.

agronholm added a commit that referenced this issue Apr 4, 2024
@VivaLaPanda

Has anyone found a fix for this? We ran into it and it's killing our services. Currently looking at moving to Trio to avoid it

@agronholm
Owner

Can you describe your use case where it's doing this?

@VivaLaPanda

VivaLaPanda commented Apr 18, 2024

It's in a fairly complex Starlette app, so it's hard to point to one thing, but it seems to be happening for us with HTTPX cancellations (among potentially other things) inside of other cancel scopes. We noticed services in our cluster getting pinned at 100% CPU usage and investigated. After a lot of digging, we realized that even after all open requests had closed, there were still tasks in the event loop that should already have been cancelled. They weren't things we expected to use much (or any) CPU. In all the cases we reproduced, they were HTTP calls. However, that's the main thing the service was doing in our reproduction, so it's possible other code paths that get cancelled or time out would have similar issues.

It mainly seems to happen when the event loop is overloaded, so in our specific case some kind of race condition around cancellation possibly triggers getting into this state, but the end state looks the same as this issue (an orphaned, cancelling task that consumes all of the CPU).
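
For reference, the rough shape of the pattern I mean, as a hypothetical and heavily simplified sketch (not our actual code; the URL and timeout value are made up): an outer timeout scope wrapping an HTTPX call, which, as far as I understand, uses AnyIO internally via httpcore:

from typing import Optional

import anyio
import httpx


async def fetch_with_timeout(url: str) -> Optional[str]:
    # Outer cancel scope (a timeout) around an HTTPX request; the HTTP stack
    # runs its own cancel scopes underneath, so cancellation can land while
    # we are nested inside them.
    with anyio.move_on_after(5):
        async with httpx.AsyncClient() as client:
            response = await client.get(url)
            return response.text
    return None  # the outer scope cancelled the request (timed out)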

@agronholm
Owner

I'll try to get a fix for this into the next release, but I have to say it's a pretty tricky one to fix.

@agronholm
Owner

My best attempt at fixing this involved shielding the part of TaskGroup.__aexit__() where it waits for child tasks to exit, but that caused other tests to fail, as cancellation no longer seemed to propagate. I'm still getting to the bottom of the issue.
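
To illustrate the idea only (a minimal sketch of the approach, not the actual change to TaskGroup.__aexit__): the wait for still-running children gets wrapped in a shielded scope, so the host task no longer looks cancellable to an outer scope while it waits:

import anyio


async def wait_for_children(children_done: anyio.Event) -> None:
    # Minimal sketch of the approach described above, not AnyIO's real
    # __aexit__: shield the wait so outer scopes stop re-delivering
    # cancellation to this task while its children finish up.
    with anyio.CancelScope(shield=True):
        await children_done.wait()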
