100% CPU load after cancel #695

Open
2 tasks done
Ratio2 opened this issue Mar 7, 2024 · 11 comments
Labels
bug Something isn't working

Comments

@Ratio2

Ratio2 commented Mar 7, 2024

Things to check first

  • I have searched the existing issues and didn't find my bug already reported there

  • I have checked that my bug is still present in the latest release

AnyIO version

4.3.0

Python version

3.9.2, 3.12.1

What happened?

After Ctrl+C, the program uses 100% CPU.
It looks like the problem is in this call (plain call_soon works without problems):

self._loop.call_soon_threadsafe(lambda: None)

https://github.com/python/cpython/blob/72dbea28cd3fce6fc457aaec2107a8e453073297/Lib/asyncio/base_events.py#L871

How can we reproduce the bug?

#!/usr/bin/env python3
from anyio import CancelScope, create_task_group, run, sleep


async def shield_task() -> None:
    with CancelScope(shield=True):
        await sleep(60)


async def task() -> None:
    async with create_task_group() as tg:
        tg.start_soon(shield_task)


async def main() -> None:
    async with create_task_group() as tg:
        tg.start_soon(task)


if __name__ == '__main__':
    run(main)

Ctrl+C

@Ratio2 added the bug label Mar 7, 2024
@Ratio2
Author

Ratio2 commented Mar 7, 2024

Infinite call recursion in

def _deliver_cancellation(self, origin: CancelScope) -> bool:

@Ratio2
Author

Ratio2 commented Mar 11, 2024

Without SIGINT:

#!/usr/bin/env python3
from anyio import CancelScope, create_task_group, run, sleep


async def shield_task() -> None:
    with CancelScope(shield=True):
        await sleep(60)


async def task() -> None:
    async with create_task_group() as tg:
        tg.start_soon(shield_task)


async def main() -> None:
    async with create_task_group() as tg:
        tg.start_soon(task)
        tg.cancel_scope.cancel()


run(main)

@Ratio2 changed the title from "100% CPU load after SIGINT" to "100% CPU load after cancel" Mar 11, 2024
@agronholm
Owner

I can repro this too. But infinite call recursion? How do you figure that?

@agronholm
Owner

Yeah, not infinite call recursion. The top-level cancel scope continuously retries cancellation because it only sees that its immediate child task (task()) has an unshielded cancel scope.
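
To spell out what I mean (a simplified sketch only, not AnyIO's actual _deliver_cancellation): each delivery pass re-cancels every task in the scope that hasn't exited yet and, as long as anything is still pending, schedules another pass on the next loop iteration. A task that is stuck waiting on a shielded child never exits, so the passes never stop:

import asyncio


def deliver_cancellation(loop: asyncio.AbstractEventLoop, tasks: list[asyncio.Task]) -> None:
    # Simplified illustration, not AnyIO's real code.
    pending = [task for task in tasks if not task.done()]
    for task in pending:
        task.cancel()  # re-delivered on every pass until the task actually exits
    if pending:
        # A task waiting on a shielded child never exits, so this reschedules
        # itself on every loop iteration, which is what pins the CPU at 100%.
        loop.call_soon(deliver_cancellation, loop, tasks)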

@agronholm
Owner

You can see the same behavior by modifying the main() function:

from anyio import wait_all_tasks_blocked


async def main() -> None:
    async with create_task_group() as tg:
        tg.start_soon(task)
        await wait_all_tasks_blocked()
        tg.cancel_scope.cancel()

@agronholm
Owner

The trick is, I suppose, how to make it figure out that it shouldn't try to cancel the middle task, which is waiting on a task that sits in a shielded scope.

agronholm added a commit that referenced this issue Apr 4, 2024
@VivaLaPanda

Has anyone found a fix for this? We ran into it and it's killing our services. Currently looking at moving to Trio to avoid it

@agronholm
Owner

Can you describe your use case where it's doing this?

@VivaLaPanda

VivaLaPanda commented Apr 18, 2024

It's in a fairly complex Starlette app, so it's hard to point to one thing, but it seems to be happening for us with HTTPX cancellations (among potentially other things) inside of other cancel scopes. We noticed services in our cluster getting pinned at 100% CPU usage and investigated. After a lot of digging, we realized that even after all open requests had closed, there were still tasks in the event loop that should already have been cancelled. They weren't things we expected to use much (or any) CPU. In all the cases we reproduced, they were HTTP calls. However, that's the main thing the service was doing in our reproduction, so it's possible other code paths that get cancelled or time out would have similar issues.

It mainly seems to happen when the event loop is overloaded, so in our specific case some kind of race condition around cancellation possibly triggers getting into this state, but the end state looks the same as this issue (an orphaned, cancelling task that consumes all of the CPU).
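
For reference, the rough shape of the pattern I mean, as a hypothetical and heavily simplified sketch (not our actual code; the URL and timeout value are made up): an outer timeout scope wrapping an HTTPX call, which, as far as I understand, uses AnyIO internally via httpcore:

from typing import Optional

import anyio
import httpx


async def fetch_with_timeout(url: str) -> Optional[str]:
    # Outer cancel scope (a timeout) around an HTTPX request; the HTTP stack
    # runs its own cancel scopes underneath, so cancellation can land while
    # we are nested inside them.
    with anyio.move_on_after(5):
        async with httpx.AsyncClient() as client:
            response = await client.get(url)
            return response.text
    return None  # the outer scope cancelled the request (timed out)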

@agronholm
Owner

I'll try to get a fix for this into the next release, but I have to say it's a pretty tricky one to fix.

@agronholm
Owner

My best attempt at fixing this involved shielding the part of TaskGroup.__aexit__() where it waits for child tasks to exit, but that caused other tests to fail, as cancellation no longer seemed to propagate. I'm still getting to the bottom of the issue.
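
To illustrate the idea only (a minimal sketch of the approach, not the actual change to TaskGroup.__aexit__): the wait for still-running children gets wrapped in a shielded scope, so the host task no longer looks cancellable to an outer scope while it waits:

import anyio


async def wait_for_children(children_done: anyio.Event) -> None:
    # Minimal sketch of the approach described above, not AnyIO's real
    # __aexit__: shield the wait so outer scopes stop re-delivering
    # cancellation to this task while its children finish up.
    with anyio.CancelScope(shield=True):
        await children_done.wait()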
