Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add worker id to on_thread_park() callback (for stuck worker watchdog) #6354

Conversation

theron-eg
Copy link

It's possible to find out from WorkerMetrics which workers are not actively polling for tasks. However it's not possible to tell if non-polling workers are merely idle (parked) or if they are stuck. By adding the worker id (same usize as used in WorkerMetrics calls) to the on_thread_park() and on_thread_unpark() callbacks it is possible to track which specific workers are parked. Then any worker that is not polling tasks and is not parked is a worker that is stuck.

With this information it's possible to create a watchdog that alerts or kills the process if a worker is stuck for too long.

Fixes #6353

Motivation

If a task gets stuck and doesn't yield back to the tokio runtime, other tasks can be starved (even if other workers are idle). I would like a watchdog that can detect when a task is stuck, and either alert or kill the process.

Using WorkerMetrics and worker_poll_count() it's possible to determine if a worker is not polling for new tasks. However it's not possible to tell if that worker is merely parked, or stuck.

Solution

Include a worker id in the on_thread_park() and on_thread_unpark() callbacks so that the idle / busy status of each worker can be inferred. Then it's possible for a watchdog thread to monitor worker_poll_count() and the parked/unparked status of each worker, and do something about it.

It's possible to find out from WorkerMetrics which workers are not
actively polling for tasks.  However it's not possible to tell if
non-polling workers are merely idle (parked) or if they are stuck.
By adding the worker id (same usize as used in WorkerMetrics calls)
to the on_thread_park() and on_thread_unpark() callbacks it is
possible to track which specific workers are parked.  Then any
worker that is not polling tasks and is not parked is a worker
that is stuck.

With this information it's possible to create a watchdog that alerts
or kills the process if a worker is stuck for too long.
@github-actions github-actions bot added R-loom-current-thread Run loom current-thread tests on this PR R-loom-multi-thread Run loom multi-thread tests on this PR R-loom-multi-thread-alt Run loom multi-thread alt tests on this PR labels Feb 16, 2024
@@ -684,7 +688,7 @@ impl Context {
/// after all the IOs get dispatched
fn park(&self, mut core: Box<Core>) -> Box<Core> {
if let Some(f) = &self.worker.handle.shared.config.before_park {
Copy link
Author

@theron-eg theron-eg Feb 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems better to me if this check were inside the if core.transition_to_parked() check below, to avoid generating spurious callbacks. However it looks like it might be this way for tracing? Since I don't understand the details I just left it as is.

@maminrayej maminrayej added A-tokio Area: The main tokio crate M-runtime Module: tokio/runtime labels Feb 16, 2024
Comment on lines 639 to 642
pub fn on_thread_park<F>(&mut self, f: F) -> &mut Self
where
F: Fn() + Send + Sync + 'static,
F: Fn(usize) + Send + Sync + 'static,
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a breaking change. We can't do that.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Darksonn It's a change to an API that is available only when --cfg tokio_unstable is used. I assumed those APIs were subject to modification. Am I incorrect?

If that's not the case I can certainly add a on_thread_park2() method and deprecate the old one (since the new method would have all of the functionality of the old one). Or how would you suggest evolving the API?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right that we can change --cfg tokio_unstable methods, but the on_thread_park method is stable and accessible without --cfg tokio_unstable.

@Darksonn Darksonn closed this Feb 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-tokio Area: The main tokio crate M-runtime Module: tokio/runtime R-loom-current-thread Run loom current-thread tests on this PR R-loom-multi-thread Run loom multi-thread tests on this PR R-loom-multi-thread-alt Run loom multi-thread alt tests on this PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add worker id to on_thread_park() and on_thread_unpark() callbacks (for stuck worker watchdog)
3 participants