[core] Check if a ray task has errored without calling ray.get on it #45229

Open
justinvyu opened this issue May 9, 2024 · 0 comments
Labels
core Issues that should be addressed in Ray Core enhancement Request for new feature and/or capability P0 Issues that should be fixed in short order

Comments

@justinvyu
Contributor

Description

Goal: From a list of ray remote task futures, I want to be able to check if each of these has errored without needing to call ray.get individually on each element.

This feature is offered by similar async execution APIs:
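As one illustration (the choice of API here is an assumption, not taken from the list above), Python's standard-library `concurrent.futures.Future` exposes `exception()`, which returns the raised exception, or `None`, without re-raising it in the caller. The `work` and `find_errors` names below are hypothetical helpers for this sketch:

```python
from concurrent.futures import ThreadPoolExecutor, wait


def work(i: int) -> int:
    """Toy task (hypothetical): fails for odd inputs."""
    if i % 2:
        raise ValueError(f"task {i} failed")
    return i * 10


def find_errors(futures):
    """Return {index: exception} for finished futures that raised."""
    errors = {}
    for idx, fut in enumerate(futures):
        # Only inspect futures that have finished and were not cancelled;
        # exception() on a finished future returns immediately.
        if fut.done() and not fut.cancelled():
            exc = fut.exception(timeout=0)
            if exc is not None:
                errors[idx] = exc
    return errors


with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(work, i) for i in range(4)]
    wait(futures)
    errors = find_errors(futures)

print(sorted(errors))  # indices of the tasks that raised
```

The point of interest is that the caller inspects the failure state directly and never has to wrap a result fetch in `try`/`except`.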

Current workaround

We have a "check for failure" function in Ray Train, which may incur unnecessary overhead because it has to fetch each object just to find out whether the task failed:

for object_ref in finished:
    # Everything in finished has either failed or completed
    # successfully.
    try:
        ray.get(object_ref)
    except RayActorError as exc:
        failed_actor_rank = remote_values.index(object_ref)
        logger.info(f"Worker {failed_actor_rank} has failed.")
        return False, exc
    except Exception as exc:

Use case

I am implementing a control loop that checks the status of some actor tasks every N seconds. I want to learn that an actor task has failed as soon as possible so that I can trigger error handling. This means running an "error check" in a loop with a short wait on each iteration:

while True:
    ready, remaining = ray.wait(tasks, num_returns=len(tasks), timeout=0.01)

    # I want to be able to collect errored tasks without calling ray.get.
    # I want to distinguish successful tasks vs. errored tasks from the output from ray.wait.
    errors = []
    for task in ready:
        try:
            ray.get(task)
        except Exception as e:
            errors.append(e)
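
Until such an API exists, the workaround in the loop above can at least be isolated in one helper. The following is a minimal sketch, not a proposed Ray API: `split_by_outcome` and `fake_get` are hypothetical names, and the getter is passed in as a parameter so the example runs without a cluster. With Ray, the assumption is that one would call `split_by_outcome(ready, ray.get)`. Note that this still fetches every result, which is exactly the overhead this issue asks to avoid.

```python
from typing import Any, Callable, Iterable, List, Tuple


def split_by_outcome(
    ready: Iterable[Any],
    get: Callable[[Any], Any],
) -> Tuple[List[Any], List[Tuple[Any, Exception]]]:
    """Partition finished references into (succeeded, failed).

    `get` is whatever fetches a result and raises on failure
    (assumed to be ray.get when used with Ray).
    """
    succeeded: List[Any] = []
    failed: List[Tuple[Any, Exception]] = []
    for ref in ready:
        try:
            get(ref)  # the result is fetched and discarded
        except Exception as exc:
            failed.append((ref, exc))
        else:
            succeeded.append(ref)
    return succeeded, failed


# Stand-in for a result store, so the sketch runs without Ray.
_results = {"a": 1, "b": RuntimeError("boom"), "c": 3}


def fake_get(ref: str) -> Any:
    """Hypothetical getter: returns the value or raises the stored error."""
    value = _results[ref]
    if isinstance(value, Exception):
        raise value
    return value


succeeded, failed = split_by_outcome(["a", "b", "c"], fake_get)
print(succeeded)
print([(ref, type(exc).__name__) for ref, exc in failed])
```

Keeping the failed reference next to its exception makes it straightforward to map an error back to a worker rank, as the Ray Train snippet does with `remote_values.index(object_ref)`.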

cc: @jjyao @rkooo567

@justinvyu justinvyu added enhancement Request for new feature and/or capability triage Needs triage (eg: priority, bug/not-bug, and owning component) core Issues that should be addressed in Ray Core labels May 9, 2024
@jjyao jjyao added P0 Issues that should be fixed in short order and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels May 13, 2024