
endpoint does not recover from lost workers #704

Open
benclifford opened this issue Feb 28, 2022 · 1 comment
Labels
bug Something isn't working

Comments

@benclifford
Contributor

Describe the bug
A few days ago, one report above this one, someone described suspicious behaviour in the system's ability to recover from endpoint worker processes dying.
I have broadly recreated this by manually killing a worker process in my kubernetes dev environment. The killed worker sits in Z (zombie) state, which means the process that launched it has not yet retrieved its exit code (see the short illustration at the end of this description).
A task subsequently launched against this endpoint goes as far as this:
Task is pending due to waiting-for-launch
and then nothing further happens.
Eventually I killed the worker pod, and that progressed the task into this failure from get_result: Serialization Error during: Task's exception object deserialization
I'll note that the parsl fork of htex seems able to detect worker loss with: parsl.executors.high_throughput.errors.WorkerLost: Task failure due to loss of worker 0 on host parsl-dev-3-9-5568

I think that in the case of a multi-worker endpoint, with users assuming that funcx "hangs sometimes, I'll just retry", this would manifest as a performance problem rather than an ongoing hang: one worker that has vanished or hung, blocked on an abandoned task, while the other workers continue to perform work, means subsequent work still proceeds, just at a reduced pace. As long as at least one worker is left, hung or missing workers manifest as a performance reduction plus one hung task per lost worker.
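For anyone unfamiliar with the Z state mentioned above, here is a minimal illustration (plain Python, not funcX code): a child process that exits but is never wait()ed on by its parent is kept by the kernel as a zombie until the parent collects its exit status.

```python
import subprocess
import time

# Spawn a child that exits immediately, then never reap it.
child = subprocess.Popen(["python3", "-c", "import sys; sys.exit(1)"])
time.sleep(2)

# At this point `ps -o pid,stat,comm --ppid <this pid>` shows the child
# in Z (zombie) state: the kernel is holding its exit code for us.
# Only an explicit wait()/poll() reaps it:
# child.wait()
```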

To Reproduce
This was easily reproducible on my kubernetes dev cluster by putting a sys.exit into a funcx function and invoking it.
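For reference, a minimal sketch of that reproduction (SDK call names are from the funcx client as I recall it in early 2022, and the endpoint UUID is a placeholder, so treat this as approximate):

```python
import sys

from funcx import FuncXClient  # funcx SDK


def die():
    # Simulate a lost worker: the worker process exits before it can
    # report a result back to the manager.
    sys.exit(1)


fxc = FuncXClient()
func_id = fxc.register_function(die)
task_id = fxc.run(endpoint_id="<my-kubernetes-endpoint-uuid>", function_id=func_id)
# fxc.get_result(task_id) never completes; subsequent tasks sent to the
# same endpoint stall at "waiting-for-launch", as described above.
```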

Expected behavior
Disappeared workers should restart, or some other recovery behaviour

Environment
my kubernetes dev cluster; main branch of everything as of 2022-02-28

benclifford added the bug label on Feb 28, 2022
@WardLT
Contributor

WardLT commented Mar 18, 2022

I see this too, and it tends to happen when my workers don't exit within one second.

It looks like if the worker doesn't close within 1s, its return code is never collected: https://github.com/funcx-faas/funcX/blob/main/funcx_endpoint/funcx_endpoint/executors/high_throughput/funcx_manager.py#L610
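If it helps, here's a minimal sketch (not the funcx_manager code itself) of the pattern that linked line describes: the manager joins the worker with a short timeout, and if the worker hasn't exited within that window its return code is never collected, so a later crash leaves a zombie and goes unnoticed.

```python
import multiprocessing
import time


def worker():
    # Worker that takes longer than the manager's grace period to exit.
    time.sleep(5)


proc = multiprocessing.Process(target=worker)
proc.start()

proc.join(timeout=1)  # manager only waits one second...
if proc.is_alive():
    # ...and then moves on without ever calling join() again.
    # When the worker eventually exits (or is killed), its exit code is
    # never reaped and the manager has no idea the worker is gone.
    pass
```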
