
endpoint does not recover from lost workers #704

Open
benclifford opened this issue Feb 28, 2022 · 1 comment
Labels
bug Something isn't working

Comments

@benclifford
Contributor

Describe the bug
A few days ago, one report above this one, someone described suspicious behaviour in the system's ability to recover from endpoint worker processes dying.
I have broadly recreated this by manually killing a worker process in my kubernetes dev environment. The killed worker sits in Z (zombie) state, which means the process that launched it has not yet retrieved its exit code (see the short illustration at the end of this description).
A task subsequently launched against this endpoint goes as far as this:
Task is pending due to waiting-for-launch
and then nothing further happens.
Eventually I killed the worker pod, and that progressed the task into this failure from get_result: Serialization Error during: Task's exception object deserialization
I'll note that the parsl fork of htex seems able to detect worker loss with: parsl.executors.high_throughput.errors.WorkerLost: Task failure due to loss of worker 0 on host parsl-dev-3-9-5568

I think that in the case of a multi-worker endpoint, with users assuming that funcx "hangs sometimes, I'll just retry", this would manifest as a performance problem rather than an ongoing hang: one worker that has vanished or hung, blocked on an abandoned task, while the other workers continue to perform work, means subsequent work still proceeds, just at a reduced pace. As long as at least one worker is left, hung or missing workers manifest as a performance reduction plus one hung task per lost worker.
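For anyone unfamiliar with the Z state mentioned above, here is a minimal illustration (plain Python, not funcX code): a child process that exits but is never wait()ed on by its parent is kept by the kernel as a zombie until the parent collects its exit status.

```python
import subprocess
import time

# Spawn a child that exits immediately, then never reap it.
child = subprocess.Popen(["python3", "-c", "import sys; sys.exit(1)"])
time.sleep(2)

# At this point `ps -o pid,stat,comm --ppid <this pid>` shows the child
# in Z (zombie) state: the kernel is holding its exit code for us.
# Only an explicit wait()/poll() reaps it:
# child.wait()
```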

To Reproduce
This was easily reproducible on my kubernetes dev cluster by putting a sys.exit into a funcx function and invoking it.
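For reference, a minimal sketch of that reproduction (SDK call names are from the funcx client as I recall it in early 2022, and the endpoint UUID is a placeholder, so treat this as approximate):

```python
import sys

from funcx import FuncXClient  # funcx SDK


def die():
    # Simulate a lost worker: the worker process exits before it can
    # report a result back to the manager.
    sys.exit(1)


fxc = FuncXClient()
func_id = fxc.register_function(die)
task_id = fxc.run(endpoint_id="<my-kubernetes-endpoint-uuid>", function_id=func_id)
# fxc.get_result(task_id) never completes; subsequent tasks sent to the
# same endpoint stall at "waiting-for-launch", as described above.
```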

Expected behavior
Disappeared workers should restart, or some other recovery behaviour

Environment
my kubernetes dev cluster; main branch of everything as of 2022-02-28

benclifford added the bug label on Feb 28, 2022
@WardLT
Contributor

WardLT commented Mar 18, 2022

I see this too, and it tends to happen when my workers don't exit within one second.

It looks like if the worker doesn't close within 1s, its return code is never collected: https://github.com/funcx-faas/funcX/blob/main/funcx_endpoint/funcx_endpoint/executors/high_throughput/funcx_manager.py#L610
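If it helps, here's a minimal sketch (not the funcx_manager code itself) of the pattern that linked line describes: the manager joins the worker with a short timeout, and if the worker hasn't exited within that window its return code is never collected, so a later crash leaves a zombie and goes unnoticed.

```python
import multiprocessing
import time


def worker():
    # Worker that takes longer than the manager's grace period to exit.
    time.sleep(5)


proc = multiprocessing.Process(target=worker)
proc.start()

proc.join(timeout=1)  # manager only waits one second...
if proc.is_alive():
    # ...and then moves on without ever calling join() again.
    # When the worker eventually exits (or is killed), its exit code is
    # never reaped and the manager has no idea the worker is gone.
    pass
```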
