
kubernetes worker pods restart forever #601

Open
benclifford opened this issue Oct 1, 2021 · 3 comments
Labels
bug Something isn't working

Comments

@benclifford
Contributor

Describe the bug
Kubernetes worker pods, perhaps only ones which did not register properly, accumulate forever.
The worker process exits "normally", with no indication of an error in kubectl logs, but Kubernetes immediately restarts the worker (which then fails again).
Nothing causes these workers to go away.

For example, here is one that has been restarted 1607 times since it was initially launched over two weeks ago.

root@amber:~# minikube kubectl get pods
NAME                                            READY   STATUS             RESTARTS           AGE
funcx-1632329996841                             0/1     CrashLoopBackOff   1607 (4m31s ago)   8d
[...]
Collecting funcx-endpoint>=0.2.0
  Downloading funcx_endpoint-0.3.3-py3-none-any.whl (91 kB)

...

Installing collected packages: pycparser, cffi, zipp, urllib3, typing-extensions, six, pyjwt, idna, cryptography, charset-normalizer, certifi, requests, pynacl, importlib-metadata, bcrypt, typeguard, tblib, pyzmq, pyrsistent, pyparsing, psutil, paramiko, lockfile, globus-sdk, docutils, dill, click, attrs, websockets, typer, texttable, python-daemon, py, parsl, packaging, jsonschema, fair-research-login, decorator, configobj, retry, funcx, funcx-endpoint
Successfully installed attrs-21.2.0 bcrypt-3.2.0 certifi-2021.5.30 cffi-1.14.6 charset-normalizer-2.0.6 click-8.0.1 configobj-5.0.6 cryptography-35.0.0 decorator-5.1.0 dill-0.3.4 docutils-0.17.1 fair-research-login-0.2.3 funcx-0.3.3 funcx-endpoint-0.3.3 globus-sdk-2.0.1 idna-3.2 importlib-metadata-4.8.1 jsonschema-4.0.1 lockfile-0.12.2 packaging-21.0 paramiko-2.7.2 parsl-1.1.0 psutil-5.8.0 py-1.10.0 pycparser-2.20 pyjwt-1.7.1 pynacl-1.4.0 pyparsing-2.4.7 pyrsistent-0.18.0 python-daemon-2.3.0 pyzmq-22.3.0 requests-2.26.0 retry-0.9.2 six-1.16.0 tblib-1.7.0 texttable-1.6.4 typeguard-2.12.1 typer-0.4.0 typing-extensions-3.10.0.2 urllib3-1.26.7 websockets-9.1 zipp-3.6.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
PROCESS_WORKER_POOL main event loop exiting normally
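
If the pod spec does not set restartPolicy, Kubernetes defaults it to Always, which would be consistent with these endless restarts; a quick (hypothetical) check on one of the crash-looping pods, using the pod name from the listing above:

# Print the restartPolicy of the crash-looping worker pod; when the field is
# not set explicitly, Kubernetes defaults it to Always, so a failing worker
# is restarted indefinitely.
kubectl get pod funcx-1632329996841 -o jsonpath='{.spec.restartPolicy}'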

Cross-ref parsl issue Parsl/parsl#2132 -- it is possible, even likely, that the parsl Kubernetes code has this same behaviour.

To Reproduce
Start a worker with the Python versions incorrectly configured, so that the worker fails to register and exits.

Expected behavior
Broken worker pods should not accumulate without bound.

Environment
My minikube environment on Ubuntu.

benclifford added the bug label Oct 1, 2021
@benclifford
Contributor Author

benclifford commented Feb 1, 2022

In the parsl model of execution, a worker that fails should go away - the LRM/provider layer should not be the thing restarting workers. Instead, the feedback loop goes through the parsl scaling strategy layer, which decides whether to start replacement workers when there is still pressure for that many workers to exist.

Perhaps funcx's fork of htex preserves that model - I'm unsure.

In the Kubernetes context, maybe that means worker pods should have restartPolicy: Never, so that they go away on failure.

This would continue the change introduced in parsl in Parsl/parsl#1073, which moved worker management from deployments to plain pods, shifting more of the management into the strategy code and away from Kubernetes, where it is not handled correctly.
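
As a rough sketch of that restartPolicy: Never suggestion (the pod name, image and command below are placeholders, not the actual funcx worker launch line):

# Illustrative only: a pod started this way is not restarted by Kubernetes
# when its process exits, so a broken worker simply terminates and the
# strategy layer can decide whether to launch a replacement.
kubectl run funcx-worker-example \
  --image=python:3.9-slim \
  --restart=Never \
  -- python -c "print('worker process would run here')"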

@benclifford
Contributor Author

In addition to my initial comment about these being pods that "did not register properly", I am also seeing this when killing and restarting an endpoint with "k8 delete pod funcx-endpoint-....": the worker pods managed by that endpoint are left behind forever.
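
Until that is fixed, the leftover pods have to be cleaned up by hand; a hypothetical sketch, assuming the worker pods all follow the funcx-<number> naming seen in the listing above (which excludes the funcx-endpoint-... pod itself):

# Hypothetical cleanup: delete leftover worker pods whose names match the
# funcx-<number> pattern, leaving the funcx-endpoint-... pod alone.
kubectl get pods --no-headers -o custom-columns=:metadata.name \
  | grep -E '^funcx-[0-9]+$' \
  | xargs -r kubectl delete pod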

@benclifford
Contributor Author

Cross-ref parsl issue Parsl/parsl#2199 - that issue is exacerbated by this one.
