
kubernetes worker pods restart forever #601

Open
benclifford opened this issue Oct 1, 2021 · 3 comments
Labels
bug Something isn't working

Comments

@benclifford
Contributor

Describe the bug
Kubernetes worker pods, perhaps only ones which did not register properly, accumulate forever.
The worker process exits "normally", with no indication of an error in kubectl logs, but Kubernetes immediately restarts the worker (which then fails again).
Nothing causes these workers to go away.

For example, here is one that has been restarted 1607 times since it was initially launched over two weeks ago.

root@amber:~# minikube kubectl get pods
NAME                                            READY   STATUS             RESTARTS           AGE
funcx-1632329996841                             0/1     CrashLoopBackOff   1607 (4m31s ago)   8d
[...]
Collecting funcx-endpoint>=0.2.0
  Downloading funcx_endpoint-0.3.3-py3-none-any.whl (91 kB)

...

Installing collected packages: pycparser, cffi, zipp, urllib3, typing-extensions, six, pyjwt, idna, cryptography, charset-normalizer, certifi, requests, pynacl, importlib-metadata, bcrypt, typeguard, tblib, pyzmq, pyrsistent, pyparsing, psutil, paramiko, lockfile, globus-sdk, docutils, dill, click, attrs, websockets, typer, texttable, python-daemon, py, parsl, packaging, jsonschema, fair-research-login, decorator, configobj, retry, funcx, funcx-endpoint
Successfully installed attrs-21.2.0 bcrypt-3.2.0 certifi-2021.5.30 cffi-1.14.6 charset-normalizer-2.0.6 click-8.0.1 configobj-5.0.6 cryptography-35.0.0 decorator-5.1.0 dill-0.3.4 docutils-0.17.1 fair-research-login-0.2.3 funcx-0.3.3 funcx-endpoint-0.3.3 globus-sdk-2.0.1 idna-3.2 importlib-metadata-4.8.1 jsonschema-4.0.1 lockfile-0.12.2 packaging-21.0 paramiko-2.7.2 parsl-1.1.0 psutil-5.8.0 py-1.10.0 pycparser-2.20 pyjwt-1.7.1 pynacl-1.4.0 pyparsing-2.4.7 pyrsistent-0.18.0 python-daemon-2.3.0 pyzmq-22.3.0 requests-2.26.0 retry-0.9.2 six-1.16.0 tblib-1.7.0 texttable-1.6.4 typeguard-2.12.1 typer-0.4.0 typing-extensions-3.10.0.2 urllib3-1.26.7 websockets-9.1 zipp-3.6.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
PROCESS_WORKER_POOL main event loop exiting normally
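
If the pod spec does not set restartPolicy, Kubernetes defaults it to Always, which would be consistent with these endless restarts; a quick (hypothetical) check on one of the crash-looping pods, using the pod name from the listing above:

# Print the restartPolicy of the crash-looping worker pod; when the field is
# not set explicitly, Kubernetes defaults it to Always, so a failing worker
# is restarted indefinitely.
kubectl get pod funcx-1632329996841 -o jsonpath='{.spec.restartPolicy}'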

Cross-ref parsl issue Parsl/parsl#2132 -- it is possible, even likely, that the parsl Kubernetes code has this same behaviour.

To Reproduce
Start a worker with the Python versions incorrectly configured, so that the worker fails to register and exits.

Expected behavior
Broken worker pods should not accumulate without bound.

Environment
My minikube environment on Ubuntu.

benclifford added the bug label Oct 1, 2021
@benclifford
Contributor Author

benclifford commented Feb 1, 2022

In the parsl model of execution, a worker that fails should go away - the LRM/provider layer should not be the thing restarting workers. Instead, the feedback loop goes through the parsl scaling strategy layer, which decides whether to start replacement workers when there is still pressure for that many workers to exist.

Perhaps funcx's fork of htex preserves that model - I'm unsure.

In the Kubernetes context, maybe that means worker pods should have restartPolicy: Never, so that they go away on failure.

This would continue the change introduced in parsl in Parsl/parsl#1073, which moved worker management from deployments to plain pods, shifting more of the management into the strategy code and away from Kubernetes, where it is not handled correctly.
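
As a rough sketch of that restartPolicy: Never suggestion (the pod name, image and command below are placeholders, not the actual funcx worker launch line):

# Illustrative only: a pod started this way is not restarted by Kubernetes
# when its process exits, so a broken worker simply terminates and the
# strategy layer can decide whether to launch a replacement.
kubectl run funcx-worker-example \
  --image=python:3.9-slim \
  --restart=Never \
  -- python -c "print('worker process would run here')"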

@benclifford
Contributor Author

In addition to my initial comment about these being pods that "did not register properly", I am also seeing this when killing and restarting an endpoint with "k8 delete pod funcx-endpoint-....": the worker pods managed by that endpoint are left behind forever.
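
Until that is fixed, the leftover pods have to be cleaned up by hand; a hypothetical sketch, assuming the worker pods all follow the funcx-<number> naming seen in the listing above (which excludes the funcx-endpoint-... pod itself):

# Hypothetical cleanup: delete leftover worker pods whose names match the
# funcx-<number> pattern, leaving the funcx-endpoint-... pod alone.
kubectl get pods --no-headers -o custom-columns=:metadata.name \
  | grep -E '^funcx-[0-9]+$' \
  | xargs -r kubectl delete pod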

@benclifford
Contributor Author

Cross-ref parsl issue Parsl/parsl#2199 - that issue is exacerbated by this one.
