funcx-endpoint can start a duplicate endpoint on multi-login-node systems #581

Open
yadudoc opened this issue Aug 27, 2021 · 0 comments
Labels
bug Something isn't working

yadudoc commented Aug 27, 2021

Describe the bug
funcx-endpoint list checks whether the PID listed in the daemon.pid file is running to determine whether the endpoint is live. On multi-login-node systems this check fails when the endpoint process is running on a different login node, because PIDs are local to each node. As a result, funcx-endpoint list erroneously reports the endpoint as disconnected. When the user then tries to start the endpoint, the CLI wipes the current daemon.pid file in an attempt to clean up and starts a new endpoint with the same endpoint_id, leaving the endpoint in a broken state.
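
To illustrate the failure mode, here is a minimal sketch of a local PID liveness check on a POSIX system. This is not the actual funcx-endpoint code; the function name `pid_is_alive_locally` is hypothetical. The point is only that a check like this can see processes on the current host, so a PID written on loginnode01 will look dead (or match an unrelated process) when probed from loginnode02.

```python
import os


def pid_is_alive_locally(pid: int) -> bool:
    """Return True if a process with this PID exists on the *current* host.

    Hypothetical helper for illustration; assumes a POSIX system.
    """
    try:
        # Signal 0 sends nothing; it only checks that the PID can be signalled.
        os.kill(pid, 0)
    except ProcessLookupError:
        # No such process on this host. It may still be alive on another node.
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True
```

Because the result says nothing about other login nodes, a `False` here is not enough evidence to treat daemon.pid as stale on a shared filesystem.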

To Reproduce
Steps to reproduce the behavior, e.g.:

  1. Install funcx-endpoint==0.3.2 with Python 3.7/3.8 on a cluster
  2. Connect to loginnode01 (one of several login nodes)
  3. Run funcx-endpoint configure test; funcx-endpoint start test
  4. Connect to loginnode02
  5. Run funcx-endpoint list; this will show test as disconnected
  6. Run funcx-endpoint start test

Expected behavior
funcx-endpoint list should not show an endpoint that is connected on another login node as disconnected.
funcx-endpoint start should neither wipe daemon.pid nor start a duplicate endpoint with the same endpoint_id.
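
One possible direction, sketched under stated assumptions rather than proposed as the fix: record the hostname alongside the PID and refuse to treat the record as stale when it was written on a different node. The `<hostname> <pid>` file format and the names `write_pid_record` and `endpoint_status` are hypothetical and are not part of funcx-endpoint.

```python
import os
import socket


def write_pid_record(path: str) -> None:
    """Write '<hostname> <pid>' instead of a bare PID (hypothetical format)."""
    with open(path, "w") as f:
        f.write(f"{socket.gethostname()} {os.getpid()}\n")


def endpoint_status(path: str) -> str:
    """Classify the record as 'stopped', 'running', 'stale', or 'remote'."""
    if not os.path.exists(path):
        return "stopped"
    with open(path) as f:
        host, pid_text = f.read().split()
    if host != socket.gethostname():
        # The PID belongs to another node and cannot be probed from here,
        # so the caller should not wipe the file or start a duplicate.
        return "remote"
    try:
        os.kill(int(pid_text), 0)  # POSIX existence check; sends no signal
    except ProcessLookupError:
        return "stale"
    except PermissionError:
        pass  # the process exists but is owned by another user
    return "running"
```

With a check like this, list could report a 'remote' endpoint as running on another host, and start could decline to proceed unless the status is 'stopped' or 'stale'.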

Distributed Environment

  • Running on a multi-login-node system
yadudoc added the bug label on Aug 27, 2021