Spinning up too many workers #722

Open
WardLT opened this issue Mar 18, 2022 · 2 comments
Labels
bug Something isn't working

Comments

WardLT (Contributor) commented Mar 18, 2022

Describe the bug
The funcX manager seems to be continually trying to spin up more workers than the worker map allows. I know this because I'm using the "pin to accelerator" functionality, which raises an error whenever the manager tries to start more workers than there are available accelerators.
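
For context, here is a minimal sketch of the kind of endpoint configuration involved. The available_accelerators name is taken from the queue attribute in the traceback below; the import paths and other details are assumptions and may differ between funcx-endpoint versions.

# Hypothetical config sketch, not a confirmed reproduction.
# Assumes the executor accepts an `available_accelerators` option,
# matching the `self.available_accelerators` queue in the traceback below.
from funcx_endpoint.endpoint.utils.config import Config
from funcx_endpoint.executors import HighThroughputExecutor

config = Config(
    executors=[
        HighThroughputExecutor(
            # Pin each worker to one of these devices; once all of them
            # are handed out, any further spin-up attempt should fail.
            available_accelerators=["0", "1", "2", "3"],
        )
    ]
)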

To Reproduce

TBD; I don't have a minimal case that reproduces this yet.

Expected behavior

The manager should not write an error message every few seconds.

Environment

  • OS: Ubuntu
  • Python version @ client: 3.8.12
  • Python version @ endpoint: 3.8.12
  • funcx version @ client: 0.3.9
  • funcx-endpoint version @ endpoint: 0.3.9.dev0

Distributed Environment

  • Where are you running the funcX script from? Laptop
  • Where does the endpoint run? Workstation
  • What is your endpoint-uuid? acdb2f41-fd86-4bc7-a1e5-e19c12d3350d

manager.log

WardLT added the bug label on Mar 18, 2022
WardLT (Contributor, Author) commented Aug 18, 2022

This is still a problem on FuncX v1

1660838271.649775 2022-08-18 10:57:51 ERROR MainProcess-2266833 MainThread-140719713703616 funcx_endpoint.executors.high_throughput.worker_map:181 spin_up_workers Error spinning up worker! Skipping...
Traceback (most recent call last):
  File "/lus/theta-fs0/projects/CSC249ADCD08/edw/env-gpu/lib/python3.8/site-packages/funcx_endpoint/executors/high_throughput/worker_map.py", line 339, in add_worker
    device = self.available_accelerators.get_nowait()
  File "/lus/theta-fs0/projects/CSC249ADCD08/edw/env-gpu/lib/python3.8/queue.py", line 198, in get_nowait
    return self.get(block=False)
  File "/lus/theta-fs0/projects/CSC249ADCD08/edw/env-gpu/lib/python3.8/queue.py", line 167, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/lus/theta-fs0/projects/CSC249ADCD08/edw/env-gpu/lib/python3.8/site-packages/funcx_endpoint/executors/high_throughput/worker_map.py", line 169, in spin_up_workers
    proc = self.add_worker(
  File "/lus/theta-fs0/projects/CSC249ADCD08/edw/env-gpu/lib/python3.8/site-packages/funcx_endpoint/executors/high_throughput/worker_map.py", line 341, in add_worker
    raise ValueError(
ValueError: No accelerators are available. New worker must be created only after another is removed
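
For what it's worth, the failure mode can be illustrated with the standard library alone. This is a hypothetical sketch of what appears to be happening, not the actual worker_map code:

import queue

# Each worker takes one accelerator off a queue; spinning up a worker
# when the queue is empty surfaces queue.Empty, which is re-raised as
# the ValueError logged above.
available_accelerators = queue.Queue()
for device in ["0", "1"]:          # two accelerators available
    available_accelerators.put(device)

def add_worker():
    try:
        device = available_accelerators.get_nowait()
    except queue.Empty:
        raise ValueError(
            "No accelerators are available. "
            "New worker must be created only after another is removed"
        )
    return device

for _ in range(3):                 # manager keeps asking for more workers
    try:
        print("pinned worker to accelerator", add_worker())
    except ValueError as exc:
        print("spin_up_workers would log:", exc)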

WardLT (Contributor, Author) commented May 23, 2023

I didn't note something important when creating this issue: the manager fails to spin up new workers until all other workers have exited. That is almost certainly related.
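
If it helps narrow this down, that symptom would be consistent with accelerators not being returned to the queue until every worker has exited. A hypothetical sketch of a per-worker release (not the actual funcx-endpoint code) that would avoid it:

import queue

available_accelerators = queue.Queue()
available_accelerators.put("0")

def remove_worker(device):
    # Hypothetical: returning the accelerator as soon as a single worker
    # exits lets the next spin-up succeed without waiting for all
    # workers to finish.
    available_accelerators.put(device)

device = available_accelerators.get_nowait()   # worker starts, pins device "0"
remove_worker(device)                          # worker exits, device released
print(available_accelerators.qsize())          # 1 -> a new worker can start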
