Spinning up too many workers #722

Open
WardLT opened this issue Mar 18, 2022 · 2 comments
Labels
bug Something isn't working

Comments

WardLT (Contributor) commented Mar 18, 2022

Describe the bug
The funcX manager seems to be continually trying to spin up more workers than the worker map allows. I know this because I'm using the "pin to accelerator" functionality, which raises an error whenever the manager tries to start more workers than there are available accelerators.
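
For context, here is a minimal sketch of the kind of endpoint configuration involved. The available_accelerators name is taken from the queue attribute in the traceback below; the import paths and other details are assumptions and may differ between funcx-endpoint versions.

# Hypothetical config sketch, not a confirmed reproduction.
# Assumes the executor accepts an `available_accelerators` option,
# matching the `self.available_accelerators` queue in the traceback below.
from funcx_endpoint.endpoint.utils.config import Config
from funcx_endpoint.executors import HighThroughputExecutor

config = Config(
    executors=[
        HighThroughputExecutor(
            # Pin each worker to one of these devices; once all of them
            # are handed out, any further spin-up attempt should fail.
            available_accelerators=["0", "1", "2", "3"],
        )
    ]
)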

To Reproduce

TBD; I don't have a minimal case that reproduces this yet.

Expected behavior

The manager should not write an error message every few seconds.

Environment

  • OS: Ubuntu
  • Python version @ client: 3.8.12
  • Python version @ endpoint: 3.8.12
  • funcx version @ client: 0.3.9
  • funcx-endpoint version @ endpoint: 0.3.9.dev0

Distributed Environment

  • Where are you running the funcX script from? Laptop
  • Where does the endpoint run? Workstation
  • What is your endpoint-uuid? acdb2f41-fd86-4bc7-a1e5-e19c12d3350d

manager.log

WardLT added the bug label on Mar 18, 2022
WardLT (Contributor, Author) commented Aug 18, 2022

This is still a problem on FuncX v1

1660838271.649775 2022-08-18 10:57:51 ERROR MainProcess-2266833 MainThread-140719713703616 funcx_endpoint.executors.high_throughput.worker_map:181 spin_up_workers Error spinning up worker! Skipping...
Traceback (most recent call last):
  File "/lus/theta-fs0/projects/CSC249ADCD08/edw/env-gpu/lib/python3.8/site-packages/funcx_endpoint/executors/high_throughput/worker_map.py", line 339, in add_worker
    device = self.available_accelerators.get_nowait()
  File "/lus/theta-fs0/projects/CSC249ADCD08/edw/env-gpu/lib/python3.8/queue.py", line 198, in get_nowait
    return self.get(block=False)
  File "/lus/theta-fs0/projects/CSC249ADCD08/edw/env-gpu/lib/python3.8/queue.py", line 167, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/lus/theta-fs0/projects/CSC249ADCD08/edw/env-gpu/lib/python3.8/site-packages/funcx_endpoint/executors/high_throughput/worker_map.py", line 169, in spin_up_workers
    proc = self.add_worker(
  File "/lus/theta-fs0/projects/CSC249ADCD08/edw/env-gpu/lib/python3.8/site-packages/funcx_endpoint/executors/high_throughput/worker_map.py", line 341, in add_worker
    raise ValueError(
ValueError: No accelerators are available. New worker must be created only after another is removed
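
For what it's worth, the failure mode can be illustrated with the standard library alone. This is a hypothetical sketch of what appears to be happening, not the actual worker_map code:

import queue

# Each worker takes one accelerator off a queue; spinning up a worker
# when the queue is empty surfaces queue.Empty, which is re-raised as
# the ValueError logged above.
available_accelerators = queue.Queue()
for device in ["0", "1"]:          # two accelerators available
    available_accelerators.put(device)

def add_worker():
    try:
        device = available_accelerators.get_nowait()
    except queue.Empty:
        raise ValueError(
            "No accelerators are available. "
            "New worker must be created only after another is removed"
        )
    return device

for _ in range(3):                 # manager keeps asking for more workers
    try:
        print("pinned worker to accelerator", add_worker())
    except ValueError as exc:
        print("spin_up_workers would log:", exc)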

WardLT (Contributor, Author) commented May 23, 2023

I didn't note something important when creating this issue: the manager fails to spin up new workers until all other workers have exited. That is almost certainly related.
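
If it helps narrow this down, that symptom would be consistent with accelerators not being returned to the queue until every worker has exited. A hypothetical sketch of a per-worker release (not the actual funcx-endpoint code) that would avoid it:

import queue

available_accelerators = queue.Queue()
available_accelerators.put("0")

def remove_worker(device):
    # Hypothetical: returning the accelerator as soon as a single worker
    # exits lets the next spin-up succeed without waiting for all
    # workers to finish.
    available_accelerators.put(device)

device = available_accelerators.get_nowait()   # worker starts, pins device "0"
remove_worker(device)                          # worker exits, device released
print(available_accelerators.qsize())          # 1 -> a new worker can start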
