
Extract make_gpu_plan and its users from #1207 #1353

Open
wants to merge 7 commits into base: master

Conversation

sk1p
Member

@sk1p sk1p commented Nov 15, 2022

For discussion: since this changes the way resources are allocated quite significantly, we might need to iterate a bit on the interface.

Edit by @uellue, from a discussion with @sk1p:

We should use the information from lstopo and os.sched_getaffinity(0) to get the physical cores we are actually allowed to use, as opposed to the ones available in the system. Only with this information can we pin the PipelinedExecutor to the correct cores and start the correct number of workers. This is relevant for running within a container, in particular in CI.

  • Query os.sched_getaffinity(0) and the topology to determine the physical cores to run on (sketched below).
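
A minimal sketch of how this could look on Linux, combining the affinity mask with the sysfs topology files (illustrative only, not part of this PR; the helper name is made up):

import os
from collections import defaultdict

def allowed_physical_cores():
    """One usable logical CPU per physical core, respecting the affinity mask."""
    allowed = os.sched_getaffinity(0)  # honours cgroup cpusets, unlike os.cpu_count()
    by_core = defaultdict(list)
    for cpu in sorted(allowed):
        base = f"/sys/devices/system/cpu/cpu{cpu}/topology"
        try:
            with open(f"{base}/physical_package_id") as f:
                package_id = int(f.read())
            with open(f"{base}/core_id") as f:
                core_id = int(f.read())
        except OSError:
            return sorted(allowed)  # no topology info available, fall back to the raw mask
        # hyperthread siblings share (package_id, core_id)
        by_core[(package_id, core_id)].append(cpu)
    # pin one worker to the lowest-numbered sibling of each physical core
    return sorted(min(siblings) for siblings in by_core.values())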

Contributor Checklist:

Reviewer Checklist:

  • /azp run libertem.libertem-data passed
  • No import of GPL code from MIT code

@sk1p sk1p added the enhancement New feature or request label Nov 15, 2022
@sk1p
Member Author

sk1p commented Nov 15, 2022

/azp run libertem.libertem-data

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@sk1p
Member Author

sk1p commented Nov 15, 2022

I think there are two main topics to discuss:

  1. The change of resource allocation - after this PR, a CUDA worker "consumes" a CPU core, where previously the CUDA workers were counted on top of the CPU workers

  2. The interface itself - having to pass in cuda_info for spawning a cluster is a bit inconvenient (and backwards-incompatible); also it's convenient to have round-robin assignment for integer arguments as implemented in Allow integer arguments for cluster_spec #1336.

We might want to take a step back and enumerate the common use-cases and re-think how we can implement them in a convenient and backwards-compatible way.

@sk1p
Member Author

sk1p commented Nov 15, 2022

/azp run libertem.libertem-data

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@sk1p sk1p mentioned this pull request Nov 15, 2022
@matbryan52
Member

> I think there are two main topics to discuss:
>
> 1. The change of resource allocation - after this PR, a CUDA worker "consumes" a CPU core, where previously the CUDA workers were counted on top of the CPU workers
>
> 2. The interface itself - having to pass in `cuda_info` for spawning a cluster is a bit inconvenient (and backwards-incompatible); also it's convenient to have round-robin assignment for integer arguments as implemented in [Allow integer arguments for `cluster_spec` #1336](https://github.com/LiberTEM/LiberTEM/pull/1336).
>
> We might want to take a step back and enumerate the common use-cases and re-think how we can implement them in a convenient and backwards-compatible way.

To summarize, the semantics were / are:

Originally:

  • ...dask.cluster_spec => cpus[Iterable]: create n workers, equal to the length of the iterable
  • ...pipelined._make_spec => cpus[Iterable]: pin 1 worker each to exactly these cores, where supported
  • for both => cudas[Iterable]: pin 1 worker each to exactly these GPU ids (repeat pinning allowed)

After #1336 for both dask and pipelined:

  • cpus[int] => create this many workers exactly
  • cudas[int] => assign N workers round robin across all available GPUs

With this PR as-is:

  • Require a cuda_info dict if cudas is nonzero, otherwise make_gpu_plan will fall over
  • Automatic extra workers per GPU within RAM limits (by default)
  • A parameter for the maximum number of workers per CUDA device
  • Reserve one CPU per CUDA worker (prioritise spreading workers over the GPUs before putting many workers on one GPU when CPU-limited)
  • cudas[int] => converted to range(int), i.e. pin to GPU ids 0..n-1, some of which might not be available on a system (see the sketch below)
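
To make the difference in integer handling concrete, a minimal sketch of the two interpretations (the helper names are hypothetical, not the actual LiberTEM API):

from itertools import cycle, islice

def round_robin(n_workers, available_gpu_ids):
    # after #1336: spread n workers over the GPUs that are actually visible
    return list(islice(cycle(available_gpu_ids), n_workers))

def as_range(n):
    # this PR as-is: treat the integer as GPU ids 0..n-1, which may not all exist
    return list(range(n))

assert round_robin(5, [0, 1]) == [0, 1, 0, 1, 0]
assert as_range(5) == [0, 1, 2, 3, 4]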

@matbryan52
Member

How about the following interface:

from typing import Any, Dict, Iterable, Optional

def cluster_spec(
    cpus: int | Iterable[int],
    cudas: int | Iterable[int],
    has_cupy: bool,
    cuda_scaling: Optional[Dict[str, Any]] = None,
):
    ...
  • cpus: assign either n_workers [int] or len(cpus) [iterable], pin when possible, raise if pinning is requested and num_cores < max(cpus)
  • cudas: assign n_workers across available GPUs (round-robin), or pin workers to the GPU ids in the iterable (repeats allowed); raise if the iterable contains GPU ids that are not visible, or if no GPUs are visible at all and cudas > 0
  • has_cupy: unchanged
  • cuda_scaling: when provided, require cudas to be an Iterable specifying which CUDA device ids to scale on, and require cuda_scaling to contain at least these ids; the scaling params (ram-per-worker, max-per-device) also go here. Use this info to scale the workers on the assigned cudas as in the current make_gpu_plan.

For reserving CPUs for each CUDA worker - I feel like it's quite analysis dependent? Some tasks take both CPU and GPU power, others might be almost purely GPU. If we could guarantee pinning we could compromise and assign 2 GPU workers to a shared single CPU core, for example, but this is not even an option with Dask or some OSs. It could be an option within cuda_scaling, though.
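
As a sketch of how calls to the proposed interface might look (hypothetical throughout; the cuda_scaling keys are made up for illustration and nothing here is existing LiberTEM API):

def cluster_spec(cpus, cudas, has_cupy=False, cuda_scaling=None):
    ...  # stand-in for the proposed signature above

# count-based: 8 CPU workers, 2 CUDA workers assigned round-robin
cluster_spec(cpus=8, cudas=2, has_cupy=True)

# pinned: CPU workers on cores 0-3, two workers on GPU 0, one on GPU 1
cluster_spec(cpus=range(4), cudas=(0, 0, 1), has_cupy=True)

# scaled: extra workers per device within a RAM budget
cluster_spec(
    cpus=8,
    cudas=(0, 1),
    has_cupy=True,
    cuda_scaling={
        0: {"ram_per_worker": 4e9, "max_per_device": 3},
        1: {"ram_per_worker": 4e9, "max_per_device": 3},
    },
)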

uellue added a commit to uellue/LiberTEM that referenced this pull request Feb 9, 2023
uellue added a commit to uellue/LiberTEM that referenced this pull request Feb 9, 2023
@uellue
Member

uellue commented Feb 10, 2023

The change is supposed to address three issues:

First, when running a set of UDFs, CUDA workers perform not only GPU work but also a lot of CPU work in case not all UDFs support CUDA. When doing CUDA work they also spin a core at full speed. That means CUDA workers do, in fact, consume a CPU resource on the system, and running many CUDA workers in parallel on top of CPU workers that already saturate the cores leads to oversubscription of the CPU.

Second, @matbryan52, you had observed that multiple GPU workers improve throughput.

And third, the number of partitions should be a multiple of the worker count to reduce stragglers with smaller datasets (for example, with 16 workers and 17 equally sized partitions, the whole run waits on the single leftover partition). That means the number of workers that actually work on a UDF run should be predictable and constant, independent of the UDFs that are executed.

I'll try to rework the interface in such a way that it is backwards-compatible!

@uellue
Member

uellue commented Feb 10, 2023

I've tested the current state of the PR with ApplyMasksUDF on the large_raw fixture on ptycho, showing how much it can improve performance. Definitely worthwhile IMO. :-)

Baseline

------------------------------------------- benchmark: 1 tests ------------------------------------------
Name (time in s)        Min     Max    Mean  StdDev  Median     IQR  Outliers     OPS  Rounds  Iterations
---------------------------------------------------------------------------------------------------------
test_masks_udf       6.1322  6.4635  6.2467  0.1361  6.2004  0.1905       1;0  0.1601       5           1
---------------------------------------------------------------------------------------------------------

Current PR

---------------------------------------- benchmark 'udf': 1 tests ---------------------------------------
Name (time in s)        Min     Max    Mean  StdDev  Median     IQR  Outliers     OPS  Rounds  Iterations
---------------------------------------------------------------------------------------------------------
test_masks_udf       3.8892  3.9851  3.9467  0.0366  3.9516  0.0465       2;0  0.2534       5           1
---------------------------------------------------------------------------------------------------------

@uellue
Member

uellue commented Feb 10, 2023

/azp run libertem.libertem-data

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@uellue
Member

uellue commented Feb 10, 2023

/azp run libertem.libertem-data

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@uellue
Member

uellue commented Feb 10, 2023

@sk1p can you have a look if the changes address your concerns? :-)

@uellue
Member

uellue commented Feb 13, 2023

/azp run libertem.libertem-data

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@uellue
Member

uellue commented Feb 13, 2023

/azp run libertem.libertem-data

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

uellue added a commit to uellue/LiberTEM that referenced this pull request Mar 17, 2023
uellue added a commit to uellue/LiberTEM that referenced this pull request Mar 17, 2023
sk1p pushed a commit that referenced this pull request Mar 20, 2023
sk1p pushed a commit that referenced this pull request Mar 20, 2023
sk1p and others added 5 commits March 20, 2023 13:30
* Support CUDA-only workers
* Allow specifying clusters that don't offer CPU computation
* Effectively, merge CuPy and CPU workers into hybrid workers as far as available,
  and have separate workers for the overhang
* Allow specifying CPU and CUDA over-subscription: Add at least one worker
  for each time a device ID is specified
* Benchmark for ApplyMasksUDF to test impact of changes
Not sure why it only happens now...
@uellue
Member

uellue commented Mar 20, 2023

Call with @sk1p and @matbryan52:

ctx = Context.make_with(cpus=10, gpus=10)

ctx = Context.make_with(cpus=10, gpus=(0, 0, 1, 1, 2))

# does this pin to CPU cores?
# Yes, if supported by the executor
# Leaving out the first (executor) argument means "whatever the Context thinks is the default executor", currently "dask" (not sure if implemented already):
ctx = Context.make_with(cpus=(2, 4, 6, 8), gpus=(0, 0, 1, 1, 2))

ctx = Context.make_with('pipelined', cpus=(2, 4, 6, 8), gpus=(0, 0, 1, 1, 2))

# this will probably warn that the inline executor can't
# create multiple workers:
ctx = Context.make_with('inline', cpus=(2, 4, 6, 8), gpus=(0, 0, 1, 1, 2))

# just create a dask-distributed thingy:
ctx = Context.make_with('dask')

# this warns/raises that the resources can't be "created" as specified:
ctx = Context.make_with('dask-integration', cpus=(1, 2, 3))

def make_with(
    self,
    executor_type: Optional[str] = None,
    cpus: Optional[Union[int, Iterable[int]]] = None,
    gpus: Optional[Union[int, Iterable[int]]] = None,
    plot_class: Optional[str] = None,
):
    ...

TODO

  • specify details of a "resource specifier"
  • hybrid worker?

Started on the worker spec and a function that can assign workers based
on the planned input from make_with()
@uellue uellue added this to the 0.12 milestone Apr 19, 2023
uellue added a commit to sk1p/LiberTEM that referenced this pull request May 10, 2023
This remained from the extracted LiberTEM#1353
uellue added a commit that referenced this pull request May 10, 2023
This remained from the extracted #1353
@sk1p
Member Author

sk1p commented May 22, 2023

Collection of links on topology:

And a way to check how this works when cgroup limits are present:

$ docker run --cpuset-cpus 0-2 --rm -it python python -c 'import os; print(os.sched_getaffinity(0))'
{0, 1, 2}
$ docker run --cpuset-cpus 0-2 --rm -it debian:stable nproc
3
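
The same check from Python, for reference (a small sketch; os.cpu_count() reports all CPUs in the system, while the affinity mask reflects the cgroup/cpuset limit, so the two can disagree inside a container):

import os

allowed = os.sched_getaffinity(0)  # Linux-only, honours the cpuset
print(f"system CPUs: {os.cpu_count()}, usable CPUs: {len(allowed)} -> {sorted(allowed)}")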

sk1p added a commit to sk1p/LiberTEM that referenced this pull request Jul 4, 2023
This should now test equivalence of dask and pipelined executors, and
prepare for further changes in LiberTEM#1353 (`make_gpu_plan`).

* Split up `tests/test_local_cluster.py`
    * Move most dask-specific tests to `tests/executor/test_dask.py`
    * Move device class tests to `tests/executor/test_device_classes.py`
* Remove half-baked `test_correct_device_class_selected` again
* Run device class tests on both dask and pipelined executor
* Split up the `test_start_local_default` test into different tests for
  different situations (no GPU, CUDA available, CUDA+CuPy available),
  which makes it visible which case is _actually_ running in a concrete
  environment
* Move tests for `Task.get_resources` from
  `tests/udf/test_udf_runner.py` into `tests/udf/test_get_resources.py`
sk1p added a commit to sk1p/LiberTEM that referenced this pull request Jul 4, 2023
sk1p added a commit to sk1p/LiberTEM that referenced this pull request Jul 4, 2023
sk1p added a commit that referenced this pull request Jul 6, 2023
@sk1p sk1p modified the milestones: 0.12, 0.13 Jul 25, 2023
@uellue uellue modified the milestones: 0.13, 0.14 Oct 25, 2023
@sk1p sk1p modified the milestones: 0.14, 0.15 Apr 11, 2024
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants