Extract make_gpu_plan and its users from #1207
#1353
base: master
Conversation
/azp run libertem.libertem-data
Azure Pipelines successfully started running 1 pipeline(s).
I think there are two main topics to discuss:
We might want to take a step back and enumerate the common use-cases and re-think how we can implement them in a convenient and backwards-compatible way.
/azp run libertem.libertem-data
Azure Pipelines successfully started running 1 pipeline(s).
To summarize, the semantics were / are:

Originally:
After #1336 for both
With this PR as-is:
How about the following interface?

```python
from typing import Any, Dict, Iterable, Optional

def cluster_spec(
    cpus: int | Iterable[int],
    cudas: int | Iterable[int],
    has_cupy: bool,
    cuda_scaling: Optional[Dict[str, Any]] = None,
):
    ...
```
Regarding reserving CPUs for each CUDA worker: I feel it's quite analysis-dependent. Some tasks need both CPU and GPU power, while others are almost purely GPU-bound. If we could guarantee pinning, we could compromise and assign two GPU workers to a single shared CPU core, for example, but that isn't even an option with Dask or on some OSs. It could be an option within
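For illustration, a call to the proposed interface might look like the following; the `cuda_scaling` key used here is purely a made-up example of how per-CUDA-worker CPU reservation could be expressed, not an existing option:

```python
# Hypothetical call to the proposed cluster_spec() interface (sketch only).
spec = cluster_spec(
    cpus=8,             # or an explicit iterable of CPU ids, e.g. (0, 1, 2, 3)
    cudas=(0, 0, 1),    # a repeated device id would mean two workers on GPU 0
    has_cupy=True,
    # made-up key: reserve one CPU core for each CUDA worker
    cuda_scaling={"cpus_per_cuda_worker": 1},
)
```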
This is separated out in LiberTEM#1353
TODO include in LiberTEM#1353
The change is supposed to address three issues: First, when running a set of UDFs, CUDA workers perform not only GPU work but also a lot of CPU work if not all UDFs support CUDA. Even when doing CUDA work, they spin a core at full speed. That means CUDA workers do, in fact, consume a CPU resource on the system, and running too many CUDA workers in parallel with CPU workers that already saturate the cores leads to oversubscription of the CPU. Second, @matbryan52, you had observed that multiple GPU workers improve throughput. And third, the number of partitions should be a multiple of the worker count to reduce stragglers with smaller datasets, meaning the number of workers that actually work on a UDF run should be predictable and constant, independent of the UDFs that are executed. I'll try to rework the interface in such a way that it is backwards-compatible!
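As a small illustration of the third point, the partition count could be rounded up to the next multiple of the worker count, so that every worker gets the same number of partitions (a sketch with made-up numbers, not LiberTEM's actual partitioning logic):

```python
import math

def pick_num_partitions(num_workers: int, target_partitions: int) -> int:
    # Round the target up to the next multiple of the worker count so that
    # no worker sits idle while a few stragglers finish their last partition.
    return num_workers * max(1, math.ceil(target_partitions / num_workers))

# 16 workers and a target of 40 partitions -> 48 partitions, i.e. 3 per worker
assert pick_num_partitions(16, 40) == 48
```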
I've tested the current state of the PR with `ApplyMasksUDF` on the `large_raw` fixture on ptycho, showing how much it can improve performance. Definitely worthwhile IMO. :-)

Baseline:

Current PR:
/azp run libertem.libertem-data
Azure Pipelines successfully started running 1 pipeline(s).
/azp run libertem.libertem-data
Azure Pipelines successfully started running 1 pipeline(s).
@sk1p can you have a look at whether the changes address your concerns? :-)
/azp run libertem.libertem-data
Azure Pipelines successfully started running 1 pipeline(s).
/azp run libertem.libertem-data
Azure Pipelines successfully started running 1 pipeline(s).
* Support CUDA-only workers
* Allow specifying clusters that don't offer CPU computation
* Effectively, merge CuPy and CPU workers into hybrid workers as far as available, and have separate workers for the overhang
* Allow specifying CPU and CUDA over-subscription: add at least one worker for each time a device ID is specified (see the sketch below)
* Benchmark for ApplyMasksUDF to test impact of changes
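To illustrate the over-subscription item above, repeating a device ID in the spec could simply translate into several workers pinned to the same GPU. A minimal sketch, assuming the spec is given as an iterable of device IDs:

```python
from collections import Counter
from typing import Iterable

def cuda_workers_per_device(cudas: Iterable[int]) -> Counter:
    # Each occurrence of a device id adds one worker on that GPU,
    # so a repeated id means deliberate over-subscription of that device.
    return Counter(cudas)

# gpus=(0, 0, 1, 1, 2) -> two workers on GPU 0, two on GPU 1, one on GPU 2
print(cuda_workers_per_device((0, 0, 1, 1, 2)))  # Counter({0: 2, 1: 2, 2: 1})
```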
Not sure why it only happens now...
Call with @sk1p and @matbryan52:

```python
ctx = Context.make_with(cpus=10, gpus=10)

ctx = Context.make_with(cpus=10, gpus=(0, 0, 1, 1, 2))

# does this pin to CPU cores?
# Yes, if supported by the executor
# The first one missing means "whatever the Context thinks is the default
# executor". Currently "dask" (not sure if implemented already)
ctx = Context.make_with(cpus=(2, 4, 6, 8), gpus=(0, 0, 1, 1, 2))

ctx = Context.make_with('pipelined', cpus=(2, 4, 6, 8), gpus=(0, 0, 1, 1, 2))

# this will probably warn that the inline executor can't
# create multiple workers:
ctx = Context.make_with('inline', cpus=(2, 4, 6, 8), gpus=(0, 0, 1, 1, 2))

# just create a dask-distributed thingy:
ctx = Context.make_with('dask')

# this warns/raises that the resources can't be "created" as specified:
ctx = Context.make_with('dask-integration', cpus=(1, 2, 3))
```

The proposed signature:

```python
from typing import Iterable, Optional, Union

def make_with(
    self,
    executor_type: Optional[str] = None,
    cpus: Optional[Union[int, Iterable[int]]] = None,
    gpus: Optional[Union[int, Iterable[int]]] = None,
    plot_class: Optional[str] = None,
):
    ...  # TODO
```
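A minimal sketch of how the `cpus` / `gpus` arguments could be normalized before building a worker plan; `normalize_spec` is a hypothetical helper, not part of the actual implementation:

```python
import os
from typing import Iterable, Optional, Tuple, Union

def normalize_spec(
    spec: Optional[Union[int, Iterable[int]]], default_count: int
) -> Tuple[int, ...]:
    # None -> use the default count; an int -> that many workers with ids
    # chosen by the executor; an iterable -> explicit ids, which an executor
    # that supports pinning may map directly to cores / devices.
    if spec is None:
        spec = default_count
    if isinstance(spec, int):
        return tuple(range(spec))
    return tuple(spec)

print(normalize_spec(10, os.cpu_count() or 1))            # (0, 1, ..., 9)
print(normalize_spec((2, 4, 6, 8), os.cpu_count() or 1))  # (2, 4, 6, 8)
```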
Started on the worker spec and a function that can assign workers based on the planned input from `make_with()`.
This remained from the extracted LiberTEM#1353
This remained from the extracted #1353
Collection of links on topology:
And a way to check how this works when cgroup limits are present:

```console
$ docker run --cpuset-cpus 0-2 --rm -it python python -c 'import os; print(os.sched_getaffinity(0))'
{0, 1, 2}
$ docker run --cpuset-cpus 0-2 --rm -it debian:stable nproc
3
```
This should now test equivalence of dask and pipelined executors, and prepare for further changes in LiberTEM#1353 (`make_gpu_plan`).

* Split up `tests/test_local_cluster.py`
* Move most dask-specific tests to `tests/executor/test_dask.py`
* Move device class tests to `tests/executor/test_device_classes.py`
* Remove half-baked `test_correct_device_class_selected` again
* Run device class tests on both dask and pipelined executor
* Split up the `test_start_local_default` test into different tests for different situations (no GPU, CUDA available, CUDA+CuPy available), which makes it visible which case is _actually_ running in a concrete environment
* Move tests for `Task.get_resources` from `tests/udf/test_udf_runner.py` into `tests/udf/test_get_resources.py`
For discussion: as this changes the way resources are allocated quite significantly, we might need to iterate a bit on the interface.
Edit @uellue, discussion with @sk1p:
We should use the information from `lstopo` and `os.sched_getaffinity(0)` to get the physical cores we are actually allowed to use, as opposed to the ones available in the system. Only with this information can we pin the PipelinedExecutor to the correct cores and start the correct number of workers. This is relevant for running within a container, in particular in CI.
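A sketch of how the allowed logical CPUs could be grouped into physical cores on Linux, combining `os.sched_getaffinity(0)` with the sysfs topology files; this is only an illustration of the idea, not the planned implementation:

```python
import os
from collections import defaultdict

def allowed_physical_cores():
    # Group the logical CPUs this process is allowed to use (affinity /
    # cgroup aware) by their (package, core) id, approximating the set of
    # physical cores that workers could be pinned to.
    cores = defaultdict(set)
    for cpu in os.sched_getaffinity(0):
        base = f"/sys/devices/system/cpu/cpu{cpu}/topology"
        with open(f"{base}/physical_package_id") as f:
            package = int(f.read())
        with open(f"{base}/core_id") as f:
            core = int(f.read())
        cores[(package, core)].add(cpu)
    return cores

physical = allowed_physical_cores()
print(f"{len(physical)} usable physical cores:", dict(physical))
```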
* `os.sched_getaffinity(0)` and topology to determine physical cores to run on.

Contributor Checklist:
Reviewer Checklist:
/azp run libertem.libertem-data
passed