
Set CPU scheduling affinity #995

Open · wants to merge 3 commits into master
Conversation

@sk1p (Member) commented Mar 17, 2021

Contributor Checklist:

Reviewer Checklist:

  • /azp run libertem.libertem-data passed

@sk1p (Member, Author) commented Mar 17, 2021

/azp run libertem.libertem-data

@azure-pipelines
Azure Pipelines successfully started running 1 pipeline(s).

@codecov (bot) commented Mar 17, 2021

Codecov Report

Merging #995 (82cd656) into master (148737f) will increase coverage by 1.07%.
The diff coverage is 33.33%.


@@            Coverage Diff             @@
##           master     #995      +/-   ##
==========================================
+ Coverage   68.83%   69.90%   +1.07%     
==========================================
  Files         260      260              
  Lines       11950    12392     +442     
  Branches     1640     1765     +125     
==========================================
+ Hits         8226     8663     +437     
- Misses       3408     3411       +3     
- Partials      316      318       +2     
Impacted Files                              Coverage Δ
src/libertem/executor/dask.py               77.10% <33.33%> (-0.72%) ⬇️
src/libertem/executor/base.py               90.44% <0.00%> (-1.35%) ⬇️
src/libertem/analysis/sd.py                100.00% <0.00%> (ø)
src/libertem/analysis/apply_fft_mask.py    100.00% <0.00%> (ø)
src/libertem/executor/inline.py             94.33% <0.00%> (+0.05%) ⬆️
src/libertem/udf/stddev.py                  98.37% <0.00%> (+0.41%) ⬆️
src/libertem/udf/base.py                    94.80% <0.00%> (+2.75%) ⬆️
src/libertem/common/buffers.py              90.08% <0.00%> (+4.69%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 148737f...82cd656. Read the comment docs.

@sk1p (Member, Author) commented Mar 17, 2021

The test failure is a timeout; maybe #996 should be merged before this one.

@sk1p (Member, Author) commented Mar 17, 2021

/azp run libertem.libertem-data

@azure-pipelines
Azure Pipelines successfully started running 1 pipeline(s).

@uellue (Member) commented Mar 18, 2021

Could it be that LiberTEM sees cores in CI on which it can't actually run anything? For example, if a Docker container is limited to specific cores (https://thorsten-hans.com/docker-container-cpu-limits-explained#assign-containers-to-dedicated-cpus), would it still "see" the other CPUs and then try to pin workers to them, where they will never run?

I'll close and reopen to re-run CI, to see whether this was a glitch.

@uellue closed this Mar 18, 2021
@uellue reopened this Mar 18, 2021
@sk1p (Member, Author) commented Mar 18, 2021

Could it be that LiberTEM sees cores in CI on which it can't actually run anything?

Uhh, that could be possible. Maybe we can't blindly use the "CPU number" from 0 to N from the cluster spec for the affinity; we may have to build up a list of "eligible" core IDs which we then index into.

We can also leave this PR open for a while until we find a solution, as it's not high priority right now.
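A minimal sketch of that "eligible cores" idea, assuming os.sched_getaffinity is available (Linux); pin_worker and the modulo mapping are hypothetical, not what this PR currently does:

import os

def eligible_cores():
    # cores this process is actually allowed to run on, e.g. restricted
    # by a Docker --cpuset-cpus setting or by cgroups
    return sorted(os.sched_getaffinity(0))

def pin_worker(worker_idx):
    # map the abstract "CPU number" from the cluster spec onto a real,
    # eligible core instead of using it directly
    cores = eligible_cores()
    os.sched_setaffinity(0, {cores[worker_idx % len(cores)]})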

@sk1p (Member, Author) commented Mar 18, 2021

Could it be that LiberTEM sees cores in CI on which it can't actually run anything?

At least with the Docker method described in the article, sched_setaffinity returns EINVAL if there is a mismatch:

docker run --rm -it --cpuset-cpus 1,3,5 debian:stable /bin/bash
[...]
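(python3 started inside the container:)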
>>> import os
>>> os.sched_setaffinity(0, [7])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 22] Invalid argument
>>> os.sched_setaffinity(0, [1])
>>> os.sched_setaffinity(0, [2])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 22] Invalid argument
>>> os.sched_setaffinity(0, [3])
>>> os.sched_setaffinity(0, [4])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 22] Invalid argument
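
For reference (not part of the original session): querying the allowed set from inside the same container should return exactly the cores passed to --cpuset-cpus, which is the kind of "eligible" core list mentioned above:

>>> os.sched_getaffinity(0)
{1, 3, 5}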

@uellue (Member) commented Mar 18, 2021

Discussion with @sk1p:

Use cases for pinning:

  • Default: pin one worker per physical core, since that brings measurable benefits in typical cases.
  • TODO: check how pinning affects simultaneous CPU and GPU use, since a running CUDA operation will spin one CPU thread at 100 %. That CPU thread should be close to the GPU in question -- @sk1p, if I recall your work on multi-GPU computation correctly, this can make a real difference? Maybe allow CPU pinning for CUDA workers and leave CPU workers that would normally be pinned to that core unpinned, so that the OS can schedule them elsewhere while the CUDA CPU thread is spinning?
  • Disable pinning. This is important for running in containers that might be restricted to specific cores, which might even change at run time.
  • Handle a failed pinning attempt (OSError) gracefully (see the sketch after this list).
  • Allow customized pinning. Use case: live processing on a large NUMA system, where soft real-time tasks that process camera data should be pinned to cores close to specific network interfaces and/or storage devices. LiberTEM workers should be pinned so that enough cores stay free for such tasks.
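
A minimal sketch of the opt-out and graceful-failure points above; set_affinity and the enable_pinning flag are hypothetical names, not existing LiberTEM API:

import os
import warnings

def set_affinity(cores, enable_pinning=True):
    # opt-out: skip pinning entirely, e.g. in containers whose core
    # restrictions might change at run time
    if not enable_pinning:
        return
    try:
        os.sched_setaffinity(0, cores)
    except OSError as exc:
        # EINVAL if the requested cores are not available to this process;
        # fall back to letting the OS schedule the worker freely
        warnings.warn(f"could not pin to {cores}: {exc}")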

@uellue (Member) commented Mar 18, 2021

An MVP could make pinning opt-out, i.e. pinning by default with the option to disable it entirely. A second step would be customizable pinning (proposal by @sk1p):

libertem-worker --cpuset 1,2,4,5
libertem-server --no-cpu-affinity
libertem-server --cpuset 1,2,4,5,7-23

A third step would be a pinning setup for CPUs plus multiple CUDA devices. Since that is pretty complex, perhaps one could develop a routine that proposes a pinning scheme, for example by measuring transfer rates or latencies between specific cores and GPUs?
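
A minimal sketch of how such a --cpuset value could be parsed into a set of core IDs (parse_cpuset is a hypothetical helper, not existing LiberTEM code):

def parse_cpuset(spec):
    # "1,2,4,5,7-23" -> {1, 2, 4, 5, 7, 8, ..., 23}
    cores = set()
    for part in spec.split(","):
        if "-" in part:
            start, end = part.split("-")
            cores.update(range(int(start), int(end) + 1))
        else:
            cores.add(int(part))
    return cores

The result could then be intersected with os.sched_getaffinity(0) before pinning, to avoid the EINVAL case shown above.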

@sk1p added the WIP label Mar 19, 2021