
Set CPU scheduling affinity #995

Open · wants to merge 3 commits into master
Conversation

@sk1p (Member) commented Mar 17, 2021

Contributor Checklist:

Reviewer Checklist:

  • /azp run libertem.libertem-data passed

@sk1p (Member, Author) commented Mar 17, 2021

/azp run libertem.libertem-data

@azure-pipelines
Azure Pipelines successfully started running 1 pipeline(s).

@codecov (bot) commented Mar 17, 2021

Codecov Report

Merging #995 (82cd656) into master (148737f) will increase coverage by 1.07%.
The diff coverage is 33.33%.


@@            Coverage Diff             @@
##           master     #995      +/-   ##
==========================================
+ Coverage   68.83%   69.90%   +1.07%     
==========================================
  Files         260      260              
  Lines       11950    12392     +442     
  Branches     1640     1765     +125     
==========================================
+ Hits         8226     8663     +437     
- Misses       3408     3411       +3     
- Partials      316      318       +2     
Impacted Files                              Coverage Δ
src/libertem/executor/dask.py               77.10% <33.33%> (-0.72%) ⬇️
src/libertem/executor/base.py               90.44% <0.00%> (-1.35%) ⬇️
src/libertem/analysis/sd.py                100.00% <0.00%> (ø)
src/libertem/analysis/apply_fft_mask.py    100.00% <0.00%> (ø)
src/libertem/executor/inline.py             94.33% <0.00%> (+0.05%) ⬆️
src/libertem/udf/stddev.py                  98.37% <0.00%> (+0.41%) ⬆️
src/libertem/udf/base.py                    94.80% <0.00%> (+2.75%) ⬆️
src/libertem/common/buffers.py              90.08% <0.00%> (+4.69%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 148737f...82cd656. Read the comment docs.

@sk1p (Member, Author) commented Mar 17, 2021

The test failure is a timeout; maybe #996 should be merged before this one.

@sk1p (Member, Author) commented Mar 17, 2021

/azp run libertem.libertem-data

@azure-pipelines
Azure Pipelines successfully started running 1 pipeline(s).

@uellue (Member) commented Mar 18, 2021

Could it be that LiberTEM sees cores in CI on which it can't actually run anything? For example, if a Docker container is limited to specific cores (https://thorsten-hans.com/docker-container-cpu-limits-explained#assign-containers-to-dedicated-cpus), would it still "see" the other CPUs and then try to pin workers to them, where they will never run?

I'll close and reopen to re-run CI, to see whether this was a glitch.

@uellue closed this Mar 18, 2021
@uellue reopened this Mar 18, 2021
@sk1p (Member, Author) commented Mar 18, 2021

Could it be that LiberTEM sees cores in CI on which it can't actually run anything?

Uhh, that could be possible. Maybe we can't blindly use the "CPU number" from 0 to N from the cluster spec for the affinity; we may have to build up a list of "eligible" core IDs which we then index into.

We can also leave this PR open for a while until we find a solution, as it's not high priority right now.
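A minimal sketch of that "eligible cores" idea, assuming os.sched_getaffinity is available (Linux); pin_worker and the modulo mapping are hypothetical, not what this PR currently does:

import os

def eligible_cores():
    # cores this process is actually allowed to run on, e.g. restricted
    # by a Docker --cpuset-cpus setting or by cgroups
    return sorted(os.sched_getaffinity(0))

def pin_worker(worker_idx):
    # map the abstract "CPU number" from the cluster spec onto a real,
    # eligible core instead of using it directly
    cores = eligible_cores()
    os.sched_setaffinity(0, {cores[worker_idx % len(cores)]})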

@sk1p (Member, Author) commented Mar 18, 2021

Could it be that LiberTEM sees cores in CI on which it can't actually run anything?

At least with the Docker method described in the article, sched_setaffinity returns EINVAL if there is a mismatch:

docker run --rm -it --cpuset-cpus 1,3,5 debian:stable /bin/bash
[...]
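(python3 started inside the container:)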
>>> import os
>>> os.sched_setaffinity(0, [7])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 22] Invalid argument
>>> os.sched_setaffinity(0, [1])
>>> os.sched_setaffinity(0, [2])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 22] Invalid argument
>>> os.sched_setaffinity(0, [3])
>>> os.sched_setaffinity(0, [4])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 22] Invalid argument
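
For reference (not part of the original session): querying the allowed set from inside the same container should return exactly the cores passed to --cpuset-cpus, which is the kind of "eligible" core list mentioned above:

>>> os.sched_getaffinity(0)
{1, 3, 5}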

@uellue (Member) commented Mar 18, 2021

Discussion with @sk1p:

Use cases for pinning:

  • Default: pin one worker per physical core, since that brings measurable benefits in typical cases.
  • TODO: check how pinning affects simultaneous CPU and GPU use, since a running CUDA operation will spin one CPU thread at 100 %. That CPU thread should be close to the GPU in question -- @sk1p, if I recall your work on multi-GPU computation correctly, this can make a real difference? Maybe allow CPU pinning for CUDA workers and leave CPU workers that would normally be pinned to that core unpinned, so that the OS can schedule them elsewhere while the CUDA CPU thread is spinning?
  • Disable pinning. This is important for running in containers that might be restricted to specific cores, which might even change at run time.
  • Handle a failed pinning attempt (OSError) gracefully (see the sketch after this list).
  • Allow customized pinning. Use case: live processing on a large NUMA system, where soft real-time tasks that process camera data should be pinned to cores close to specific network interfaces and/or storage devices. LiberTEM workers should be pinned so that enough cores stay free for such tasks.
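
A minimal sketch of the opt-out and graceful-failure points above; set_affinity and the enable_pinning flag are hypothetical names, not existing LiberTEM API:

import os
import warnings

def set_affinity(cores, enable_pinning=True):
    # opt-out: skip pinning entirely, e.g. in containers whose core
    # restrictions might change at run time
    if not enable_pinning:
        return
    try:
        os.sched_setaffinity(0, cores)
    except OSError as exc:
        # EINVAL if the requested cores are not available to this process;
        # fall back to letting the OS schedule the worker freely
        warnings.warn(f"could not pin to {cores}: {exc}")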

@uellue (Member) commented Mar 18, 2021

An MVP could make pinning opt-out, i.e. pinning by default with the option to disable it entirely. A second step would be customizable pinning (proposal by @sk1p):

libertem-worker --cpuset 1,2,4,5
libertem-server --no-cpu-affinity
libertem-server --cpuset 1,2,4,5,7-23

A third step would be a pinning setup for CPUs plus multiple CUDA devices. Since that is pretty complex, perhaps one could develop a routine that proposes a pinning scheme, for example by measuring transfer rates or latencies between specific cores and GPUs?
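
A minimal sketch of how such a --cpuset value could be parsed into a set of core IDs (parse_cpuset is a hypothetical helper, not existing LiberTEM code):

def parse_cpuset(spec):
    # "1,2,4,5,7-23" -> {1, 2, 4, 5, 7, 8, ..., 23}
    cores = set()
    for part in spec.split(","):
        if "-" in part:
            start, end = part.split("-")
            cores.update(range(int(start), int(end) + 1))
        else:
            cores.add(int(part))
    return cores

The result could then be intersected with os.sched_getaffinity(0) before pinning, to avoid the EINVAL case shown above.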

@sk1p added the WIP label Mar 19, 2021