New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kubelet: consider real CPU topology when check Core/Socket free #124561
base: master
Are you sure you want to change the base?
Conversation
Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
Welcome @mtawaken! |
Hi @mtawaken. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: mtawaken The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/ok-to-test |
/cc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this is an improvement, but also a partial fix. It seems to me a complete fix will be much more complex and perhaps would require changes even at cadvisor/cri level to expose a more detailed topology. The feasibility is uncertain.
I for myself I'd like to push this PR forward, but during the conversation inhttps://github.com//issues/124066 it seems more changes are needed to also fix full-pcpus-only
.
@@ -251,13 +251,13 @@ func (a *cpuAccumulator) isNUMANodeFree(numaID int) bool { | |||
// Returns true if the supplied socket is fully available in `a.details`. | |||
// "fully available" means that all the CPUs in it are free. | |||
func (a *cpuAccumulator) isSocketFree(socketID int) bool { | |||
return a.details.CPUsInSockets(socketID).Size() == a.topo.CPUsPerSocket() | |||
return a.details.CPUsInSockets(socketID).Size() == a.topo.CPUDetails.CPUsInSockets(socketID).Size() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if CPUsPerSocket
is no longer used (it should be the case) let's delete it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done it
} | ||
|
||
// Returns true if the supplied core is fully available in `a.details`. | ||
// "fully available" means that all the CPUs in it are free. | ||
func (a *cpuAccumulator) isCoreFree(coreID int) bool { | ||
return a.details.CPUsInCores(coreID).Size() == a.topo.CPUsPerCore() | ||
return a.details.CPUsInCores(coreID).Size() == a.topo.CPUDetails.CPUsInCores(coreID).Size() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
considering the issue #124066 as example, with this change the offlined core (id=25) won't be used, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Yes, the CPU(id=25) would not be used by container allocation because Core(id=25) has already allocated CPU(id=25) in previous Reserved Core allocation process. The buggy part is in
acc.takeFullCores
invoked bytakeByTopologyNUMAPacked
during container allocation. It thought Core(id=25) was free before this fix and used CPU(id=25),CPU(id=73) directly by refering raw Topo. - the core containing offline CPU(id=95) is not id=25, but id=47. it just affects
isCoreFree
. the offlined CPU(id=95) is never used because it does not appears in Topo(discovered by advisor) anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, thanks
/triage accepted |
/release-note none |
/cc |
The only reference of About the
|
b719b1c
to
ec22153
Compare
ec22153
to
c2ee966
Compare
What type of PR is this?
/kind bug
What this PR does / why we need it:
For certain CPU Topolgy(HT enabled, Multi Socket etc. ) with some CPU offline, the CPUsPerSocket/CPUsPerCore fails to conclude the correct result, and further causes misjudgement in isCoreFree/isSocketFree.
The isCoreFree/isSocketFree function should consider real topology, just like what isNumaNodeFree does.
Which issue(s) this PR fixes:
Fixes #124066
Special notes for your reviewer:
Does this PR introduce a user-facing change?
None