Scheduler throughput reduced when there are many gated pods #124384
Comments
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the `triage/accepted` label.
/sig scheduling
@Huang-Wei, do you remember what the justification was for putting the gated pods in the same unschedulable pool of pods? I think we should just keep them entirely separate, so that gated pods truly don't affect scheduling throughput.
/cc
I don't think we need to take a lock before checking
I tried creating 10,000 gated pods, and then whether I changed the gated pods to non-gated pods or created normal pods, I did not find that throughput was reduced. Is there something I am missing? Scheduling seems fine to me. This should not be a bug; for the lock to affect throughput, the number of pods would presumably need to be much larger than 10,000.
It's mainly for keeping the queueing code maintainable. Given that the transitions among activeQ/backoffQ/unschedulableQ were already complex, adding another internal queue didn't sound like a good idea at that time.
I doubt that throughput is really impacted... In terms of the re-queuing check on gated pods (I haven't gone through the latest code with schedulingHint), the original version is fairly straightforward: a PodDeletion event won't interest a gated pod at all, so it simply returns. Even if throughput is degraded, it's suspicious that the dominant factor would be iterating gated pods. Needless to say, all re-queueing checks happen in a separate goroutine. So if throughput is proven to be impacted, I'd suggest narrowing down the real problem; otherwise, moving gated pods to a separate struct won't benefit that much.
We should pinpoint the specific issue by crafting a reproducible test, and then work on the solution. Maybe it's the lock, maybe it's something else.
60s pprof sample with the same experiment setup as the comment directly above, after pods started terminating: 61.7% of samples are from deletePodFromCache, with 55.3% attributable to these 4 lines:
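For anyone repeating the measurement, here is a minimal sketch of capturing such a 60s CPU profile over HTTP, assuming the scheduler's pprof endpoint is reachable; the URL below is a placeholder that depends on how profiling is exposed in the cluster, and the saved file can then be opened with `go tool pprof`:

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// net/http/pprof serves CPU profiles at /debug/pprof/profile;
	// seconds=60 matches the 60s sample discussed above. Host and port
	// are placeholders for wherever the scheduler's pprof endpoint is
	// exposed in your cluster.
	resp, err := http.Get("http://127.0.0.1:10251/debug/pprof/profile?seconds=60")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("scheduler-cpu.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	// Stream the raw profile to disk; inspect with
	// `go tool pprof scheduler-cpu.pprof`.
	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatal(err)
	}
}
```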
The smallest change would be to filter gated pods inline here. We could also use a preCheck function, but that is more than one line :)
/assign gabesaba
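A rough sketch of what that inline filter could look like, with illustrative names rather than the scheduler's actual internals (only the `SchedulingGates` field on the pod spec is the real API); the point is that a deletion event can never unblock a gated pod, so gated pods are skipped before any per-pod re-queueing work:

```go
package queue

import (
	v1 "k8s.io/api/core/v1"
)

// podInfo is a stand-in for the scheduler's internal queued-pod wrapper.
type podInfo struct {
	Pod *v1.Pod
}

// filterUngated returns only the pods that are not scheduling-gated.
// A PodDeletion event cannot make a gated pod schedulable, so gated pods
// are dropped here before any expensive hint evaluation runs under the lock.
func filterUngated(pods []*podInfo) []*podInfo {
	out := make([]*podInfo, 0, len(pods))
	for _, pInfo := range pods {
		if len(pInfo.Pod.Spec.SchedulingGates) > 0 {
			continue
		}
		out = append(out, pInfo)
	}
	return out
}
```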
The point is that this issue shows a general possibility that too many Pods in unschedQ can have a negative impact on scheduling cycles. So, I'd suggest moving
I agree. Just note that the motivation for scheduling gates is to be able to have a lot of gated pods without affecting scheduling performance. So this issue shows a weakness of the current implementation. OTOH, having a lot of pending pods has always been a problem.
Are you sure that is possible? In order to run
I think maybe what sanposhiho means is to release the lock before that check. Maybe something like this:
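The snippet that followed isn't preserved here; below is a minimal sketch of the idea, under the assumption that a pod's gated status is computed once at enqueue time and so can be read from a snapshot taken under a short read lock (all names are illustrative):

```go
package queue

import "sync"

// queuedPod is a stand-in for the scheduler's internal pod wrapper.
// Gated is computed once at enqueue time and never changes while the
// pod sits in the unschedulable pool.
type queuedPod struct {
	Name  string
	Gated bool
}

type schedulingQueue struct {
	lock              sync.RWMutex
	unschedulablePods map[string]*queuedPod
}

// moveUngatedToActive reacts to a cluster event. The gated check runs
// over a snapshot taken under a brief read lock; the heavier re-queueing
// work then only touches ungated pods under the write lock.
func (q *schedulingQueue) moveUngatedToActive() {
	q.lock.RLock()
	candidates := make([]*queuedPod, 0, len(q.unschedulablePods))
	for _, p := range q.unschedulablePods {
		if p.Gated {
			continue // skip without doing any per-pod hint work
		}
		candidates = append(candidates, p)
	}
	q.lock.RUnlock()

	q.lock.Lock()
	defer q.lock.Unlock()
	// NOTE: between RUnlock and Lock other goroutines may mutate the pool,
	// so real code would have to re-validate candidates here.
	for _, p := range candidates {
		_ = p // ... move p to activeQ/backoffQ (omitted) ...
	}
}
```

Note the window between releasing the read lock and taking the write lock: needing to re-validate the candidate set there is part of what makes this kind of change risky to backport.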
We are only reading
Right, I get that. My point is that it sounds very risky. Definitely not a change that I would be comfortable backporting. And to make the change in master, I would like to see integration tests that exercise parallel pod deletions and creations. But I'm not really sure what the test should look like. A fix targeting gated pods specifically can be much less risky.
It seems the major contributing factors are related to queueingHint's (core) bits. This means that although the feature gate is disabled, the core logic cannot be disabled. Given that it's not easy to switch back to the state before the queueingHint core logic was introduced, I agree with Aldo that handling gated pods specifically is a portable fix. @gabesaba FYI, the original (v1.27) impl. is here, which merely depends on a lightweight podMatchesEvent() function.
OK, so we're going to cherry-pick. Then I agree with the quick patch for gated pods proposed earlier.
/assign
I'll explore a general fix, including the tests suggested by Aldo.
So what's our conclusion now? Make a quick fix that separates gated pods from the unschedulable pool, and in the meantime try to find a general fix?
Btw: I think the title is a bit vague - throughput means the rate at which the main scheduling goroutine handles pods, while in this issue it's more about the rate at which the scheduling queue moves pods. @alculquicondor @sanposhiho @gabesaba I'd suggest rewording the title to avoid confusion.
Though this issue is impacting the rate at which we schedule pods - see the graphs in #124384 (comment). Perhaps due (edit: see #124384 (comment) below)
We observed increased scheduling latency. And this is because both the event handlers and the scheduling cycle need to access the lock (once for getting the head of the queue, and once for assuming). |
We acquire the scheduling queue's lock in these two places - though the 2nd is probably irrelevant as it is done async:
kubernetes/pkg/scheduler/schedule_one.go, line 69 in 0011756
kubernetes/pkg/scheduler/schedule_one.go, line 133 in 0011756
From my view, this problem highlights two aspects:
Based on them, action items:
(Probably I should create a separate issue for the second one(s), rather than putting too much stuff in this issue.)
TBH, we don't have proof of that. We didn't test 1.25. But it's still worth investigating how to improve the performance overall.
I agree with all of this. TBH, profiling seems to be the best tool for finding the hotspots. And then metrics can help us monitor in the future.
Maybe this is a good step forward, and we could potentially guard it with the SchedulingHints feature gate.
and also the duration each
Ah, okay. I thought it was a degradation from the past version.
Agree with all.
created: #124566
Oh wait, was 1.28 the version that added hints for the first time? I got confused by this line https://github.com/kubernetes/enhancements/blob/3feb698013b4c227686a972e1da7e93e5fd0dcee/keps/sig-scheduling/4247-queueinghint/kep.yaml#L24
I reran the experiment from #124384 (comment) with v1.27.11-gke.1062000, and indeed saw better performance (still observed some slowdown) |
/reopen
I'll work on the general fix mentioned in #124384 (comment).
@sanposhiho: Reopened this issue.
What happened?
When handling a pod delete event, we attempt to remove pods from the unschedulablePods pool [1], taking a lock while doing so [2]. Gated pods reside in this unschedulable pool, and we process all of them for each delete event. Scheduling throughput is affected, likely because scheduleOne also takes this lock [3].
[1] kubernetes/pkg/scheduler/eventhandlers.go, lines 244 to 249 in 0011756
[2] kubernetes/pkg/scheduler/internal/queue/scheduling_queue.go, lines 1093 to 1097 in 0011756
[3] kubernetes/pkg/scheduler/schedule_one.go, line 69 in 0011756
What did you expect to happen?
Scheduling throughput should not decrease when there are many gated pods. We eventually decide against moving these pods, but only after some processing:
kubernetes/pkg/scheduler/internal/queue/scheduling_queue.go, lines 1178 to 1183 in be4b717
How can we reproduce it (as minimally and precisely as possible)?
Create many (~10,000) gated pods. Then create non-gated pods which eventually terminate. As these pods finish and the pod delete events are processed, scheduling throughput decreases.
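A sketch of one way to create the gated pods with client-go; the gate name, namespace, image, and count are illustrative choices, while `SchedulingGates` itself is the real `PodSpec` field:

```go
package main

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig from the default location (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	for i := 0; i < 10000; i++ {
		pod := &v1.Pod{
			ObjectMeta: metav1.ObjectMeta{Name: fmt.Sprintf("gated-%d", i)},
			Spec: v1.PodSpec{
				// The gate keeps the pod Pending; the scheduler parks it in
				// the unschedulable pool until the gate is removed.
				SchedulingGates: []v1.PodSchedulingGate{{Name: "example.com/hold"}},
				Containers: []v1.Container{{
					Name:  "pause",
					Image: "registry.k8s.io/pause:3.9",
				}},
			},
		}
		if _, err := client.CoreV1().Pods("default").Create(
			context.TODO(), pod, metav1.CreateOptions{}); err != nil {
			panic(err)
		}
	}
}
```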
Anything else we need to know?
No response
Kubernetes version
Cloud provider
OS version
No response
Install tools
No response
Container runtime (CRI) and version (if applicable)
No response
Related plugins (CNI, CSI, ...) and versions (if applicable)
No response