scheduler: protect concurrent access to PodInfo.Pod #115777

Closed

Conversation

pohly
Contributor

@pohly pohly commented Feb 14, 2023

What type of PR is this?

/kind bug

What this PR does / why we need it:

The pointer gets updated by informer callbacks in parallel to pod scheduling, without taking any locks. This showed up in a scheduler_perf test case involving dynamic resource allocation because pod scheduling fails there, which (for the first time in those benchmarks?) exercises additional code paths.

There's no obvious lock that could protect this field. Even if there were one, finding all code locations which currently don't hold it when accessing the field would be difficult. Therefore the field is converted to an atomic.Pointer to fix this problem: the type change causes compile errors in (most) users of the field which haven't been updated to call Load or Store, respectively.

The one exception is when the field is passed to functions taking an interface{}: such code compiles and then fails at runtime.
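
For illustration only, here is a minimal, self-contained sketch of that pattern. The names (pod, podInfo) are hypothetical stand-ins for v1.Pod and framework.PodInfo, not the actual PR diff:

package main

import (
	"fmt"
	"sync/atomic"
)

// pod stands in for *v1.Pod.
type pod struct {
	Name            string
	ResourceVersion string
}

// podInfo mirrors the idea: the bare `Pod *pod` field becomes an atomic
// pointer that must be accessed through Load and Store.
type podInfo struct {
	pod atomic.Pointer[pod]
}

// Update stores a new snapshot; safe to call from an informer callback.
func (pi *podInfo) Update(p *pod) {
	pi.pod.Store(p)
}

// Pod returns the current snapshot; safe to call from scheduling goroutines.
func (pi *podInfo) Pod() *pod {
	return pi.pod.Load()
}

func main() {
	pi := &podInfo{}
	pi.Update(&pod{Name: "test-pod", ResourceVersion: "1"})
	// Old-style field accesses such as `pi.pod.Name` no longer compile, which
	// is how the type change flushes out unconverted readers. As noted above,
	// the blind spot is code that passes the field to something taking
	// interface{}: that still compiles and only misbehaves at runtime.
	fmt.Println(pi.Pod().Name, pi.Pod().ResourceVersion)
}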

The race report was (line numbers do not exactly match current master):

$ go test -v -timeout=0 -bench=.*/SchedulingWithInlineResourceClaims/ -race ./test/integration/scheduler_perf

WARNING: DATA RACE
Write at 0x00c000baafc0 by goroutine 25750:
  k8s.io/kubernetes/pkg/scheduler/framework.(*PodInfo).Update()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/framework/types.go:216 +0xd2e
  k8s.io/kubernetes/pkg/scheduler/internal/queue.updatePod()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/internal/queue/scheduling_queue.go:903 +0x5a4
  k8s.io/kubernetes/pkg/scheduler/internal/queue.(*PriorityQueue).Update()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/internal/queue/scheduling_queue.go:620 +0x582
  k8s.io/kubernetes/pkg/scheduler.(*Scheduler).updatePodInSchedulingQueue()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/eventhandlers.go:140 +0x359
  k8s.io/kubernetes/pkg/scheduler.(*Scheduler).updatePodInSchedulingQueue-fm()
      <autogenerated>:1 +0x6d
  k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache/controller.go:246 +0x8b
  k8s.io/client-go/tools/cache.(*ResourceEventHandlerFuncs).OnUpdate()
      <autogenerated>:1 +0x29
  k8s.io/client-go/tools/cache.FilteringResourceEventHandler.OnUpdate()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache/controller.go:311 +0xf7
  k8s.io/client-go/tools/cache.(*FilteringResourceEventHandler).OnUpdate()
      <autogenerated>:1 +0x8d
  k8s.io/client-go/tools/cache.(*processorListener).run.func1()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache/shared_informer.go:973 +0x21e
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:157 +0x48
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:158 +0xce
  k8s.io/apimachinery/pkg/util/wait.JitterUntil()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:135 +0x10d
  k8s.io/apimachinery/pkg/util/wait.Until()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:92 +0x78
  k8s.io/client-go/tools/cache.(*processorListener).run()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache/shared_informer.go:969 +0x18
  k8s.io/client-go/tools/cache.(*processorListener).run-fm()
      <autogenerated>:1 +0x39
  k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:75 +0x73

Previous read at 0x00c000baafc0 by goroutine 25981:
  k8s.io/kubernetes/pkg/scheduler/internal/queue.(*nominator).add()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/internal/queue/scheduling_queue.go:1024 +0xda4
  k8s.io/kubernetes/pkg/scheduler/internal/queue.(*nominator).AddNominatedPod()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/internal/queue/scheduling_queue.go:841 +0x58
  k8s.io/kubernetes/pkg/scheduler/internal/queue.(*PriorityQueue).AddNominatedPod()
      <autogenerated>:1 +0x63
  k8s.io/kubernetes/pkg/scheduler.(*Scheduler).handleSchedulingFailure()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/schedule_one.go:916 +0x1a34
  k8s.io/kubernetes/pkg/scheduler.(*Scheduler).handleSchedulingFailure-fm()
      <autogenerated>:1 +0xcc
  k8s.io/kubernetes/pkg/scheduler.(*Scheduler).scheduleOne()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/schedule_one.go:97 +0x962
  k8s.io/kubernetes/pkg/scheduler.(*Scheduler).scheduleOne-fm()
      <autogenerated>:1 +0x4d
  k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:190 +0x4c
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:157 +0x48
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:158 +0xce
  k8s.io/apimachinery/pkg/util/wait.JitterUntil()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:135 +0x10d
  k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:190 +0xaa
  k8s.io/apimachinery/pkg/util/wait.UntilWithContext()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:101 +0x52
  k8s.io/kubernetes/pkg/scheduler.(*Scheduler).Run.func1()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/scheduler.go:357 +0x5a

Which issue(s) this PR fixes:

Fixes #115762

Special notes for your reviewer:

This contains one "govet" suppression because of google/go-cmp#310 (comment)

Does this PR introduce a user-facing change?

scheduler: fixed a data race

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 14, 2023
@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 14, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Feb 14, 2023
@Huang-Wei
Member

@pohly is there a PR where your local changes are ready, so that I can take a look and reproduce this issue?

I'm a bit skeptical that DynamicResourceAllocation is unique in triggering this issue, and I'd like to understand why before jumping to a decision on whether or not to use atomic.Pointer.

@pohly
Contributor Author

pohly commented Feb 15, 2023

You can reproduce with my dra-integration-tests branch (see https://github.com/kubernetes/kubernetes/compare/master...pohly:dra-integration-tests?expand=1). I've cleaned it up so that general enhancements and fixes come first, then at the end comes the actual dynamic resource allocation testing. I want to get some of those changes merged separately before I create a PR. I also need to get a new klog release done and updated in Kubernetes.

As far as I can tell, none of the existing test cases involve the handling of pod scheduling failures. Therefore they don't hit this particular race (handleSchedulingFailure and AddNominatedPod are never called). So it's not dynamic resource allocation per se that causes this; it just exercises additional code paths.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 16, 2023
@k8s-ci-robot k8s-ci-robot added area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Mar 7, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: pohly
Once this PR has been reviewed and has the lgtm label, please assign alculquicondor for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@pohly
Contributor Author

pohly commented Mar 7, 2023

/retest

Unrelated test flake (TestChangeCRD).

@pohly pohly changed the title WIP: scheduler: protect concurrent access to PodInfo.Pod scheduler: protect concurrent access to PodInfo.Pod Mar 7, 2023
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 7, 2023
@Huang-Wei
Member

I need some time to check the root cause and think more about the current version of the PR, as it touches a lot of places.

This showed up in a scheduler_perf test case involving dynamic resource allocation because pod scheduling fails there, which (for the first time in those benchmarks?) exercises additional code paths.

The perf integration test suite PreemptionBasic should have hit the code paths of concurrent pod update + failure handling.

cc @ahg-g @alculquicondor

@Huang-Wei
Member

By looking at the race log:

WARNING: DATA RACE
Write at 0x00c000baafc0 by goroutine 25750:
  k8s.io/kubernetes/pkg/scheduler/framework.(*PodInfo).Update()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/framework/types.go:216 +0xd2e
  k8s.io/kubernetes/pkg/scheduler/internal/queue.updatePod()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/internal/queue/scheduling_queue.go:903 +0x5a4
  k8s.io/kubernetes/pkg/scheduler/internal/queue.(*PriorityQueue).Update()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/internal/queue/scheduling_queue.go:620 +0x582
  k8s.io/kubernetes/pkg/scheduler.(*Scheduler).updatePodInSchedulingQueue()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/eventhandlers.go:140 +0x359
  k8s.io/kubernetes/pkg/scheduler.(*Scheduler).updatePodInSchedulingQueue-fm()

It seems it comes from one of the updatePod() calls inside:

func (p *PriorityQueue) Update(oldPod, newPod *v1.Pod) error {

@pohly could you elaborate how this update is triggered in the DRA integration test, and which fields are updated?

@pohly
Contributor Author

pohly commented Mar 8, 2023

I'm not hitting the race with the current code from #116207 (= DRA testcases, no optimizations). I do see it with https://github.com/kubernetes/kubernetes/compare/master...pohly:dra-integration-tests?expand=1. Without those optimizations the timing is vastly different, so perhaps that is hiding the race?

It could of course also mean that some of the additional changes cause it.

I can't say for sure which pod update is the one that triggers the race. When I add fmt.Printf("pod %s updated:\n%s\n", pInfo.Pod.Name, cmp.Diff(pInfo.Pod, newPod)) to updatePod, I get diffs like this:

pod test-dra-169 updated:
  &v1.Pod{
  	TypeMeta: {},
  	ObjectMeta: v1.ObjectMeta{
  		... // 3 identical fields
  		SelfLink:          "",
  		UID:               "2808f4d3-02aa-46f3-a2ab-6cf8824deb42",
- 		ResourceVersion:   "11507",
+ 		ResourceVersion:   "14120",
  		Generation:        0,
  		CreationTimestamp: {Time: s"2023-03-08 15:04:41 +0100 CET"},
  		... // 4 identical fields
  		OwnerReferences: nil,
  		Finalizers:      nil,
  		ManagedFields: []v1.ManagedFieldsEntry{
  			{Manager: "scheduler_perf.test", Operation: "Update", APIVersion: "v1", Time: s"2023-03-08 15:04:41 +0100 CET", ...},
+ 			{
+ 				Manager:     "scheduler_perf.test",
+ 				Operation:   "Update",
+ 				APIVersion:  "v1",
+ 				Time:        s"2023-03-08 15:04:50 +0100 CET",
+ 				FieldsType:  "FieldsV1",
+ 				FieldsV1:    s`{"f:status":{"f:conditions":{".":{},"k:{\"type\":\"PodScheduled\"}":{".":{},"f:lastProbeTime":{},"f:lastTransitionTime":{},"f:me`...,
+ 				Subresource: "status",
+ 			},
  		},
  	},
  	Spec: {Containers: {{Name: "pause", Image: "registry.k8s.io/pause:3.9", Resources: {Claims: {{Name: "resource"}}}, TerminationMessagePath: "/dev/termination-log", ...}}, RestartPolicy: "Always", TerminationGracePeriodSeconds: &30, DNSPolicy: "ClusterFirst", ...},
  	Status: v1.PodStatus{
  		Phase:      "Pending",
- 		Conditions: nil,
+ 		Conditions: []v1.PodCondition{
+ 			{
+ 				Type:               "PodScheduled",
+ 				Status:             "False",
+ 				LastTransitionTime: s"2023-03-08 15:04:50 +0100 CET",
+ 				Reason:             "SchedulerError",
+ 				Message:            `running Reserve plugin "DynamicResources": waiting for resource `...,
+ 			},
+ 		},
  		Message: "",
  		Reason:  "",
  		... // 10 identical fields
  	},
  }

@pohly
Contributor Author

pohly commented Mar 8, 2023

I also get a different race report for gRPC logging.

Once #116207 is merged, I intend to investigate how some small-scale test scenarios can be run as a unit test. Then normal unit testing would run with race detection, and we would have a chance to detect problems like the one I already fixed without having to check manually.
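
For reference, such unit tests would then be exercised with the race detector via a plain go test invocation, for example (package path chosen for illustration):

go test -race ./pkg/scheduler/internal/queue/...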

@pohly
Contributor Author

pohly commented Mar 8, 2023

I was able to trigger the same race also with PreemptionBasic and current master (= 0e9ad24). I must admit that I am a bit relieved that it really is not specific to DRA and also occurs without any of my upcoming changes 😅

Not sure whether it matters, but I had added the following:

diff --git a/pkg/scheduler/internal/queue/scheduling_queue.go b/pkg/scheduler/internal/queue/scheduling_queue.go
index 1f639b5c421..806d78f0a68 100644
--- a/pkg/scheduler/internal/queue/scheduling_queue.go
+++ b/pkg/scheduler/internal/queue/scheduling_queue.go
@@ -33,6 +33,7 @@ import (
        "sync"
        "time"
 
+       "github.com/google/go-cmp/cmp"
        v1 "k8s.io/api/core/v1"
        "k8s.io/apimachinery/pkg/labels"
        "k8s.io/apimachinery/pkg/types"
@@ -897,6 +898,7 @@ func (p *PriorityQueue) calculateBackoffDuration(podInfo *framework.QueuedPodInf
 
 func updatePod(oldPodInfo interface{}, newPod *v1.Pod) *framework.QueuedPodInfo {
        pInfo := oldPodInfo.(*framework.QueuedPodInfo)
+       fmt.Printf("pod %s updated:\n%s\n", pInfo.Pod.Name, cmp.Diff(pInfo.Pod, newPod))
        pInfo.Update(newPod)
        return pInfo
 }
$ go test -v -timeout=0 -bench=.*/PreemptionBasic -race -short=true -benchtime=1ns ./test/integration/scheduler_perf
...
pod pod-6t9mj updated:
  &v1.Pod{
        TypeMeta: {},
        ObjectMeta: v1.ObjectMeta{
                ... // 3 identical fields
                SelfLink:          "",
                UID:               "02e593f8-903b-49ea-b19a-c7f2a63d5a46",
-               ResourceVersion:   "6685",
+               ResourceVersion:   "7276",
                Generation:        0,
                CreationTimestamp: {Time: s"2023-03-08 17:53:17 +0100 CET"},
                ... // 4 identical fields
                OwnerReferences: nil,
                Finalizers:      nil,
                ManagedFields: []v1.ManagedFieldsEntry{
                        {Manager: "scheduler_perf.test", Operation: "Update", APIVersion: "v1", Time: s"2023-03-08 17:53:17 +0100 CET", ...},
+                       {
+                               Manager:     "scheduler_perf.test",
+                               Operation:   "Update",
+                               APIVersion:  "v1",
+                               Time:        s"2023-03-08 17:53:17 +0100 CET",
+                               FieldsType:  "FieldsV1",
+                               FieldsV1:    s`{"f:status":{"f:conditions":{".":{},"k:{\"type\":\"PodScheduled\"}":{".":{},"f:lastProbeTime":{},"f:lastTransitionTime":{},"f:
me`...,
+                               Subresource: "status",
+                       },
                },
        },
        Spec: {Containers: {{Name: "pause", Image: "registry.k8s.io/pause:3.9", Ports: {{ContainerPort: 80, Protocol: "TCP"}}, Resources: {Limits: {s"cpu": {i: {...}, s: "3"
, Format: "DecimalSI"}, s"memory": {i: {...}, s: "500Mi", Format: "BinarySI"}}, Requests: {s"cpu": {i: {...}, s: "3", Format: "DecimalSI"}, s"memory": {i: {...}, s: "500Mi",
 Format: "BinarySI"}}}, ...}}, RestartPolicy: "Always", TerminationGracePeriodSeconds: &30, DNSPolicy: "ClusterFirst", ...},
        Status: v1.PodStatus{
                Phase:      "Pending",
-               Conditions: nil,
+               Conditions: []v1.PodCondition{
+                       {
+                               Type:               "PodScheduled",
+                               Status:             "False",
+                               LastTransitionTime: s"2023-03-08 17:53:17 +0100 CET",
+                               Reason:             "Unschedulable",
+                               Message:            "0/500 nodes are available: 500 Insufficient cpu.",
+                       },
+               },
                Message:           "",
                Reason:            "",
-               NominatedNodeName: "",
+               NominatedNodeName: "node-0-8vql9",
                HostIP:            "",
                PodIP:             "",
                ... // 7 identical fields
        },
  }

==================
WARNING: DATA RACE
Write at 0x00c00012db90 by goroutine 4837:
  k8s.io/kubernetes/pkg/scheduler/framework.(*PodInfo).Update()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/framework/types.go:155 +0xd2e
  k8s.io/kubernetes/pkg/scheduler/internal/queue.updatePod()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/internal/queue/scheduling_queue.go:902 +0x1a6
  k8s.io/kubernetes/pkg/scheduler/internal/queue.(*PriorityQueue).Update()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/internal/queue/scheduling_queue.go:620 +0xfd6
  k8s.io/kubernetes/pkg/scheduler.(*Scheduler).updatePodInSchedulingQueue()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/eventhandlers.go:140 +0x359
  k8s.io/kubernetes/pkg/scheduler.(*Scheduler).updatePodInSchedulingQueue-fm()
      <autogenerated>:1 +0x6d
  k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache/controller.go:246 +0x8b
  k8s.io/client-go/tools/cache.(*ResourceEventHandlerFuncs).OnUpdate()
      <autogenerated>:1 +0x29
  k8s.io/client-go/tools/cache.FilteringResourceEventHandler.OnUpdate()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache/controller.go:311 +0xf7
  k8s.io/client-go/tools/cache.(*FilteringResourceEventHandler).OnUpdate()
      <autogenerated>:1 +0x8d
  k8s.io/client-go/tools/cache.(*processorListener).run.func1()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache/shared_informer.go:973 +0x21e
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:157 +0x48
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:158 +0xce
  k8s.io/apimachinery/pkg/util/wait.JitterUntil()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:135 +0x10d
  k8s.io/apimachinery/pkg/util/wait.Until()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:92 +0x78
  k8s.io/client-go/tools/cache.(*processorListener).run()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache/shared_informer.go:969 +0x18
  k8s.io/client-go/tools/cache.(*processorListener).run-fm()
      <autogenerated>:1 +0x39
  k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:75 +0x73

Previous read at 0x00c00012db90 by goroutine 173610:
  k8s.io/kubernetes/pkg/scheduler/framework.(*PodInfo).DeepCopy()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/framework/types.go:141 +0x38
  k8s.io/kubernetes/pkg/scheduler/internal/queue.(*nominator).NominatedPodsForNode()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/internal/queue/scheduling_queue.go:851 +0x21a
  k8s.io/kubernetes/pkg/scheduler/framework/runtime.(*frameworkImpl).NominatedPodsForNode()
      <autogenerated>:1 +0x7d
  k8s.io/kubernetes/pkg/scheduler/framework/runtime.addNominatedPods()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/framework/runtime/framework.go:871 +0x105
  k8s.io/kubernetes/pkg/scheduler/framework/runtime.(*frameworkImpl).RunFilterPluginsWithNominatedPods()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/framework/runtime/framework.go:846 +0x174
  k8s.io/kubernetes/pkg/scheduler/framework/plugins/defaultpreemption.(*DefaultPreemption).SelectVictimsOnNode()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go:188 +0x6cc
  k8s.io/kubernetes/pkg/scheduler/framework/preemption.(*Evaluator).DryRunPreemption.func1()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/framework/preemption/preemption.go:568 +0x206
  k8s.io/kubernetes/pkg/scheduler/framework/parallelize.Parallelizer.Until.func1()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/framework/parallelize/parallelism.go:60 +0x1af
  k8s.io/client-go/util/workqueue.ParallelizeUntil.func1()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/client-go/util/workqueue/parallelizer.go:90 +0x1c1

Goroutine 4837 (running) created at:
  k8s.io/apimachinery/pkg/util/wait.(*Group).Start()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:73 +0xe4
  k8s.io/client-go/tools/cache.(*sharedProcessor).run.func1()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache/shared_informer.go:799 +0x1e7
  k8s.io/client-go/tools/cache.(*sharedProcessor).run()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache/shared_informer.go:803 +0x51
  k8s.io/client-go/tools/cache.(*sharedProcessor).run-fm()
      <autogenerated>:1 +0x44
  k8s.io/apimachinery/pkg/util/wait.(*Group).StartWithChannel.func1()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:58 +0x3e
  k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:75 +0x73

Goroutine 173610 (finished) created at:
  k8s.io/client-go/util/workqueue.ParallelizeUntil()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/client-go/util/workqueue/parallelizer.go:76 +0x30f
  k8s.io/kubernetes/pkg/scheduler/framework/parallelize.Parallelizer.Until()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/framework/parallelize/parallelism.go:63 +0x1b6
  k8s.io/kubernetes/pkg/scheduler/framework/preemption.(*Evaluator).DryRunPreemption()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/framework/preemption/preemption.go:599 +0x755
  k8s.io/kubernetes/pkg/scheduler/framework/preemption.(*Evaluator).findCandidates()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/framework/preemption/preemption.go:244 +0x6ed
  k8s.io/kubernetes/pkg/scheduler/framework/preemption.(*Evaluator).Preempt()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/framework/preemption/preemption.go:171 +0x4d9
  k8s.io/kubernetes/pkg/scheduler/framework/plugins/defaultpreemption.(*DefaultPreemption).PostFilter()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go:99 +0x244
  k8s.io/kubernetes/pkg/scheduler/framework/runtime.(*frameworkImpl).runPostFilterPlugin()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/framework/runtime/framework.go:801 +0xd1
  k8s.io/kubernetes/pkg/scheduler/framework/runtime.(*frameworkImpl).RunPostFilterPlugins()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/framework/runtime/framework.go:776 +0x42f
  k8s.io/kubernetes/pkg/scheduler.(*Scheduler).schedulingCycle()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/schedule_one.go:154 +0x2a6
  k8s.io/kubernetes/pkg/scheduler.(*Scheduler).scheduleOne()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/schedule_one.go:95 +0x544
  k8s.io/kubernetes/pkg/scheduler.(*Scheduler).scheduleOne-fm()
      <autogenerated>:1 +0x4d
  k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:190 +0x4c
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:157 +0x48
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:158 +0xce
  k8s.io/apimachinery/pkg/util/wait.JitterUntil()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:135 +0x10d
  k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:190 +0xaa
  k8s.io/apimachinery/pkg/util/wait.UntilWithContext()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:101 +0x52
  k8s.io/kubernetes/pkg/scheduler.(*Scheduler).Run.func1()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/scheduler.go:358 +0x5a
==================

@alculquicondor
Member

I agree that there is a bug:
At any point a Pod could be updated, and we are trying to update the fields in QueuedPodInfo in place.

But I disagree with the solution. Instead, we should instantiate a new QueuedPodInfo and put that one in the scheduling queue. We are not saving many allocations if we are calling DeepCopy anyway.
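
A rough sketch of that copy-instead-of-mutate idea, using simplified stand-in types rather than the real PriorityQueue and framework.QueuedPodInfo (illustrative only, not the change that eventually landed):

package main

import (
	"fmt"
	"sync"
)

// Simplified stand-ins for v1.Pod and framework.QueuedPodInfo.
type pod struct {
	Name            string
	ResourceVersion string
}

type queuedPodInfo struct {
	pod      *pod
	attempts int // example of scheduling bookkeeping carried along
}

// priorityQueue is a toy version of the scheduler's queue: entries guarded by
// a single lock, as in PriorityQueue.Update.
type priorityQueue struct {
	lock  sync.Mutex
	items map[string]*queuedPodInfo
}

// Update builds a fresh entry and swaps it in instead of mutating the old one,
// so readers that already hold the old entry never see a half-written update.
func (p *priorityQueue) Update(newPod *pod) {
	p.lock.Lock()
	defer p.lock.Unlock()
	fresh := &queuedPodInfo{pod: newPod}
	if old := p.items[newPod.Name]; old != nil {
		fresh.attempts = old.attempts // carry over bookkeeping
	}
	p.items[newPod.Name] = fresh
}

func main() {
	q := &priorityQueue{items: map[string]*queuedPodInfo{
		"pod-a": {pod: &pod{Name: "pod-a", ResourceVersion: "1"}, attempts: 2},
	}}
	stale := q.items["pod-a"] // e.g. an entry the nominator still references
	q.Update(&pod{Name: "pod-a", ResourceVersion: "2"})
	fmt.Println(stale.pod.ResourceVersion)            // still "1": never mutated
	fmt.Println(q.items["pod-a"].pod.ResourceVersion) // "2"
}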

@alculquicondor
Member

The current approach is going to hide race conditions in other fields of PodInfo.

@alculquicondor
Member

alculquicondor commented Mar 8, 2023

Or you could protect QueuedPodInfo.PodInfo with an atomic pointer, instead of Pod itself. Probably worth having a method in PodInfo to get the pod, instead of calling Load everywhere.
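
A minimal sketch of that alternative, again with hypothetical stand-in types and an accessor method so callers don't spell out Load everywhere (not the actual API):

package main

import (
	"fmt"
	"sync/atomic"
)

// Simplified stand-ins for v1.Pod and framework.PodInfo.
type pod struct {
	Name            string
	ResourceVersion string
}

type podInfo struct {
	Pod *pod
	// Other derived fields would live here too, so swapping the whole PodInfo
	// covers them in one atomic step.
}

// queuedPodInfo keeps the PodInfo behind an atomic pointer: informer callbacks
// Store a replacement, scheduling code Loads a consistent snapshot.
type queuedPodInfo struct {
	info atomic.Pointer[podInfo]
}

// PodInfo is the accessor method suggested above.
func (q *queuedPodInfo) PodInfo() *podInfo { return q.info.Load() }

// SetPodInfo replaces the snapshot; the previous one stays valid for readers.
func (q *queuedPodInfo) SetPodInfo(pi *podInfo) { q.info.Store(pi) }

func main() {
	q := &queuedPodInfo{}
	q.SetPodInfo(&podInfo{Pod: &pod{Name: "pod-a", ResourceVersion: "1"}})
	snapshot := q.PodInfo() // consistent view even if SetPodInfo runs concurrently
	q.SetPodInfo(&podInfo{Pod: &pod{Name: "pod-a", ResourceVersion: "2"}})
	fmt.Println(snapshot.Pod.ResourceVersion, q.PodInfo().Pod.ResourceVersion)
}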

@alculquicondor
Member

I fear that this problem might have been introduced when we first added QueuedPodInfo. So we will need to cherry-pick a solution as far back as possible. cc @ahg-g for thoughts

@pohly let me know if you need a hand.

@pohly
Contributor Author

pohly commented Mar 8, 2023

The current approach is going to hide race conditions in other fields of PodInfo.

Why that? If other fields suffer from the same problem, they will show up as a data race the same way as updating Pod did.

But I disagree with the solution. Instead, we should instantiate a new QueuedPodInfo and put that one in the scheduling queue. We are not saving many allocations if we are calling DeepCopy anyway.

A new PodInfo or QueuedPodInfo?

But either way I am not sure whether that will fix the problem completely. The PodInfo instance being updated by updatePod is going to be read by someone, otherwise why update it? Without updatePod taking some kind of lock during its write, that other read is always going to race with it.

let me know if you need a hand.

I do not fully understand what a different solution would look like yet, so help would definitely be needed right now.

Regarding backporting the current atomic.Pointer change, there will be a snag: "go vet" in older Kubernetes releases will complain about the cmp.Diff transformer, and those older Kubernetes releases have no way of suppressing this.

@alculquicondor
Member

Why that? If other fields suffer from the same problem, they will show up as a data race the same way as updating Pod did.

It is possible that a race condition in other fields is harder to reproduce, but there is the same pattern: updatePod changes those fields, and they are read during scheduling.

A new PodInfo or QueuedPodInfo?

and

But either way I am not sure whether that will fix the problem completely.

That's why I suggest having a new QueuedPodInfo. We are already getting a lock for updating the scheduling queue. We are just not locking updates to things inside QueuedPodInfo.

func (p *PriorityQueue) Update(oldPod, newPod *v1.Pod) error {
	p.lock.Lock()
	defer p.lock.Unlock()

This also allows us to cherry-pick without relying on the generic pointer.

@pohly
Contributor Author

pohly commented Mar 8, 2023

When taking that lock, why is it necessary to handle PodInfo and QueuedPodInfo instances differently?

@pohly
Contributor Author

pohly commented Mar 8, 2023

It is possible that a race condition in other fields is harder to reproduce, but there is the same pattern: updatePod changes those fields, and they are read during scheduling.

So the argument was about the completeness of the fix, not whether it makes other similar problems harder to find. I agree, if such fields exist, then protecting PodInfo as a whole is necessary and better.

@pohly
Contributor Author

pohly commented Mar 8, 2023

Is the read path taking that lock already? This here:

Previous read at 0x00c00012db90 by goroutine 173610:
  k8s.io/kubernetes/pkg/scheduler/framework.(*PodInfo).DeepCopy()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/framework/types.go:141 +0x38
  k8s.io/kubernetes/pkg/scheduler/internal/queue.(*nominator).NominatedPodsForNode()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/internal/queue/scheduling_queue.go:851 +0x21a
  k8s.io/kubernetes/pkg/scheduler/framework/runtime.(*frameworkImpl).NominatedPodsForNode()
      <autogenerated>:1 +0x7d
  k8s.io/kubernetes/pkg/scheduler/framework/runtime.addNominatedPods()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/framework/runtime/framework.go:871 +0x105
...

@alculquicondor
Member

When taking that lock, why is it necessary to handle PodInfo and QueuedPodInfo instances differently?

I didn't understand the question.

Is the read path taking that lock already? This here:

Oh... the problem is that the nominator has its own lock, separate from the PriorityQueue's.

That's clearly a problem. It's unclear to me if this is the only problem.

@pohly
Contributor Author

pohly commented Mar 9, 2023

I was asking about the read path locking because that question then needs to be answered for all code which reads QueuedPodInfo. NominatedPodsForNode is where we have seen the race. What about the code which works on the QueuedPodInfo returned by SchedulingQueue.Pop? Will that code be handed a new, unshared copy, or is it going to be the same instance that Update writes into? Currently it is the same instance, with no locking.

I started protecting the pod field with atomic.Pointer because then (in most cases) the compiler will reject incorrect code. Introducing some additional rules like "access QueuedPodInfo while holding a mutex" will be harder to verify. It may be the right approach; I just wanted to call out the challenge associated with it.

@alculquicondor
Member

Those are valid concerns. I think we need to focus on a hotfix for now, especially given that we can't backport the usage of atomic.Pointer.

Pop is not a problem precisely because we are popping it out from the queue. So potential parallel updates while scheduling is happening would create a new object in the queue.

@pohly
Contributor Author

pohly commented Mar 9, 2023

#116395 fixes the same problem.

@pohly pohly closed this Mar 9, 2023
@pohly pohly mentioned this pull request Aug 1, 2023