scheduler: protect concurrent access to PodInfo.Pod #115777

Closed

Conversation

pohly
Contributor

@pohly pohly commented Feb 14, 2023

What type of PR is this?

/kind bug

What this PR does / why we need it:

The pointer gets updated by informer callbacks in parallel to pod scheduling, without taking any locks. This showed up in a scheduler_perf test case involving dynamic resource allocation because pod scheduling fails there, which (for the first time in those benchmarks?) exercises additional code paths.

There's no obvious lock that could protect this field. Even if there were one, finding all code locations which currently don't hold it when accessing the field would be difficult. Therefore the field is converted to an atomic.Pointer to fix this problem: the type change causes compile errors in (most) users of the field which haven't been updated to call Load or Store, respectively.

The one exception is when the field is passed to functions taking an interface{}: such code compiles and then fails at runtime.
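
For illustration only, here is a minimal, self-contained sketch of that pattern. The names (pod, podInfo) are hypothetical stand-ins for v1.Pod and framework.PodInfo, not the actual PR diff:

package main

import (
	"fmt"
	"sync/atomic"
)

// pod stands in for *v1.Pod.
type pod struct {
	Name            string
	ResourceVersion string
}

// podInfo mirrors the idea: the bare `Pod *pod` field becomes an atomic
// pointer that must be accessed through Load and Store.
type podInfo struct {
	pod atomic.Pointer[pod]
}

// Update stores a new snapshot; safe to call from an informer callback.
func (pi *podInfo) Update(p *pod) {
	pi.pod.Store(p)
}

// Pod returns the current snapshot; safe to call from scheduling goroutines.
func (pi *podInfo) Pod() *pod {
	return pi.pod.Load()
}

func main() {
	pi := &podInfo{}
	pi.Update(&pod{Name: "test-pod", ResourceVersion: "1"})
	// Old-style field accesses such as `pi.pod.Name` no longer compile, which
	// is how the type change flushes out unconverted readers. As noted above,
	// the blind spot is code that passes the field to something taking
	// interface{}: that still compiles and only misbehaves at runtime.
	fmt.Println(pi.Pod().Name, pi.Pod().ResourceVersion)
}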

The race report was (line numbers do not exactly match current master):

$ go test -v -timeout=0 -bench=.*/SchedulingWithInlineResourceClaims/ -race ./test/integration/scheduler_perf

WARNING: DATA RACE
Write at 0x00c000baafc0 by goroutine 25750:
  k8s.io/kubernetes/pkg/scheduler/framework.(*PodInfo).Update()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/framework/types.go:216 +0xd2e
  k8s.io/kubernetes/pkg/scheduler/internal/queue.updatePod()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/internal/queue/scheduling_queue.go:903 +0x5a4
  k8s.io/kubernetes/pkg/scheduler/internal/queue.(*PriorityQueue).Update()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/internal/queue/scheduling_queue.go:620 +0x582
  k8s.io/kubernetes/pkg/scheduler.(*Scheduler).updatePodInSchedulingQueue()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/eventhandlers.go:140 +0x359
  k8s.io/kubernetes/pkg/scheduler.(*Scheduler).updatePodInSchedulingQueue-fm()
      <autogenerated>:1 +0x6d
  k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache/controller.go:246 +0x8b
  k8s.io/client-go/tools/cache.(*ResourceEventHandlerFuncs).OnUpdate()
      <autogenerated>:1 +0x29
  k8s.io/client-go/tools/cache.FilteringResourceEventHandler.OnUpdate()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache/controller.go:311 +0xf7
  k8s.io/client-go/tools/cache.(*FilteringResourceEventHandler).OnUpdate()
      <autogenerated>:1 +0x8d
  k8s.io/client-go/tools/cache.(*processorListener).run.func1()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache/shared_informer.go:973 +0x21e
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:157 +0x48
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:158 +0xce
  k8s.io/apimachinery/pkg/util/wait.JitterUntil()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:135 +0x10d
  k8s.io/apimachinery/pkg/util/wait.Until()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:92 +0x78
  k8s.io/client-go/tools/cache.(*processorListener).run()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache/shared_informer.go:969 +0x18
  k8s.io/client-go/tools/cache.(*processorListener).run-fm()
      <autogenerated>:1 +0x39
  k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:75 +0x73

Previous read at 0x00c000baafc0 by goroutine 25981:
  k8s.io/kubernetes/pkg/scheduler/internal/queue.(*nominator).add()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/internal/queue/scheduling_queue.go:1024 +0xda4
  k8s.io/kubernetes/pkg/scheduler/internal/queue.(*nominator).AddNominatedPod()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/internal/queue/scheduling_queue.go:841 +0x58
  k8s.io/kubernetes/pkg/scheduler/internal/queue.(*PriorityQueue).AddNominatedPod()
      <autogenerated>:1 +0x63
  k8s.io/kubernetes/pkg/scheduler.(*Scheduler).handleSchedulingFailure()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/schedule_one.go:916 +0x1a34
  k8s.io/kubernetes/pkg/scheduler.(*Scheduler).handleSchedulingFailure-fm()
      <autogenerated>:1 +0xcc
  k8s.io/kubernetes/pkg/scheduler.(*Scheduler).scheduleOne()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/schedule_one.go:97 +0x962
  k8s.io/kubernetes/pkg/scheduler.(*Scheduler).scheduleOne-fm()
      <autogenerated>:1 +0x4d
  k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:190 +0x4c
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:157 +0x48
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:158 +0xce
  k8s.io/apimachinery/pkg/util/wait.JitterUntil()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:135 +0x10d
  k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:190 +0xaa
  k8s.io/apimachinery/pkg/util/wait.UntilWithContext()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:101 +0x52
  k8s.io/kubernetes/pkg/scheduler.(*Scheduler).Run.func1()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/scheduler.go:357 +0x5a

Which issue(s) this PR fixes:

Fixes #115762

Special notes for your reviewer:

This contains one "govet" suppression because of google/go-cmp#310 (comment)

Does this PR introduce a user-facing change?

scheduler: fixed a data race

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 14, 2023
@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 14, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Feb 14, 2023
@Huang-Wei
Member

@pohly is there a PR where your local changes are ready, so that I can take a look and reproduce this issue?

I'm a bit skeptical that DynamicResourceAllocation is unique in triggering this issue, and I'd like to understand why before jumping to a decision on whether or not to use atomic.Pointer.

@pohly
Contributor Author

pohly commented Feb 15, 2023

You can reproduce with my dra-integration-tests branch (see https://github.com/kubernetes/kubernetes/compare/master...pohly:dra-integration-tests?expand=1). I've cleaned it up so that general enhancements and fixes come first, then at the end comes the actual dynamic resource allocation testing. I want to get some of those changes merged separately before I create a PR. I also need to get a new klog release done and updated in Kubernetes.

As far as I can tell, none of the existing test cases involve the handling of pod scheduling failures. Therefore they don't hit this particular race (handleSchedulingFailure and AddNominatedPod are never called). So it's not dynamic resource allocation per se that causes this; it just exercises additional code paths.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 16, 2023
@k8s-ci-robot k8s-ci-robot added area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Mar 7, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: pohly
Once this PR has been reviewed and has the lgtm label, please assign alculquicondor for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@pohly
Contributor Author

pohly commented Mar 7, 2023

/retest

Unrelated test flake (TestChangeCRD).

@pohly pohly changed the title WIP: scheduler: protect concurrent access to PodInfo.Pod scheduler: protect concurrent access to PodInfo.Pod Mar 7, 2023
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 7, 2023
@Huang-Wei
Member

I need some time to check the root cause and think more about the current version of the PR, as it touches a lot of places.

This showed up in a scheduler_perf test case involving dynamic resource allocation because pod scheduling fails there, which (for the first time in those benchmarks?) exercises additional code paths.

The perf integration test suite PreemptionBasic should have hit the code paths of concurrent pod update + failure handling.

cc @ahg-g @alculquicondor

@Huang-Wei
Member

By looking at the race log:

WARNING: DATA RACE
Write at 0x00c000baafc0 by goroutine 25750:
  k8s.io/kubernetes/pkg/scheduler/framework.(*PodInfo).Update()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/framework/types.go:216 +0xd2e
  k8s.io/kubernetes/pkg/scheduler/internal/queue.updatePod()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/internal/queue/scheduling_queue.go:903 +0x5a4
  k8s.io/kubernetes/pkg/scheduler/internal/queue.(*PriorityQueue).Update()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/internal/queue/scheduling_queue.go:620 +0x582
  k8s.io/kubernetes/pkg/scheduler.(*Scheduler).updatePodInSchedulingQueue()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/eventhandlers.go:140 +0x359
  k8s.io/kubernetes/pkg/scheduler.(*Scheduler).updatePodInSchedulingQueue-fm()

It seems it comes from one of the updatePod() calls inside:

func (p *PriorityQueue) Update(oldPod, newPod *v1.Pod) error {

@pohly could you elaborate how this update is triggered in the DRA integration test, and which fields are updated?

@pohly
Contributor Author

pohly commented Mar 8, 2023

I'm not hitting the race with the current code from #116207 (= DRA testcases, no optimizations). I do see it with https://github.com/kubernetes/kubernetes/compare/master...pohly:dra-integration-tests?expand=1. Without those optimizations the timing is vastly different, so perhaps that is hiding the race?

It could of course also mean that some of the additional changes cause it.

I can't say for sure which pod update is the one that triggers the race. When I add fmt.Printf("pod %s updated:\n%s\n", pInfo.Pod.Name, cmp.Diff(pInfo.Pod, newPod)) to updatePod, I get diffs like this:

pod test-dra-169 updated:
  &v1.Pod{
  	TypeMeta: {},
  	ObjectMeta: v1.ObjectMeta{
  		... // 3 identical fields
  		SelfLink:          "",
  		UID:               "2808f4d3-02aa-46f3-a2ab-6cf8824deb42",
- 		ResourceVersion:   "11507",
+ 		ResourceVersion:   "14120",
  		Generation:        0,
  		CreationTimestamp: {Time: s"2023-03-08 15:04:41 +0100 CET"},
  		... // 4 identical fields
  		OwnerReferences: nil,
  		Finalizers:      nil,
  		ManagedFields: []v1.ManagedFieldsEntry{
  			{Manager: "scheduler_perf.test", Operation: "Update", APIVersion: "v1", Time: s"2023-03-08 15:04:41 +0100 CET", ...},
+ 			{
+ 				Manager:     "scheduler_perf.test",
+ 				Operation:   "Update",
+ 				APIVersion:  "v1",
+ 				Time:        s"2023-03-08 15:04:50 +0100 CET",
+ 				FieldsType:  "FieldsV1",
+ 				FieldsV1:    s`{"f:status":{"f:conditions":{".":{},"k:{\"type\":\"PodScheduled\"}":{".":{},"f:lastProbeTime":{},"f:lastTransitionTime":{},"f:me`...,
+ 				Subresource: "status",
+ 			},
  		},
  	},
  	Spec: {Containers: {{Name: "pause", Image: "registry.k8s.io/pause:3.9", Resources: {Claims: {{Name: "resource"}}}, TerminationMessagePath: "/dev/termination-log", ...}}, RestartPolicy: "Always", TerminationGracePeriodSeconds: &30, DNSPolicy: "ClusterFirst", ...},
  	Status: v1.PodStatus{
  		Phase:      "Pending",
- 		Conditions: nil,
+ 		Conditions: []v1.PodCondition{
+ 			{
+ 				Type:               "PodScheduled",
+ 				Status:             "False",
+ 				LastTransitionTime: s"2023-03-08 15:04:50 +0100 CET",
+ 				Reason:             "SchedulerError",
+ 				Message:            `running Reserve plugin "DynamicResources": waiting for resource `...,
+ 			},
+ 		},
  		Message: "",
  		Reason:  "",
  		... // 10 identical fields
  	},
  }

@pohly
Contributor Author

pohly commented Mar 8, 2023

I also get a different race report for gRPC logging.

Once #116207 is merged, I intend to investigate how some small-scale test scenarios can be run as a unit test. Then normal unit testing would run with race detection, and we would have a chance to detect problems like the one I already fixed without having to check manually.
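
For reference, such unit tests would then be exercised with the race detector via a plain go test invocation, for example (package path chosen for illustration):

go test -race ./pkg/scheduler/internal/queue/...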

@pohly
Contributor Author

pohly commented Mar 8, 2023

I was able to trigger the same race also with PreemptionBasic and current master (= 0e9ad24). I must admit that I am a bit relieved that it really is not specific to DRA and also occurs without any of my upcoming changes 😅

Not sure whether it matters, but I had added the following:

diff --git a/pkg/scheduler/internal/queue/scheduling_queue.go b/pkg/scheduler/internal/queue/scheduling_queue.go
index 1f639b5c421..806d78f0a68 100644
--- a/pkg/scheduler/internal/queue/scheduling_queue.go
+++ b/pkg/scheduler/internal/queue/scheduling_queue.go
@@ -33,6 +33,7 @@ import (
        "sync"
        "time"
 
+       "github.com/google/go-cmp/cmp"
        v1 "k8s.io/api/core/v1"
        "k8s.io/apimachinery/pkg/labels"
        "k8s.io/apimachinery/pkg/types"
@@ -897,6 +898,7 @@ func (p *PriorityQueue) calculateBackoffDuration(podInfo *framework.QueuedPodInf
 
 func updatePod(oldPodInfo interface{}, newPod *v1.Pod) *framework.QueuedPodInfo {
        pInfo := oldPodInfo.(*framework.QueuedPodInfo)
+       fmt.Printf("pod %s updated:\n%s\n", pInfo.Pod.Name, cmp.Diff(pInfo.Pod, newPod))
        pInfo.Update(newPod)
        return pInfo
 }
$ go test -v -timeout=0 -bench=.*/PreemptionBasic -race -short=true -benchtime=1ns ./test/integration/scheduler_perf
...
pod pod-6t9mj updated:
  &v1.Pod{
        TypeMeta: {},
        ObjectMeta: v1.ObjectMeta{
                ... // 3 identical fields
                SelfLink:          "",
                UID:               "02e593f8-903b-49ea-b19a-c7f2a63d5a46",
-               ResourceVersion:   "6685",
+               ResourceVersion:   "7276",
                Generation:        0,
                CreationTimestamp: {Time: s"2023-03-08 17:53:17 +0100 CET"},
                ... // 4 identical fields
                OwnerReferences: nil,
                Finalizers:      nil,
                ManagedFields: []v1.ManagedFieldsEntry{
                        {Manager: "scheduler_perf.test", Operation: "Update", APIVersion: "v1", Time: s"2023-03-08 17:53:17 +0100 CET", ...},
+                       {
+                               Manager:     "scheduler_perf.test",
+                               Operation:   "Update",
+                               APIVersion:  "v1",
+                               Time:        s"2023-03-08 17:53:17 +0100 CET",
+                               FieldsType:  "FieldsV1",
+                               FieldsV1:    s`{"f:status":{"f:conditions":{".":{},"k:{\"type\":\"PodScheduled\"}":{".":{},"f:lastProbeTime":{},"f:lastTransitionTime":{},"f:
me`...,
+                               Subresource: "status",
+                       },
                },
        },
        Spec: {Containers: {{Name: "pause", Image: "registry.k8s.io/pause:3.9", Ports: {{ContainerPort: 80, Protocol: "TCP"}}, Resources: {Limits: {s"cpu": {i: {...}, s: "3"
, Format: "DecimalSI"}, s"memory": {i: {...}, s: "500Mi", Format: "BinarySI"}}, Requests: {s"cpu": {i: {...}, s: "3", Format: "DecimalSI"}, s"memory": {i: {...}, s: "500Mi",
 Format: "BinarySI"}}}, ...}}, RestartPolicy: "Always", TerminationGracePeriodSeconds: &30, DNSPolicy: "ClusterFirst", ...},
        Status: v1.PodStatus{
                Phase:      "Pending",
-               Conditions: nil,
+               Conditions: []v1.PodCondition{
+                       {
+                               Type:               "PodScheduled",
+                               Status:             "False",
+                               LastTransitionTime: s"2023-03-08 17:53:17 +0100 CET",
+                               Reason:             "Unschedulable",
+                               Message:            "0/500 nodes are available: 500 Insufficient cpu.",
+                       },
+               },
                Message:           "",
                Reason:            "",
-               NominatedNodeName: "",
+               NominatedNodeName: "node-0-8vql9",
                HostIP:            "",
                PodIP:             "",
                ... // 7 identical fields
        },
  }

==================
WARNING: DATA RACE
Write at 0x00c00012db90 by goroutine 4837:
  k8s.io/kubernetes/pkg/scheduler/framework.(*PodInfo).Update()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/framework/types.go:155 +0xd2e
  k8s.io/kubernetes/pkg/scheduler/internal/queue.updatePod()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/internal/queue/scheduling_queue.go:902 +0x1a6
  k8s.io/kubernetes/pkg/scheduler/internal/queue.(*PriorityQueue).Update()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/internal/queue/scheduling_queue.go:620 +0xfd6
  k8s.io/kubernetes/pkg/scheduler.(*Scheduler).updatePodInSchedulingQueue()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/eventhandlers.go:140 +0x359
  k8s.io/kubernetes/pkg/scheduler.(*Scheduler).updatePodInSchedulingQueue-fm()
      <autogenerated>:1 +0x6d
  k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache/controller.go:246 +0x8b
  k8s.io/client-go/tools/cache.(*ResourceEventHandlerFuncs).OnUpdate()
      <autogenerated>:1 +0x29
  k8s.io/client-go/tools/cache.FilteringResourceEventHandler.OnUpdate()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache/controller.go:311 +0xf7
  k8s.io/client-go/tools/cache.(*FilteringResourceEventHandler).OnUpdate()
      <autogenerated>:1 +0x8d
  k8s.io/client-go/tools/cache.(*processorListener).run.func1()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache/shared_informer.go:973 +0x21e
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:157 +0x48
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:158 +0xce
  k8s.io/apimachinery/pkg/util/wait.JitterUntil()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:135 +0x10d
  k8s.io/apimachinery/pkg/util/wait.Until()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:92 +0x78
  k8s.io/client-go/tools/cache.(*processorListener).run()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache/shared_informer.go:969 +0x18
  k8s.io/client-go/tools/cache.(*processorListener).run-fm()
      <autogenerated>:1 +0x39
  k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:75 +0x73

Previous read at 0x00c00012db90 by goroutine 173610:
  k8s.io/kubernetes/pkg/scheduler/framework.(*PodInfo).DeepCopy()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/framework/types.go:141 +0x38
  k8s.io/kubernetes/pkg/scheduler/internal/queue.(*nominator).NominatedPodsForNode()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/internal/queue/scheduling_queue.go:851 +0x21a
  k8s.io/kubernetes/pkg/scheduler/framework/runtime.(*frameworkImpl).NominatedPodsForNode()
      <autogenerated>:1 +0x7d
  k8s.io/kubernetes/pkg/scheduler/framework/runtime.addNominatedPods()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/framework/runtime/framework.go:871 +0x105
  k8s.io/kubernetes/pkg/scheduler/framework/runtime.(*frameworkImpl).RunFilterPluginsWithNominatedPods()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/framework/runtime/framework.go:846 +0x174
  k8s.io/kubernetes/pkg/scheduler/framework/plugins/defaultpreemption.(*DefaultPreemption).SelectVictimsOnNode()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go:188 +0x6cc
  k8s.io/kubernetes/pkg/scheduler/framework/preemption.(*Evaluator).DryRunPreemption.func1()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/framework/preemption/preemption.go:568 +0x206
  k8s.io/kubernetes/pkg/scheduler/framework/parallelize.Parallelizer.Until.func1()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/framework/parallelize/parallelism.go:60 +0x1af
  k8s.io/client-go/util/workqueue.ParallelizeUntil.func1()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/client-go/util/workqueue/parallelizer.go:90 +0x1c1

Goroutine 4837 (running) created at:
  k8s.io/apimachinery/pkg/util/wait.(*Group).Start()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:73 +0xe4
  k8s.io/client-go/tools/cache.(*sharedProcessor).run.func1()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache/shared_informer.go:799 +0x1e7
  k8s.io/client-go/tools/cache.(*sharedProcessor).run()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache/shared_informer.go:803 +0x51
  k8s.io/client-go/tools/cache.(*sharedProcessor).run-fm()
      <autogenerated>:1 +0x44
  k8s.io/apimachinery/pkg/util/wait.(*Group).StartWithChannel.func1()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:58 +0x3e
  k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:75 +0x73

Goroutine 173610 (finished) created at:
  k8s.io/client-go/util/workqueue.ParallelizeUntil()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/client-go/util/workqueue/parallelizer.go:76 +0x30f
  k8s.io/kubernetes/pkg/scheduler/framework/parallelize.Parallelizer.Until()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/framework/parallelize/parallelism.go:63 +0x1b6
  k8s.io/kubernetes/pkg/scheduler/framework/preemption.(*Evaluator).DryRunPreemption()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/framework/preemption/preemption.go:599 +0x755
  k8s.io/kubernetes/pkg/scheduler/framework/preemption.(*Evaluator).findCandidates()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/framework/preemption/preemption.go:244 +0x6ed
  k8s.io/kubernetes/pkg/scheduler/framework/preemption.(*Evaluator).Preempt()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/framework/preemption/preemption.go:171 +0x4d9
  k8s.io/kubernetes/pkg/scheduler/framework/plugins/defaultpreemption.(*DefaultPreemption).PostFilter()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go:99 +0x244
  k8s.io/kubernetes/pkg/scheduler/framework/runtime.(*frameworkImpl).runPostFilterPlugin()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/framework/runtime/framework.go:801 +0xd1
  k8s.io/kubernetes/pkg/scheduler/framework/runtime.(*frameworkImpl).RunPostFilterPlugins()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/framework/runtime/framework.go:776 +0x42f
  k8s.io/kubernetes/pkg/scheduler.(*Scheduler).schedulingCycle()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/schedule_one.go:154 +0x2a6
  k8s.io/kubernetes/pkg/scheduler.(*Scheduler).scheduleOne()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/schedule_one.go:95 +0x544
  k8s.io/kubernetes/pkg/scheduler.(*Scheduler).scheduleOne-fm()
      <autogenerated>:1 +0x4d
  k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:190 +0x4c
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:157 +0x48
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:158 +0xce
  k8s.io/apimachinery/pkg/util/wait.JitterUntil()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:135 +0x10d
  k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:190 +0xaa
  k8s.io/apimachinery/pkg/util/wait.UntilWithContext()
      /nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:101 +0x52
  k8s.io/kubernetes/pkg/scheduler.(*Scheduler).Run.func1()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/scheduler.go:358 +0x5a
==================

@alculquicondor
Member

I agree that there is a bug:
At any point a Pod could be updated, and we are trying to update the fields in QueuedPodInfo in place.

But I disagree with the solution. Instead, we should instantiate a new QueuedPodInfo and put that one in the scheduling queue. We are not saving many allocations if we are calling DeepCopy anyway.
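
A rough sketch of that copy-instead-of-mutate idea, using simplified stand-in types rather than the real PriorityQueue and framework.QueuedPodInfo (illustrative only, not the change that eventually landed):

package main

import (
	"fmt"
	"sync"
)

// Simplified stand-ins for v1.Pod and framework.QueuedPodInfo.
type pod struct {
	Name            string
	ResourceVersion string
}

type queuedPodInfo struct {
	pod      *pod
	attempts int // example of scheduling bookkeeping carried along
}

// priorityQueue is a toy version of the scheduler's queue: entries guarded by
// a single lock, as in PriorityQueue.Update.
type priorityQueue struct {
	lock  sync.Mutex
	items map[string]*queuedPodInfo
}

// Update builds a fresh entry and swaps it in instead of mutating the old one,
// so readers that already hold the old entry never see a half-written update.
func (p *priorityQueue) Update(newPod *pod) {
	p.lock.Lock()
	defer p.lock.Unlock()
	fresh := &queuedPodInfo{pod: newPod}
	if old := p.items[newPod.Name]; old != nil {
		fresh.attempts = old.attempts // carry over bookkeeping
	}
	p.items[newPod.Name] = fresh
}

func main() {
	q := &priorityQueue{items: map[string]*queuedPodInfo{
		"pod-a": {pod: &pod{Name: "pod-a", ResourceVersion: "1"}, attempts: 2},
	}}
	stale := q.items["pod-a"] // e.g. an entry the nominator still references
	q.Update(&pod{Name: "pod-a", ResourceVersion: "2"})
	fmt.Println(stale.pod.ResourceVersion)            // still "1": never mutated
	fmt.Println(q.items["pod-a"].pod.ResourceVersion) // "2"
}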

@alculquicondor
Member

The current approach is going to hide race conditions in other fields of PodInfo.

@alculquicondor
Member

alculquicondor commented Mar 8, 2023

Or you could protect QueuedPodInfo.PodInfo with an atomic pointer, instead of Pod itself. Probably worth having a method in PodInfo to get the pod, instead of calling Load everywhere.
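
A minimal sketch of that alternative, again with hypothetical stand-in types and an accessor method so callers don't spell out Load everywhere (not the actual API):

package main

import (
	"fmt"
	"sync/atomic"
)

// Simplified stand-ins for v1.Pod and framework.PodInfo.
type pod struct {
	Name            string
	ResourceVersion string
}

type podInfo struct {
	Pod *pod
	// Other derived fields would live here too, so swapping the whole PodInfo
	// covers them in one atomic step.
}

// queuedPodInfo keeps the PodInfo behind an atomic pointer: informer callbacks
// Store a replacement, scheduling code Loads a consistent snapshot.
type queuedPodInfo struct {
	info atomic.Pointer[podInfo]
}

// PodInfo is the accessor method suggested above.
func (q *queuedPodInfo) PodInfo() *podInfo { return q.info.Load() }

// SetPodInfo replaces the snapshot; the previous one stays valid for readers.
func (q *queuedPodInfo) SetPodInfo(pi *podInfo) { q.info.Store(pi) }

func main() {
	q := &queuedPodInfo{}
	q.SetPodInfo(&podInfo{Pod: &pod{Name: "pod-a", ResourceVersion: "1"}})
	snapshot := q.PodInfo() // consistent view even if SetPodInfo runs concurrently
	q.SetPodInfo(&podInfo{Pod: &pod{Name: "pod-a", ResourceVersion: "2"}})
	fmt.Println(snapshot.Pod.ResourceVersion, q.PodInfo().Pod.ResourceVersion)
}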

@alculquicondor
Member

I fear that this problem might have been introduced when we first added QueuedPodInfo. So we will need to cherry-pick a solution as far back as possible. cc @ahg-g for thoughts

@pohly let me know if you need a hand.

@pohly
Contributor Author

pohly commented Mar 8, 2023

The current approach is going to hide race conditions in other fields of PodInfo.

Why that? If other fields suffer from the same problem, they will show up as a data race the same way as updating Pod did.

But I disagree with the solution. Instead, we should instantiate a new QueuedPodInfo and put that one in the scheduling queue. We are not saving many allocations if we are calling DeepCopy anyway.

A new PodInfo or QueuedPodInfo?

But either way I am not sure whether that will fix the problem completely. The PodInfo instance being updated by updatePod is going to be read by someone, otherwise why update it? Without updatePod taking some kind of lock during its write, that other read is always going to race with it.

let me know if you need a hand.

I do not fully understand what a different solution would look like yet, so help would definitely be needed right now.

Regarding backporting the current atomic.Pointer change, there will be a snag: "go vet" in older Kubernetes releases will complain about the cmp.Diff transformer, and those older Kubernetes releases have no way of suppressing this.

@alculquicondor
Member

Why that? If other fields suffer from the same problem, they will show up as a data race the same way as updating Pod did.

It is possible that a race condition in other fields is harder to reproduce, but there is the same pattern: updatePod changes those fields, and they are read during scheduling.

A new PodInfo or QueuedPodInfo?

and

But either way I am not sure whether that will fix the problem completely.

That's why I suggest having a new QueuedPodInfo. We are already getting a lock for updating the scheduling queue. We are just not locking updates to things inside QueuedPodInfo.

func (p *PriorityQueue) Update(oldPod, newPod *v1.Pod) error {
	p.lock.Lock()
	defer p.lock.Unlock()

This also allows us to cherry-pick without relying on the generic pointer.

@pohly
Contributor Author

pohly commented Mar 8, 2023

When taking that lock, why is it necessary to handle PodInfo and QueuedPodInfo instances differently?

@pohly
Contributor Author

pohly commented Mar 8, 2023

It is possible that a race condition in other fields is harder to reproduce, but there is the same pattern: updatePod changes those fields, and they are read during scheduling.

So the argument was about the completeness of the fix, not whether it makes other similar problems harder to find. I agree, if such fields exist, then protecting PodInfo as a whole is necessary and better.

@pohly
Contributor Author

pohly commented Mar 8, 2023

Is the read path taking that lock already? This here:

Previous read at 0x00c00012db90 by goroutine 173610:
  k8s.io/kubernetes/pkg/scheduler/framework.(*PodInfo).DeepCopy()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/framework/types.go:141 +0x38
  k8s.io/kubernetes/pkg/scheduler/internal/queue.(*nominator).NominatedPodsForNode()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/internal/queue/scheduling_queue.go:851 +0x21a
  k8s.io/kubernetes/pkg/scheduler/framework/runtime.(*frameworkImpl).NominatedPodsForNode()
      <autogenerated>:1 +0x7d
  k8s.io/kubernetes/pkg/scheduler/framework/runtime.addNominatedPods()
      /nvme/gopath/src/k8s.io/kubernetes/pkg/scheduler/framework/runtime/framework.go:871 +0x105
...

@alculquicondor
Member

When taking that lock, why is it necessary to handle PodInfo and QueuedPodInfo instances differently?

I didn't understand the question.

Is the read path taking that lock already? This here:

Oh... the problem is that the nominator has its own lock, separate from the PriorityQueue's.

That's clearly a problem. It's unclear to me if this is the only problem.

@pohly
Contributor Author

pohly commented Mar 9, 2023

I was asking about the read path locking because that question then needs to be answered for all code which reads QueuedPodInfo. NominatedPodsForNode is where we have seen the race. What about the code which works on the QueuedPodInfo returned by SchedulingQueue.Pop? Will that code be handed a new, unshared copy, or is it going to be the same instance that Update writes into? Currently it is the same instance, with no locking.

I started protecting the pod field with atomic.Pointer because then (in most cases) the compiler will reject incorrect code. Introducing some additional rules like "access QueuedPodInfo while holding a mutex" will be harder to verify. It may be the right approach; I just wanted to call out the challenge associated with it.

@alculquicondor
Member

Those are valid concerns. I think we need to focus on a hotfix for now, especially given that we can't backport the usage of atomic.Pointer.

Pop is not a problem precisely because we are popping it out from the queue. So potential parallel updates while scheduling is happening would create a new object in the queue.

@pohly
Contributor Author

pohly commented Mar 9, 2023

#116395 fixes the same problem.

@pohly pohly closed this Mar 9, 2023
@pohly pohly mentioned this pull request Aug 1, 2023