🐛 Ensure that webhook server is thread/start-safe #1155

Conversation

DirectXMan12 (Contributor) commented Sep 9, 2020

This ensures that the webhook server is both threadsafe & "start-safe"
-- i.e. you can register webhooks after starting the server. While this
is generally not a common pattern, we allow runnables to be added to the
manager after start, so it makes sense to do the same with hooks & the
server.

Fixes #1148

/kind bug
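
For context, here is a minimal sketch (hypothetical names, not the PR's actual code) of what "start-safe" registration means: registration is mutex-guarded and routing is resolved per request, so hooks added after Start are still served.

package webhookdemo

import (
    "net/http"
    "sync"
)

// Server is a hypothetical, mux-backed webhook server used only to
// illustrate "register after start"; it is not controller-runtime's
// implementation.
type Server struct {
    mu  sync.Mutex
    mux *http.ServeMux
}

// Register is safe to call concurrently, before or after Start: the mutex
// guards lazy initialization, and http.ServeMux is itself safe for
// concurrent Handle/ServeHTTP use.
func (s *Server) Register(path string, h http.Handler) {
    s.mu.Lock()
    defer s.mu.Unlock()
    if s.mux == nil {
        s.mux = http.NewServeMux()
    }
    s.mux.Handle(path, h)
}

// Start serves the mux; hooks registered later are still picked up because
// the mux resolves handlers on every request.
func (s *Server) Start(addr string) error {
    s.mu.Lock()
    if s.mux == nil {
        s.mux = http.NewServeMux()
    }
    mux := s.mux
    s.mu.Unlock()
    return http.ListenAndServe(addr, mux)
}

A caller can run Start in a goroutine and keep calling Register afterwards, which is the pattern the manager's late-added runnables rely on.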

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Sep 9, 2020
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 9, 2020
DirectXMan12 (Contributor, Author)

/retest

adrianludwin

Hi, it looks like our workaround to this problem was unsuccessful. Do you have any idea when this might make it into the next release of controller-runtime?

Thanks, A

DirectXMan12 (Contributor, Author) commented Sep 18, 2020

Hmm... looks like there's a theoretical race in event broadcaster shutdown due to the event broadcaster calling broadcaster.Action in a goroutine, potentially leading to a send on a closed channel. I filed an issue, but I don't think this is caused by this PR. Upstream issue is kubernetes/kubernetes#94906.

race detector output
==================
WARNING: DATA RACE
Write at 0x00c0001762b0 by goroutine 171:
  runtime.closechan()
      /usr/local/go/src/runtime/chan.go:334 +0x0
  k8s.io/client-go/tools/record.(*eventBroadcasterImpl).Shutdown()
      /home/prow/go/pkg/mod/k8s.io/apimachinery@v0.19.0/pkg/watch/mux.go:199 +0x68
  sigs.k8s.io/controller-runtime/pkg/internal/recorder.(*Provider).Stop.func1()
      /home/prow/go/src/sigs.k8s.io/controller-runtime/pkg/internal/recorder/recorder.go:73 +0xa0
Previous read at 0x00c0001762b0 by goroutine 168:
  runtime.chansend()
      /usr/local/go/src/runtime/chan.go:142 +0x0
  k8s.io/client-go/tools/record.(*recorderImpl).generateEvent.func1()
      /home/prow/go/pkg/mod/k8s.io/apimachinery@v0.19.0/pkg/watch/mux.go:189 +0x11b
Goroutine 171 (running) created at:
  sigs.k8s.io/controller-runtime/pkg/internal/recorder.(*Provider).Stop()
      /home/prow/go/src/sigs.k8s.io/controller-runtime/pkg/internal/recorder/recorder.go:66 +0x8c
  sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure()
      /home/prow/go/src/sigs.k8s.io/controller-runtime/pkg/manager/internal.go:563 +0x38d
  sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).Start.func1()
      /home/prow/go/src/sigs.k8s.io/controller-runtime/pkg/manager/internal.go:468 +0x49
  sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).Start()
      /home/prow/go/src/sigs.k8s.io/controller-runtime/pkg/manager/internal.go:518 +0x339
  sigs.k8s.io/controller-runtime/pkg/manager.glob..func3.3.8.3.6()
      /home/prow/go/src/sigs.k8s.io/controller-runtime/pkg/manager/manager_test.go:273 +0x22b
Goroutine 168 (finished) created at:
  k8s.io/client-go/tools/record.(*recorderImpl).generateEvent()
      /home/prow/go/pkg/mod/k8s.io/client-go@v0.19.0/tools/record/event.go:341 +0x42a
  k8s.io/client-go/tools/record.(*recorderImpl).Event()
      /home/prow/go/pkg/mod/k8s.io/client-go@v0.19.0/tools/record/event.go:349 +0xcf
  k8s.io/client-go/tools/record.(*recorderImpl).Eventf()
      /home/prow/go/pkg/mod/k8s.io/client-go@v0.19.0/tools/record/event.go:353 +0xd7
  sigs.k8s.io/controller-runtime/pkg/internal/recorder.(*lazyRecorder).Eventf()
      /home/prow/go/src/sigs.k8s.io/controller-runtime/pkg/internal/recorder/recorder.go:151 +0xf3
  k8s.io/client-go/tools/leaderelection/resourcelock.(*LeaseLock).RecordEvent()
      /home/prow/go/pkg/mod/k8s.io/client-go@v0.19.0/tools/leaderelection/resourcelock/leaselock.go:85 +0x30b
  k8s.io/client-go/tools/leaderelection/resourcelock.(*MultiLock).RecordEvent()
      /home/prow/go/pkg/mod/k8s.io/client-go@v0.19.0/tools/leaderelection/resourcelock/multilock.go:88 +0xa5
  k8s.io/client-go/tools/leaderelection.(*LeaderElector).acquire.func1()
      /home/prow/go/pkg/mod/k8s.io/client-go@v0.19.0/tools/leaderelection/leaderelection.go:251 +0x213
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1()
      /home/prow/go/pkg/mod/k8s.io/apimachinery@v0.19.0/pkg/util/wait/wait.go:155 +0x6f
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil()
      /home/prow/go/pkg/mod/k8s.io/apimachinery@v0.19.0/pkg/util/wait/wait.go:156 +0xb3
  k8s.io/apimachinery/pkg/util/wait.JitterUntil()
      /home/prow/go/pkg/mod/k8s.io/apimachinery@v0.19.0/pkg/util/wait/wait.go:133 +0x10d
  k8s.io/client-go/tools/leaderelection.(*LeaderElector).acquire()
      /home/prow/go/pkg/mod/k8s.io/client-go@v0.19.0/tools/leaderelection/leaderelection.go:244 +0x2b2
  k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run()
      /home/prow/go/pkg/mod/k8s.io/client-go@v0.19.0/tools/leaderelection/leaderelection.go:203 +0xee
==================
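
For illustration, a hypothetical Go program (not client-go's broadcaster code) that reproduces the same class of race: an asynchronous send racing a close during shutdown. Run it with -race to see the detector flag the close/send pair; without -race it can panic with "send on closed channel".

package main

import "time"

type broadcaster struct {
    incoming chan string
}

// Action mimics emitting an event asynchronously, as the recorder does.
func (b *broadcaster) Action(msg string) {
    go func() {
        b.incoming <- msg // races with Shutdown's close
    }()
}

// Shutdown closes the channel without waiting for in-flight sends.
func (b *broadcaster) Shutdown() {
    close(b.incoming)
}

func main() {
    b := &broadcaster{incoming: make(chan string, 8)}
    go func() {
        for range b.incoming { // drain until closed
        }
    }()
    b.Action("leader election event")
    b.Shutdown() // may run before the Action goroutine has sent
    time.Sleep(100 * time.Millisecond)
}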

DirectXMan12 (Contributor, Author)

/retest

adrianludwin added a commit to adrianludwin/multi-tenancy that referenced this pull request Sep 21, 2020
This essentially reverts kubernetes-retired#1087, which breaks HNC on new clusters that
haven't previously had HNC installed. It fixes the nondeterministic
crashing problem by patching in
kubernetes-sigs/controller-runtime#1155, which
has been applied to controller-runtime 0.6.3 in @adrianludwin's repo.
This is a temporary hack and will be removed when controller-runtime
releases its own fix - likely 0.6.4.

Tested: with the reversion of kubernetes-retired#1087 (main.go), HNC can be installed on a
fresh cluster again but fails to start up ~50% of the time. With the fix
to controller-runtime, it passes on 20/20 startup attempts. Ran e2e
tests and got the same result as without this change (four failures).
@DirectXMan12 DirectXMan12 requested review from mengqiy and vincepri and removed request for droot September 22, 2020 00:29
mengqiy (Member) commented Sep 22, 2020

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 22, 2020
k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: DirectXMan12, mengqiy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [DirectXMan12,mengqiy]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 5757a38 into kubernetes-sigs:master Sep 22, 2020
Comment on lines +149 to +150
// PrepWithoutInstalling does the setup parts of Install (populating host-port,
// setting up CAs, etc), without actually truing to do anything with webhook

Member:

Suggested change:
- // PrepWithoutInstalling does the setup parts of Install (populating host-port,
- // setting up CAs, etc), without actually truing to do anything with webhook
+ // PrepWithoutInstalling does the setup parts of Install (populating host-port,
+ // setting up CAs, etc), without actually trying to do anything with webhook

// definitions. This is largely useful for internal testing of
// controller-runtime, where we need a random host-port & caData for webhook
// tests, but may be useful in similar scenarios.
func (o *WebhookInstallOptions) PrepWithoutInstalling() error {

Member:

Could we just call this Prepare?

Comment on lines +356 to +381
server, wasNew := func() (*webhook.Server, bool) {
    cm.mu.Lock()
    defer cm.mu.Unlock()

    if cm.webhookServer != nil {
        return cm.webhookServer, false
    }

    cm.webhookServer = &webhook.Server{
        Port:    cm.port,
        Host:    cm.host,
        CertDir: cm.certDir,
    }
    return cm.webhookServer, true
}()

// only add the server if *we ourselves* just registered it.
// Add has its own lock, so just do this separately -- there shouldn't
// be a "race" in this lock gap because the condition is the population
// of cm.webhookServer, not anything to do with Add.
if wasNew {
    if err := cm.Add(server); err != nil {
        panic("unable to add webhook server to the controller manager")
    }
}
return server

Member:

Unless I missed something, we should be able to simplify this method. Why do we need an inner function that's called right away?


DirectXMan12 (Contributor, Author):

The IIFE sets the scope for the deferred unlock call; without it, the multiple return paths would make the unlocking awkward.
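
For illustration, a generic sketch of that idiom with hypothetical types (not the PR's code): the function literal bounds the deferred Unlock, so the lock is released despite the multiple return paths, and the follow-up work runs outside the critical section.

package main

import "sync"

type thing struct{}

type cache struct {
    mu sync.Mutex
    t  *thing
}

func (c *cache) register(t *thing) {} // stand-in for cm.Add

func (c *cache) fetchOrInit() *thing {
    t, created := func() (*thing, bool) {
        c.mu.Lock()
        defer c.mu.Unlock() // runs as soon as the literal returns
        if c.t != nil {
            return c.t, false
        }
        c.t = &thing{}
        return c.t, true
    }()

    if created {
        c.register(t) // the lock is already released here
    }
    return t
}

func main() {
    c := &cache{}
    _ = c.fetchOrInit()
}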

Comment on lines +80 to +81
// mu protects access to the webhook map & setFields for Start, Register, etc
mu sync.Mutex

Member:

Nit: Define a mutex above the "protected" fields, so code becomes documentation
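
For illustration, the layout being asked for might look like the sketch below (field types are assumptions; only the grouping matters): the mutex sits immediately above the fields it guards, so the declaration order documents the locking.

package webhook

import (
    "net/http"
    "sync"
)

// Server sketch: mu directly precedes the fields it protects.
type Server struct {
    // ...unguarded configuration fields (Port, Host, CertDir, ...)...

    // mu protects the fields below for Start, Register, etc.
    mu          sync.Mutex
    webhooks    map[string]http.Handler
    setFieldsFn func(i interface{}) error
}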

Successfully merging this pull request may close these issues.

Intermittent yet consistent nil pointer issue in /webhook/admission/http.go:84