🐛 Ensure that webhook server is thread/start-safe #1155

Conversation

DirectXMan12 (Contributor) commented Sep 9, 2020

This ensures that the webhook server is both threadsafe & "start-safe"
-- i.e. you can register webhooks after starting the server. While this
is generally not a common pattern, we allow runnables to be added to the
manager after start, so it makes sense to do the same with hooks & the
server.

Fixes #1148

/kind bug
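
For context, here is a minimal sketch (hypothetical names, not the PR's actual code) of what "start-safe" registration means: registration is mutex-guarded and routing is resolved per request, so hooks added after Start are still served.

package webhookdemo

import (
    "net/http"
    "sync"
)

// Server is a hypothetical, mux-backed webhook server used only to
// illustrate "register after start"; it is not controller-runtime's
// implementation.
type Server struct {
    mu  sync.Mutex
    mux *http.ServeMux
}

// Register is safe to call concurrently, before or after Start: the mutex
// guards lazy initialization, and http.ServeMux is itself safe for
// concurrent Handle/ServeHTTP use.
func (s *Server) Register(path string, h http.Handler) {
    s.mu.Lock()
    defer s.mu.Unlock()
    if s.mux == nil {
        s.mux = http.NewServeMux()
    }
    s.mux.Handle(path, h)
}

// Start serves the mux; hooks registered later are still picked up because
// the mux resolves handlers on every request.
func (s *Server) Start(addr string) error {
    s.mu.Lock()
    if s.mux == nil {
        s.mux = http.NewServeMux()
    }
    mux := s.mux
    s.mu.Unlock()
    return http.ListenAndServe(addr, mux)
}

A caller can run Start in a goroutine and keep calling Register afterwards, which is the pattern the manager's late-added runnables rely on.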

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Sep 9, 2020
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 9, 2020
DirectXMan12 (Contributor, Author)

/retest

adrianludwin

Hi, it looks like our workaround to this problem was unsuccessful. Do you have any idea when this might make it into the next release of controller-runtime?

Thanks, A

DirectXMan12 (Contributor, Author) commented Sep 18, 2020

Hmm... looks like there's a theoretical race in event broadcaster shutdown due to the event broadcaster calling broadcaster.Action in a goroutine, potentially leading to a send on a closed channel. I filed an issue, but I don't think this is caused by this PR. Upstream issue is kubernetes/kubernetes#94906.

race detector output
==================
WARNING: DATA RACE
Write at 0x00c0001762b0 by goroutine 171:
  runtime.closechan()
      /usr/local/go/src/runtime/chan.go:334 +0x0
  k8s.io/client-go/tools/record.(*eventBroadcasterImpl).Shutdown()
      /home/prow/go/pkg/mod/k8s.io/apimachinery@v0.19.0/pkg/watch/mux.go:199 +0x68
  sigs.k8s.io/controller-runtime/pkg/internal/recorder.(*Provider).Stop.func1()
      /home/prow/go/src/sigs.k8s.io/controller-runtime/pkg/internal/recorder/recorder.go:73 +0xa0
Previous read at 0x00c0001762b0 by goroutine 168:
  runtime.chansend()
      /usr/local/go/src/runtime/chan.go:142 +0x0
  k8s.io/client-go/tools/record.(*recorderImpl).generateEvent.func1()
      /home/prow/go/pkg/mod/k8s.io/apimachinery@v0.19.0/pkg/watch/mux.go:189 +0x11b
Goroutine 171 (running) created at:
  sigs.k8s.io/controller-runtime/pkg/internal/recorder.(*Provider).Stop()
      /home/prow/go/src/sigs.k8s.io/controller-runtime/pkg/internal/recorder/recorder.go:66 +0x8c
  sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure()
      /home/prow/go/src/sigs.k8s.io/controller-runtime/pkg/manager/internal.go:563 +0x38d
  sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).Start.func1()
      /home/prow/go/src/sigs.k8s.io/controller-runtime/pkg/manager/internal.go:468 +0x49
  sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).Start()
      /home/prow/go/src/sigs.k8s.io/controller-runtime/pkg/manager/internal.go:518 +0x339
  sigs.k8s.io/controller-runtime/pkg/manager.glob..func3.3.8.3.6()
      /home/prow/go/src/sigs.k8s.io/controller-runtime/pkg/manager/manager_test.go:273 +0x22b
Goroutine 168 (finished) created at:
  k8s.io/client-go/tools/record.(*recorderImpl).generateEvent()
      /home/prow/go/pkg/mod/k8s.io/client-go@v0.19.0/tools/record/event.go:341 +0x42a
  k8s.io/client-go/tools/record.(*recorderImpl).Event()
      /home/prow/go/pkg/mod/k8s.io/client-go@v0.19.0/tools/record/event.go:349 +0xcf
  k8s.io/client-go/tools/record.(*recorderImpl).Eventf()
      /home/prow/go/pkg/mod/k8s.io/client-go@v0.19.0/tools/record/event.go:353 +0xd7
  sigs.k8s.io/controller-runtime/pkg/internal/recorder.(*lazyRecorder).Eventf()
      /home/prow/go/src/sigs.k8s.io/controller-runtime/pkg/internal/recorder/recorder.go:151 +0xf3
  k8s.io/client-go/tools/leaderelection/resourcelock.(*LeaseLock).RecordEvent()
      /home/prow/go/pkg/mod/k8s.io/client-go@v0.19.0/tools/leaderelection/resourcelock/leaselock.go:85 +0x30b
  k8s.io/client-go/tools/leaderelection/resourcelock.(*MultiLock).RecordEvent()
      /home/prow/go/pkg/mod/k8s.io/client-go@v0.19.0/tools/leaderelection/resourcelock/multilock.go:88 +0xa5
  k8s.io/client-go/tools/leaderelection.(*LeaderElector).acquire.func1()
      /home/prow/go/pkg/mod/k8s.io/client-go@v0.19.0/tools/leaderelection/leaderelection.go:251 +0x213
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1()
      /home/prow/go/pkg/mod/k8s.io/apimachinery@v0.19.0/pkg/util/wait/wait.go:155 +0x6f
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil()
      /home/prow/go/pkg/mod/k8s.io/apimachinery@v0.19.0/pkg/util/wait/wait.go:156 +0xb3
  k8s.io/apimachinery/pkg/util/wait.JitterUntil()
      /home/prow/go/pkg/mod/k8s.io/apimachinery@v0.19.0/pkg/util/wait/wait.go:133 +0x10d
  k8s.io/client-go/tools/leaderelection.(*LeaderElector).acquire()
      /home/prow/go/pkg/mod/k8s.io/client-go@v0.19.0/tools/leaderelection/leaderelection.go:244 +0x2b2
  k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run()
      /home/prow/go/pkg/mod/k8s.io/client-go@v0.19.0/tools/leaderelection/leaderelection.go:203 +0xee
==================
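
For illustration, a hypothetical Go program (not client-go's broadcaster code) that reproduces the same class of race: an asynchronous send racing a close during shutdown. Run it with -race to see the detector flag the close/send pair; without -race it can panic with "send on closed channel".

package main

import "time"

type broadcaster struct {
    incoming chan string
}

// Action mimics emitting an event asynchronously, as the recorder does.
func (b *broadcaster) Action(msg string) {
    go func() {
        b.incoming <- msg // races with Shutdown's close
    }()
}

// Shutdown closes the channel without waiting for in-flight sends.
func (b *broadcaster) Shutdown() {
    close(b.incoming)
}

func main() {
    b := &broadcaster{incoming: make(chan string, 8)}
    go func() {
        for range b.incoming { // drain until closed
        }
    }()
    b.Action("leader election event")
    b.Shutdown() // may run before the Action goroutine has sent
    time.Sleep(100 * time.Millisecond)
}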

DirectXMan12 (Contributor, Author)

/retest

adrianludwin added a commit to adrianludwin/multi-tenancy that referenced this pull request Sep 21, 2020
This essentially reverts kubernetes-retired#1087, which breaks HNC on new clusters that
haven't previously had HNC installed. It fixes the nondeterministic
crashing problem by patching in
kubernetes-sigs/controller-runtime#1155, which
has been applied to controller-runtime 0.6.3 in @adrianludwin's repo.
This is a temporary hack and will be removed when controller-runtime
releases its own fix - likely 0.6.4.

Tested: with the reversion of kubernetes-retired#1087 (main.go), HNC can be installed on a
fresh cluster again but fails to start up ~50% of the time. With the fix
to controller-runtime, it passes on 20/20 startup attempts. Ran e2e
tests and got the same result as without this change (four failures).
@DirectXMan12 DirectXMan12 requested review from mengqiy and vincepri and removed request for droot September 22, 2020 00:29
mengqiy (Member) commented Sep 22, 2020

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 22, 2020
k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: DirectXMan12, mengqiy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [DirectXMan12,mengqiy]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 5757a38 into kubernetes-sigs:master Sep 22, 2020
Comment on lines +149 to +150
// PrepWithoutInstalling does the setup parts of Install (populating host-port,
// setting up CAs, etc), without actually truing to do anything with webhook

Member:

Suggested change:
- // PrepWithoutInstalling does the setup parts of Install (populating host-port,
- // setting up CAs, etc), without actually truing to do anything with webhook
+ // PrepWithoutInstalling does the setup parts of Install (populating host-port,
+ // setting up CAs, etc), without actually trying to do anything with webhook

// definitions. This is largely useful for internal testing of
// controller-runtime, where we need a random host-port & caData for webhook
// tests, but may be useful in similar scenarios.
func (o *WebhookInstallOptions) PrepWithoutInstalling() error {

Member:

Could we just call this Prepare?

Comment on lines +356 to +381
server, wasNew := func() (*webhook.Server, bool) {
    cm.mu.Lock()
    defer cm.mu.Unlock()

    if cm.webhookServer != nil {
        return cm.webhookServer, false
    }

    cm.webhookServer = &webhook.Server{
        Port:    cm.port,
        Host:    cm.host,
        CertDir: cm.certDir,
    }
    return cm.webhookServer, true
}()

// only add the server if *we ourselves* just registered it.
// Add has its own lock, so just do this separately -- there shouldn't
// be a "race" in this lock gap because the condition is the population
// of cm.webhookServer, not anything to do with Add.
if wasNew {
    if err := cm.Add(server); err != nil {
        panic("unable to add webhook server to the controller manager")
    }
}
return server

Member:

Unless I missed something, we should be able to simplify this method. Why do we need an inner function that's called right away?


DirectXMan12 (Contributor, Author):

The IIFE sets the scope for the deferred unlock call; without it, the multiple return paths would make the unlocking awkward.
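
For illustration, a generic sketch of that idiom with hypothetical types (not the PR's code): the function literal bounds the deferred Unlock, so the lock is released despite the multiple return paths, and the follow-up work runs outside the critical section.

package main

import "sync"

type thing struct{}

type cache struct {
    mu sync.Mutex
    t  *thing
}

func (c *cache) register(t *thing) {} // stand-in for cm.Add

func (c *cache) fetchOrInit() *thing {
    t, created := func() (*thing, bool) {
        c.mu.Lock()
        defer c.mu.Unlock() // runs as soon as the literal returns
        if c.t != nil {
            return c.t, false
        }
        c.t = &thing{}
        return c.t, true
    }()

    if created {
        c.register(t) // the lock is already released here
    }
    return t
}

func main() {
    c := &cache{}
    _ = c.fetchOrInit()
}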

Comment on lines +80 to +81
// mu protects access to the webhook map & setFields for Start, Register, etc
mu sync.Mutex

Member:

Nit: Define a mutex above the "protected" fields, so code becomes documentation
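
For illustration, the layout being asked for might look like the sketch below (field types are assumptions; only the grouping matters): the mutex sits immediately above the fields it guards, so the declaration order documents the locking.

package webhook

import (
    "net/http"
    "sync"
)

// Server sketch: mu directly precedes the fields it protects.
type Server struct {
    // ...unguarded configuration fields (Port, Host, CertDir, ...)...

    // mu protects the fields below for Start, Register, etc.
    mu          sync.Mutex
    webhooks    map[string]http.Handler
    setFieldsFn func(i interface{}) error
}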

Successfully merging this pull request may close these issues.

Intermittent yet consistent nil pointer issue in /webhook/admission/http.go:84