New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
🐛 Ensure that webhook server is thread/start-safe #1155
🐛 Ensure that webhook server is thread/start-safe #1155
Conversation
/retest |
Hi, it looks like our workaround to this problem was unsuccessful. Do you have any idea when this might make it into the next release of controller-runtime? Thanks, A |
This ensures that the webhook server is both threadsafe & "start-safe" -- i.e. you can register webhooks after starting the server. While this is generally not a common pattern, be allow runnables to be added to the manager after start, so it makes sense to do the same with hooks & the server.
ff5f036
to
5f1af13
Compare
Hmm... looks like there's a theoretical race in event broadcaster shutdown due to the event broadcaster calling race detector output
|
/retest |
This essentially reverts kubernetes-retired#1087, which breaks HNC on new clusters that haven't previously had HNC installed. It fixes the nondeterministic crashing problem by patching in kubernetes-sigs/controller-runtime#1155, which has been applied to controller-runtime 0.6.3 in @adrianludwin's repo. This is a temporary hack and will be removed when controller-runtime releases its own fix - likely 0.6.4. Tested: with the reversion of kubernetes-retired#1087 (main.go), HNC can be installed on a fresh cluster again but fails to start up ~50% of the time. With the fix to controller-runtime, it passes on 20/20 startup attempts. Ran e2e tests and got the same result as without this change (four failures).
This essentially reverts kubernetes-retired#1087, which breaks HNC on new clusters that haven't previously had HNC installed. It fixes the nondeterministic crashing problem by patching in kubernetes-sigs/controller-runtime#1155, which has been applied to controller-runtime 0.6.3 in adrianludwin's repo. This is a temporary hack and will be removed when controller-runtime releases its own fix - likely 0.6.4. Tested: with the reversion of kubernetes-retired#1087 (main.go), HNC can be installed on a fresh cluster again but fails to start up ~50% of the time. With the fix to controller-runtime, it passes on 20/20 startup attempts. Ran e2e tests and got the same result as without this change (four failures).
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: DirectXMan12, mengqiy The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
// PrepWithoutInstalling does the setup parts of Install (populating host-port, | ||
// setting up CAs, etc), without actually truing to do anything with webhook |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// PrepWithoutInstalling does the setup parts of Install (populating host-port, | |
// setting up CAs, etc), without actually truing to do anything with webhook | |
// PrepWithoutInstalling does the setup parts of Install (populating host-port, | |
// setting up CAs, etc), without actually trying to do anything with webhook |
// definitions. This is largely useful for internal testing of | ||
// controller-runtime, where we need a random host-port & caData for webhook | ||
// tests, but may be useful in similar scenarios. | ||
func (o *WebhookInstallOptions) PrepWithoutInstalling() error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we just call this Prepare
?
server, wasNew := func() (*webhook.Server, bool) { | ||
cm.mu.Lock() | ||
defer cm.mu.Unlock() | ||
|
||
if cm.webhookServer != nil { | ||
return cm.webhookServer, false | ||
} | ||
|
||
cm.webhookServer = &webhook.Server{ | ||
Port: cm.port, | ||
Host: cm.host, | ||
CertDir: cm.certDir, | ||
} | ||
if err := cm.Add(cm.webhookServer); err != nil { | ||
panic("unable to add webhookServer to the controller manager") | ||
return cm.webhookServer, true | ||
}() | ||
|
||
// only add the server if *we ourselves* just registered it. | ||
// Add has its own lock, so just do this separately -- there shouldn't | ||
// be a "race" in this lock gap because the condition is the population | ||
// of cm.webhookServer, not anything to do with Add. | ||
if wasNew { | ||
if err := cm.Add(server); err != nil { | ||
panic("unable to add webhook server to the controller manager") | ||
} | ||
} | ||
return cm.webhookServer | ||
return server |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless I missed something, we should be able to simplify this method. Why do we need an inner function that's called right away?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The IIFE sets the scope for the defer unlock call, which would otherwise be weird to unlock due to multiple return paths.
// mu protects access to the webhook map & setFields for Start, Register, etc | ||
mu sync.Mutex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Define a mutex above the "protected" fields, so code becomes documentation
This ensures that the webhook server is both threadsafe & "start-safe"
-- i.e. you can register webhooks after starting the server. While this
is generally not a common pattern, be allow runnables to be added to the
manager after start, so it makes sense to do the same with hooks & the
server.
Fixes #1148
/kind bug