
✨ Add support for controller-manager webhook shutdown delay #2601

Open · wants to merge 11 commits into base: main
Conversation


@dippynark dippynark commented Dec 2, 2023

Currently the controller-runtime signal handler cancels the context as soon as a termination signal is received:

// SetupSignalHandler registers for SIGTERM and SIGINT. A context is returned
// which is canceled on one of these signals. If a second signal is caught, the program
// is terminated with exit code 1.
func SetupSignalHandler() context.Context {

This is problematic for controller-managers that run a webhook server: graceful webhook server shutdown may begin before kube-proxy has observed the termination (and reconciled local iptables rules), leading to connection attempts after the webhook server's listeners have already been closed.

This PR adds a ShutdownDelay webhook server option that lets users delay webhook server shutdown, giving routes time to be updated so that clients connect to other server replicas before the listeners are closed. For example, on GKE the kube-proxy --iptables-min-sync-period flag is currently set to 10 seconds, so a delay of 15 seconds might make sense on such clusters.
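
A minimal usage sketch, assuming this PR's ShutdownDelay field lands on webhook.Options (the port and the 15-second value are illustrative, following the GKE example above):

package main

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/webhook"
)

func main() {
	// ShutdownDelay is the option proposed by this PR; 15s covers kube-proxy's
	// --iptables-min-sync-period on GKE (10s) with some margin.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		WebhookServer: webhook.NewServer(webhook.Options{
			Port:          9443,
			ShutdownDelay: 15 * time.Second,
		}),
	})
	if err != nil {
		panic(err)
	}

	// Register webhooks on mgr here, then block until a termination signal arrives.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}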

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: dippynark
Once this PR has been reviewed and has the lgtm label, please assign joelanford for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Dec 2, 2023
@k8s-ci-robot
Contributor

Welcome @dippynark!

It looks like this is your first PR to kubernetes-sigs/controller-runtime 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/controller-runtime has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Dec 2, 2023
@k8s-ci-robot
Contributor

Hi @dippynark. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Dec 2, 2023
Signed-off-by: Luke Addison <lukeaddison785@gmail.com>
Signed-off-by: Luke Addison <lukeaddison785@gmail.com>
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Dec 2, 2023
// of this manager before starting shutdown of any webhook servers to avoid
// receiving connection attempts after closing webhook listeners. If a second
// signal is caught, the program is terminated with exit code 1.
func SetupSignalHandlerWithDelay(delay time.Duration) context.Context {
Member

Do we really need this? IMHO this doesn't make sense, because whether it is needed is not a property of the binary but of the environment. Kubernetes allows using a preStop hook for this and recently even merged native sleep support for this precise use case: kubernetes/kubernetes#119026

Author

Ah nice, I wasn't aware of the sleep lifecycle hook feature! I agree with this, but before Kubernetes v1.28 (I think the first version where the sleep lifecycle hook is available) the application would otherwise need to ship with a sleep binary, so I think this option would still be worth having to cover the long tail of Kubernetes versions.

However, happy for this to be closed and wait for the sleep hook for a native solution!

@inteon
Member

inteon commented Dec 3, 2023

@dippynark Should we not create a webhook-specific solution?

I remember reading a great article on how to properly shut down a web server on Kubernetes; unfortunately, I cannot find it anymore.
I do remember that new connections should still be accepted, but existing connections should start being drained when you receive a shutdown signal. After a certain amount of time, the server can stop accepting new connections and fully shut down.
It would be great if controller-runtime could apply some of these best practices by default (only to its webhooks). I'm not sure just delaying the context cancel is the most elegant solution.

I think in go-world this translates to the following (sketched below):

  1. call server.SetKeepAlivesEnabled(false) on context cancel
  2. call server.Shutdown(shutdownCtx) after timeout A (now no new connections are accepted)
  3. shutdownCtx is canceled after timeout B (now all remaining connections are forcefully closed)
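
A rough sketch of that sequence against a plain net/http.Server (the timeout values are illustrative only, not part of this PR):

package main

import (
	"context"
	"net/http"
	"time"
)

// gracefulShutdown follows the three steps above: stop keep-alives on context
// cancel, wait for routing to migrate, then drain with a hard deadline.
func gracefulShutdown(ctx context.Context, srv *http.Server) error {
	<-ctx.Done() // termination signal received

	// 1. Stop reusing existing connections so clients reconnect elsewhere
	//    once routing has been updated.
	srv.SetKeepAlivesEnabled(false)

	// 2. After timeout A, stop accepting new connections and start draining.
	time.Sleep(10 * time.Second)

	// 3. Timeout B bounds the drain; if it expires before all connections have
	//    finished, Shutdown returns the context error and Close force-closes
	//    whatever is left.
	shutdownCtx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
	defer cancel()
	if err := srv.Shutdown(shutdownCtx); err != nil {
		return srv.Close()
	}
	return nil
}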

@dippynark
Author

@inteon Nice! When I looked at this I couldn't see how to add some time between triggering shutdown and closing listeners: https://github.com/golang/go/blob/de5b418bea70aaf27de1f47e9b5813940d1e15a4/src/net/http/server.go#L2986-L2989

But didn't know about SetKeepAlivesEnabled! Will have a go at doing shutdown this way

@inteon
Member

inteon commented Dec 3, 2023

> @inteon Nice! When I looked at this I couldn't see how to add some time between triggering shutdown and closing listeners: https://github.com/golang/go/blob/de5b418bea70aaf27de1f47e9b5813940d1e15a4/src/net/http/server.go#L2986-L2989
>
> But didn't know about SetKeepAlivesEnabled! Will have a go at doing shutdown this way

@dippynark I created a POC here: #2607. Feel free to copy that code or leave feedback on that PR.

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 3, 2023
@dippynark
Author

dippynark commented Dec 3, 2023

@inteon Ah, just saw your POC; I have also updated this PR. I think we want the delay to be configurable though, because the amount of time we want to wait depends on the kube-proxy --iptables-min-sync-period flag (e.g. on GKE it's set to 10 seconds, so we'd want to wait for more than 10 seconds). I kept the default at 0 for backwards compatibility (but can change the default to 10 seconds if that is preferred).

I have also not disabled keep-alives before waiting for the shutdown delay. If clients are still trying to open new connections and we disable keep-alives, those connections will only last for a single HTTP request (and there is nothing stopping the client from opening a new connection to the same server that is about to be shut down). Instead, I think we want to simply wait, keeping the server running as normal until clients stop trying to open new connections; when Shutdown is then called, the HTTP standard library already disables keep-alives and waits for active connections to close: https://github.com/golang/go/blob/de5b418bea70aaf27de1f47e9b5813940d1e15a4/src/net/http/server.go#L3385

Also, do we definitely want to stop the readiness probe? If users are using the same endpoint for liveness then the server might be killed early. IMO we should leave the health check alone and rely on the termination of the Pod to signal Kubernetes to stop sending traffic (but I'm not too sure what situations you were thinking of where Kubernetes doesn't know about it).

I can copy any code from your POC PR depending on what you think of the above (I haven't done a controller-runtime PR before so would prefer to merge this one if it's all the same to you 😄)

@dippynark dippynark changed the title ✨ Add support for controller-manager shutdown delay ✨ Add support for controller-manager webhook shutdown delay Dec 3, 2023
@inteon
Member

inteon commented Dec 3, 2023

I do agree with not making the readiness/liveness probe fail. I think most controllers will only shut down because Kubernetes told them to. The only situation in which this might be useful is if there was some kind of critical failure that forces the pod to restart. Currently c/r does not provide a clear readiness <-> liveness split for this probe, so it would likely be added to the liveness probe too, which is undesirable.

I do think, however, that the SetKeepAlivesEnabled(false) call is useful. If the routing has been updated before the --iptables-min-sync-period elapses, it allows long-running connections to switch to the new routes early, so the Shutdown process won't take as long. This prevents users with a low terminationGracePeriodSeconds from failing to drain long-lasting connections because the grace period ends before Shutdown is called.

Please do copy. I just wrote it down as an experiment. I officially give you all the copyrights for that code 😄.

@dippynark
Author

dippynark commented Dec 3, 2023

@inteon Sounds good! I have changed the PR to disable keep-alives, so I think it is good, as long as you're happy with defaulting ShutdownDelay to 0? 10 seconds works for me too, I just couldn't really make up my mind (I have a very slight preference for 0 to maintain current behaviour and because the optimal value depends on the cluster configuration).

@alvaroaleman
Member

Another thing that I do not understand about this: the kube-apiserver doesn't actually use kube-proxy to reach webhooks and extension apiservers; it resolves the endpoints internally, which means the described problem shouldn't apply...?

@dippynark
Author

@alvaroaleman I thought this was the case too before working on this, but as far as I can tell the standard API Server -> webhook setup does rely on kube-proxy.

Definitely not 100% on this, but my understanding is that when the API server wants to connect to a webhook behind a Kubernetes Service, it typically proxies the request via the Konnectivity agent running on the cluster, which performs Service DNS resolution and connects to the Service VIP, making use of kube-proxy routing; this allows connectivity even when the control plane and webhook Pods are in disjoint networks.

I tried to trace through the API server code, which seems to confirm this. For example, here the API server sets up the webhook client using the Service hostname before delegating to the cluster dialer provided by the EgressSelector, instead of picking one of the endpoints itself.

Comment on lines 264 to 265
case <-time.After(s.Options.ShutdownDelay):
case <-ctx.Done():
Member
@inteon inteon Dec 3, 2023

This construction basically means that ShutdownDelay should not be longer than 1 minute, right?
Maybe that is something we can enforce & document in the comments linked to this field?

Author

Hmm, yeah, I don't think it's great being capped at 1 minute. Does it look better now? I've changed it to always wait for the full shutdown delay.

@inteon
Member

inteon commented Dec 11, 2023

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Dec 11, 2023
@dippynark
Author

/test pull-controller-runtime-test

// migrating clients to server instances that are not about to shutdown
srv.SetKeepAlivesEnabled(false)
// Wait before shutting down webhook server
<-time.After(s.Options.ShutdownDelay)
Member

Suggested change
<-time.After(s.Options.ShutdownDelay)
time.Sleep(s.Options.ShutdownDelay)

We don't need to create a channel, I think.

Author
@dippynark dippynark Dec 12, 2023

Oops, fixed

@@ -101,6 +101,10 @@ type Options struct {

// WebhookMux is the multiplexer that handles different webhooks.
WebhookMux *http.ServeMux

// ShutdownDelay delays server shutdown to wait for clients to stop opening
// new connections before closing server listeners. Defaults to 0.
Member

Open question:
Would it make sense to set a > 0 default, e.g. 15s, which is less than the 30-second default graceful shutdown period?

Author
@dippynark dippynark Dec 12, 2023

It's a tricky one for sure, but because it's so specific to the Kubernetes cluster you're running on, it doesn't feel right to align to any particular cluster or cloud provider (I'm not sure what value EKS and AKS use, for example).

IMO we should keep it at 0 for backwards compatibility and so that, by default, connection draining 'doesn't work' consistently across clusters (rather than working on some and not others). I think this is also more likely to encourage applications that care a lot about this (e.g. Gatekeeper) to expose a flag for it, making it easier to reconfigure for the cluster you're running on (i.e. pushing the decision onto users of controller-runtime, since an optimal default isn't clear).

No problem with changing to 15 seconds by default though if you were leaning towards that (that would be better for us anyway because we're using GKE 😄). We could alternatively align to the kube-proxy default of 1 second? But on GKE at least that won't help much.
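
For illustration, a sketch of the flag-based approach mentioned above, assuming this PR's ShutdownDelay field and a hypothetical --webhook-shutdown-delay flag (both the flag name and the wiring are illustrative):

package main

import (
	"flag"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/webhook"
)

func main() {
	// Hypothetical flag: lets operators tune the delay per cluster, e.g.
	// slightly above kube-proxy's --iptables-min-sync-period.
	var webhookShutdownDelay time.Duration
	flag.DurationVar(&webhookShutdownDelay, "webhook-shutdown-delay", 0,
		"how long to keep serving webhook traffic after a termination signal")
	flag.Parse()

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		WebhookServer: webhook.NewServer(webhook.Options{
			ShutdownDelay: webhookShutdownDelay, // field proposed by this PR
		}),
	})
	if err != nil {
		panic(err)
	}

	// Register webhooks on mgr here, then run until a termination signal arrives.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}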

@inteon
Member

inteon commented Dec 12, 2023

> Another thing that I do not understand about this: the kube-apiserver doesn't actually use kube-proxy to reach webhooks and extension apiservers; it resolves the endpoints internally, which means the described problem shouldn't apply...?

@alvaroaleman There seems to be an --enable-aggregator-routing flag, maybe that is related? (https://kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/)

> Turns on aggregator routing requests to endpoints IP rather than cluster IP.

@sbueringer
Member

Q: Did I get this right that kube-proxy reacts to the deletionTimestamp on the Pod, and not only to readiness, when dropping a Pod from a Service?

@dippynark
Author

dippynark commented Dec 18, 2023

@sbueringer Yeah, my understanding is that the following actions are triggered simultaneously when a Pod's deletion timestamp is set:

  • The kubelet runs any preStop hooks and then sends a TERM signal to the container processes
  • EndpointSlices are updated to indicate that the Pod is terminating which triggers kube-proxy to remove corresponding iptables rules

There are more details here: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination

Note that there is currently no way to guarantee that all kube-proxy instances have reconciled EndpointSlice updates before shutting down (leading to issues like the one I described in this PR). The best mechanism I am aware of is to delay shutdown for slightly longer than the --iptables-min-sync-period kube-proxy flag: kubernetes/kubernetes#106476

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 17, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 16, 2024