⚠ Support shutdown controllers and watches dynamically #2099

Conversation

FillZpp
Contributor

@FillZpp FillZpp commented Dec 14, 2022

Signed-off-by: FillZpp <FillZpp.pub@gmail.com>

Support shutting down controllers and watches dynamically.

API changes:

  • An optional ControllerCtx is added to controller.Options to let developers stop a specific controller and its watches.
  • A Stop() method is added to some stoppable sources, e.g., Kind, KindWithCache, Informer, to let developers stop only a specific watch.

fixes #1884

@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Dec 14, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: FillZpp

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 14, 2022
@FillZpp FillZpp force-pushed the support-shutdown-controllers-and-watches branch from 8154bac to fae6ec9 Compare December 14, 2022 04:01
@FillZpp
Contributor Author

FillZpp commented Dec 14, 2022

K8s 1.26 has been released, and it adds support for removing event handlers (kubernetes/kubernetes#111122), so controller-runtime should support this in the next v0.14 release.

I'm still adding tests. One thing I'm not sure about is whether we should add a ControllerCtx to controller.Options or add a Stop method to the Controller. Which one is the better way to let developers stop a running controller?

Any suggestions? @alvaroaleman @vincepri @joelanford

pkg/internal/controller/controller.go
is.mu.Lock()
defer is.mu.Unlock()
if is.canceled {
	return nil
Member

Shouldn't we error rather than silently doing nothing?

Contributor Author

I'm not sure. Maybe we have to tolerate the source being stopped multiple times, by the controller or by the user?
For example, someone may create a controller with three watches, then stop one watch manually, and finally stop the whole controller. At that point the controller will try to stop the already-stopped watch source, and it would get the error.
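
The scenario above (a watch stopped by hand, then stopped again when the whole controller shuts down) is the usual argument for an idempotent Stop. A minimal sketch of that pattern follows. The stoppableSource type and its remove hook are hypothetical stand-ins for illustration, not the types in this PR.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// stoppableSource is a hypothetical stand-in for a watch source; it is not
// the controller-runtime type discussed in this PR.
type stoppableSource struct {
	mu       sync.Mutex
	canceled bool
	// remove is a hypothetical hook that detaches the event handler.
	remove func() error
}

// Stop detaches the handler once. Later calls return nil, so a
// controller-wide stop does not fail on a watch the user already stopped.
func (s *stoppableSource) Stop() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.canceled {
		return nil
	}
	s.canceled = true
	if s.remove != nil {
		return s.remove()
	}
	return nil
}

func main() {
	calls := 0
	src := &stoppableSource{remove: func() error {
		calls++
		if calls > 1 {
			// Would only be reached if Stop were not idempotent.
			return errors.New("handler already removed")
		}
		return nil
	}}
	fmt.Println(src.Stop(), src.Stop(), calls)
}
```

The trade-off raised in the review still applies: returning nil hides accidental double stops, while returning an error would force the controller to track which watches the user already stopped.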


// ControllerCtx is the optional context for only this Controller. If it is set and has been canceled,
// this controller and its watches will be stopped dynamically.
ControllerCtx context.Context
Member

Why not add a Stop method instead?

Contributor Author

Yeah, as I asked in #2099 (comment).
Now I have changed ControllerCtx to a Stop method.

is.canceled = true

if is.eventHandlerRegistration != nil {
	return is.Informer.RemoveEventHandler(is.eventHandlerRegistration)
Member

None of this actually stops the informer, which I'd argue is the more important part. Is that a follow-up?

Admittedly not easy, we will likely need to do some kind of refcounting

Contributor Author

Good question. It's hard to determine when we should stop the informer. The informers in the cache are created not only by source watches but also by users' Get and List calls to the DelegatingClient, so we can't manage an informer's lifecycle purely by counting its active watches.

Member

Thinking out loud here, a few options:

  1. Ref count for watches + some sort of LRU or timed cache for gets/lists that don't already have informers?
  2. Ref count for watches + if a get/list requires a new informer (i.e. a watch didn't already start one), that informer is never removed?
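
Option 2 can be sketched as plain bookkeeping, independent of any real cache. The sketch below assumes a hypothetical informerRefs type keyed by a kind string. None of these names exist in controller-runtime, and actual informer removal is left out.

```go
package main

import (
	"fmt"
	"sync"
)

// informerRefs is a hypothetical bookkeeping structure; none of these names
// exist in controller-runtime.
type informerRefs struct {
	mu     sync.Mutex
	counts map[string]int  // active watches per kind
	pinned map[string]bool // informers first created by a get/list call
}

func newInformerRefs() *informerRefs {
	return &informerRefs{counts: map[string]int{}, pinned: map[string]bool{}}
}

// AddWatch records one more watch on the informer for kind.
func (r *informerRefs) AddWatch(kind string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.counts[kind]++
}

// MarkRead is called when a get/list needs an informer. If no watch has
// started one yet, the informer is pinned and never removed (option 2).
func (r *informerRefs) MarkRead(kind string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.counts[kind] == 0 {
		r.pinned[kind] = true
	}
}

// RemoveWatch drops one watch and reports whether the informer may now be
// removed. It returns false when no watch was tracked for kind.
func (r *informerRefs) RemoveWatch(kind string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.counts[kind] == 0 {
		return false
	}
	r.counts[kind]--
	return r.counts[kind] == 0 && !r.pinned[kind]
}

func main() {
	refs := newInformerRefs()
	refs.AddWatch("Pod")
	refs.AddWatch("Pod")
	refs.MarkRead("Pod")       // a watch already owns this informer: not pinned
	refs.MarkRead("ConfigMap") // created by a read: pinned
	refs.AddWatch("ConfigMap")

	fmt.Println(refs.RemoveWatch("Pod"))       // one watch still active
	fmt.Println(refs.RemoveWatch("Pod"))       // last watch gone, removable
	fmt.Println(refs.RemoveWatch("ConfigMap")) // pinned by the earlier read
}
```

In this sketch the outcome depends on whether the read or the watch reached the informer first, which is the ordering sensitivity such a scheme would have to document.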

Contributor

Hey! Just found this PR.

Gatekeeper is doing something similar, forking controller-runtime's informer cache to add a function that can remove informers. Would it be possible to have this available before solving the ref count issue? That would at least give users who are comfortable handling that complexity the ability to do so, but probably wouldn't catch unwary users.

Here is the key function we rely on:

https://github.com/open-policy-agent/gatekeeper/blob/8b426fb55da22abc0fe9bc925a3ca1ed08df50fe/third_party/sigs.k8s.io/controller-runtime/pkg/dynamiccache/informer_cache.go#L242-L251

We currently use it by maintaining a separate cache for dynamic watches:

https://github.com/open-policy-agent/gatekeeper/blob/8b426fb55da22abc0fe9bc925a3ca1ed08df50fe/pkg/watch/manager.go#L64-L91

That exports "registrars" to managing controllers to add or remove watches.

https://github.com/open-policy-agent/gatekeeper/blob/8b426fb55da22abc0fe9bc925a3ca1ed08df50fe/pkg/watch/registrar.go#L213-L276

https://github.com/open-policy-agent/gatekeeper/blob/8b426fb55da22abc0fe9bc925a3ca1ed08df50fe/pkg/controller/constrainttemplate/constrainttemplate_controller.go#L570-L582

Static watches (i.e. old-school controller.Watch() and watches initiated by client.Get()) use a separate cache.

Happy to talk more about this model, if interested, or help implementing parts if it means we can stop maintaining a fork!

Member

This is great background! I might rebase and continue the work on this PR (unless @FillZpp has time) to get it to the finish line before 0.15 is released. Feel free to reach out on Slack if we want to chat more about it and brainstorm.

Contributor

Happy to brainstorm! (sorry, it's been a busy week, so haven't reached out yet)

Contributor Author

@FillZpp FillZpp Jan 30, 2023

Sorry for the late reply (I was on vacation for the last two weeks). I'll continue working on this and get it into 0.15.

About the removal of informers, it seems we have several options:

  1. Keep ref counts to remove informers automatically, as @joelanford suggested:

     Thinking out loud here, a few options:

       1. Ref count for watches + some sort of LRU or timed cache for gets/lists that don't already have informers?
       2. Ref count for watches + if a get/list requires a new informer (i.e. a watch didn't already start one), that informer is never removed?

     I prefer this approach, but I'm not sure whether it will confuse users. Most of them won't know when and why an informer is removed or kept, and the order in which watches and get/list calls happen also affects whether the informer is removed.

  2. Expose a method to let users manually remove an informer, as @maxsmythe suggested, IIUC.

  3. Maybe provide both 1 and 2, if both are needed.

WDYT @vincepri @alvaroaleman @joelanford @sbueringer

Contributor

@maxsmythe maxsmythe Jan 30, 2023

To be clear, I don't think (1) or (2) are in conflict: our real need is to have informer cache support a Remove() function, which is something a reference counter would need anyway. (2) just makes that behavior public.

Also, WRT reference counting, Gatekeeper has the additional nuance of using a different cache entirely for reference-count-style watch governance to avoid interference with watches established in more conventional ways (client.Get()) -- this seems similar to option (1.2).

Contributor

@maxsmythe maxsmythe Jan 30, 2023

In addition, because Gatekeeper uses generic controllers (we have one controller that listens for all constraint kinds), (2) would be a better fit for our model than governing watch livelihood at the controller granularity.

Because dynamic watches are probably best for generic-type controllers (if you have the type hard-coded, why the need for dynamic watches?), managing the watches directly may be a better fit than managing controllers. I think both are workable, but (2) would definitely be less of a reach for Gatekeeper to integrate.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 6, 2023
@FillZpp FillZpp force-pushed the support-shutdown-controllers-and-watches branch from e9c84ba to eb9b576 Compare January 13, 2023 09:14
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 13, 2023
@FillZpp FillZpp force-pushed the support-shutdown-controllers-and-watches branch 2 times, most recently from f387f9d to 5fc22ad Compare January 13, 2023 09:31
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 19, 2023
Comment on lines +89 to +92

// Stop stops the controller and all its watches dynamically.
// Note that it only triggers the stop and does not wait for them all to stop.
Stop() error
Member

@inteon inteon Jan 27, 2023

@FillZpp Why do we add a Stop function here instead of canceling the context that was passed to Start?

Contributor Author

  1. Not all implementations of Source can be stopped, e.g., Func or some custom types.
  2. Users have no way to cancel the context that the internal controller passes to a source, so they cannot stop just a single watch.
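
The first point is commonly handled in Go with an optional interface and a type assertion. The sketch below uses simplified, hypothetical Source and Stoppable interfaces; the real Source interface in controller-runtime has a different Start signature.

```go
package main

import "fmt"

// Source and Stoppable are simplified, hypothetical interfaces used only for
// illustration.
type Source interface{ Start() error }

// Stoppable is the optional capability that only some sources implement.
type Stoppable interface{ Stop() error }

// kindSource stands in for a source that can be stopped.
type kindSource struct{ stopped bool }

func (k *kindSource) Start() error { return nil }
func (k *kindSource) Stop() error  { k.stopped = true; return nil }

// funcSource stands in for a source with no way to stop, such as a bare function.
type funcSource func() error

func (f funcSource) Start() error { return f() }

// stopAll stops every source that implements Stoppable and skips the rest.
// It returns how many sources were asked to stop and the first error seen.
func stopAll(sources []Source) (stopped int, firstErr error) {
	for _, s := range sources {
		st, ok := s.(Stoppable)
		if !ok {
			continue // this source cannot be stopped individually
		}
		stopped++
		if err := st.Stop(); err != nil && firstErr == nil {
			firstErr = err
		}
	}
	return stopped, firstErr
}

func main() {
	sources := []Source{
		&kindSource{},
		funcSource(func() error { return nil }),
	}
	n, err := stopAll(sources)
	fmt.Println(n, err)
}
```

Keeping Stop on a separate interface means existing custom sources continue to compile unchanged, at the cost of callers not knowing statically which sources can be stopped.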

Signed-off-by: FillZpp <FillZpp.pub@gmail.com>
@FillZpp FillZpp force-pushed the support-shutdown-controllers-and-watches branch from 5fc22ad to 7728c9e Compare January 30, 2023 03:51
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 30, 2023
@FillZpp FillZpp marked this pull request as ready for review January 31, 2023 11:22
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 31, 2023
@FillZpp
Contributor Author

FillZpp commented Jan 31, 2023

How about getting this PR merged first? I will then post new PRs to support removal of informers (automatically and manually). Otherwise it will become a huge PR with too many API changes to review.
/cc @vincepri @alvaroaleman

@FillZpp
Contributor Author

FillZpp commented Jan 31, 2023

/retest

@k8s-ci-robot
Contributor

@FillZpp: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                            Commit   Details  Required  Rerun command
pull-controller-runtime-test-master  7728c9e  link     true      /test pull-controller-runtime-test-master

@inteon
Member

inteon commented Jan 31, 2023

@FillZpp I created another PR (#2159) that tries to solve the same problem as your PR.
PTAL, feel free to accept the approach in that PR or to copy (some of) the code from that PR to this PR instead.

@FillZpp
Contributor Author

FillZpp commented Feb 1, 2023

Thanks @inteon, I'll close this PR and help out on the new one.

/close

@k8s-ci-robot
Contributor

@FillZpp: Closed this PR.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add and remove watches at runtime
7 participants