Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] 🌱 Controller Lifecycle Management #1180

Conversation

kevindelgado
Copy link
Contributor

@kevindelgado kevindelgado commented Sep 25, 2020

This is a proof of concept PR for the Controller Lifecycle Management design doc at #1192.

It is:

  1. A rebase of shomron’s PR at ✨ Allow removing individual informers from the cache (#935) #936 to enable removal of individual informers that is pre-requisite for the other goals.
  2. A minimal set of hooks needed to enable a user to conditionally start/stop/restart controllers on their own.
  3. An implementation of EventHandler reference counting on the Informer.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Sep 25, 2020
@k8s-ci-robot
Copy link
Contributor

Welcome @kevindelgado!

It looks like this is your first PR to kubernetes-sigs/controller-runtime 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/controller-runtime has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Copy link
Contributor

Hi @kevindelgado. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Sep 25, 2020
@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Sep 25, 2020
@kevindelgado kevindelgado changed the title ✨ Conditionally Runnable Controllers [WIP] ✨ Conditionally Runnable Controllers Sep 25, 2020
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 25, 2020
@kevindelgado
Copy link
Contributor Author

kevindelgado commented Sep 25, 2020

Adding WIP, recent change to prevent controllers from starting more than once (#1139) breaks this. Need to find a way to allow conditional runnables to set the controller's started field to false upon CRD uninstall.

}

cm.startedLeader = true
}

func (cm *controllerManager) startConditionalRunnables() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @kevindelgado , thanks for this PR. It seems to be its a bit chaotic and involves a set of semi-related changes, like the removal of informers which is IIRC not something we have agreed on and definitely something for a distinct PR.

As for the conditionanally runnables idea, a much simpler approach would be to just check whatever you want to check before adding them to the manager:

if clusterHasMyCRD(){
    controller.New()
}

This also has the advantage that you can check anything, not whatever happens to be implemented for this in the manager.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback @alvaroaleman!

We’re definitely open to better ways of doing this, but we are currently doing what you suggest. Outside of controller-runtime we conditionally start the manager based on the existence of the CRD. A couple issues we see are:

  1. Increasing complexity. Specifically, when multiple controllers are abstracted to appear as a single controller or when the manager is running multiple controllers that watch a mixture of CRDs that will always be installed on the cluster, but also CRDs that are only sometimes installed.
  2. This approach still isn’t great for fluidly handling CRD install/uninstall/reinstall (rather than just waiting for install before starting), especially when latency becomes important immediately upon CRD installation.

/cc @DirectXMan12 in case they can better explain our desire to push this functionality further down the stack into c-r.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds to me like you should probably build a metacontroller that watches crds and then starts and stops the actual controllers.

If you continue to think this requires changes in controller-runtime, the first step would be a design doc that describes the use-case, why this is problematic with how controller-runtime works currently, what changes would be required and possible alternatives so we can properly discuss this. We have the /designs folder for this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure thing, we’re discussing alternatives and working to create a more compelling story for this and will have a design doc up if/when that is the case.

Sorry for the lack of context here, when learning the codebase it was easier to express ideas through concrete code than in prose initially.

While the code has been helpful to steer discussions and now make it easier to formalize into a design doc, I definitely understand that it appears chaotic and disorganized externally. Feel free to close if you want to reduce clutter in the repo, I can always submit a new one if necessary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can probably move a lot of this into a separate helper runnable that manages its own sub-runnables and managed the start/stop of things. We'd still need a way to start/stop caches though, and possible some other minor tweaks.

At the very least I think it'd be useful to have some basic hooks for better being able to start/stop controllers in CR, especially for multi-cluster usecases, where you don't necessarily want to/can't degrade performance for all clusters just to deal with a change in one. It also provides a motivating minimal usecase for starting/stopping caches

I think one of the things to keep in mind here is that if we go with the approach of "oh, you should just build it as a meta-controller", we need to make sure the hooks are there to do it cleanly, and we need to make sure those hooks are documented as such so that they don't get removed later on down the road.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the very least I think it'd be useful to have some basic hooks for better being able to start/stop controllers in CR, especially for multi-cluster usecases, where you don't necessarily want to/can't degrade performance for all clusters just to deal with a change in one. It also provides a motivating minimal usecase for starting/stopping caches

Sure thing, but today we don't have any official story at all for multi-cluster without the assumption that clusters might come and go. I am not objecting to the basic idea (although I still think that if this is about being fast as mentioned above, a metacontroller would be the best), I just think that this requires a lot of groundwork first.

@kevindelgado kevindelgado marked this pull request as draft September 28, 2020 20:07
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 28, 2020
@kevindelgado kevindelgado changed the title [WIP] ✨ Conditionally Runnable Controllers ✨ Conditionally Runnable Controllers Sep 28, 2020
@kevindelgado
Copy link
Contributor Author

Fixed the rebasing issues caused by not being able to restart controllers.

@kevindelgado kevindelgado marked this pull request as ready for review September 28, 2020 21:24
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 28, 2020
@kevindelgado kevindelgado marked this pull request as draft September 28, 2020 22:46
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 28, 2020
@kevindelgado kevindelgado changed the title ✨ Conditionally Runnable Controllers [WIP] ✨ Conditionally Runnable Controllers Sep 28, 2020
if !ok {
log.Error(errors.New("runnable is not a controller"), "")
}
controller.Started = false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not a safe operation, IIRC

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kevindelgado
To complete the pull request process, please assign joelanford after the PR has been reviewed.
You can assign the PR to them by writing /assign @joelanford in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kevindelgado kevindelgado changed the title [WIP] ✨ Conditionally Runnable Controllers [WIP] ✨ Conditional Controllers Proof of Concept Oct 1, 2020
@kevindelgado kevindelgado changed the title [WIP] ✨ Conditional Controllers Proof of Concept [WIP] 🌱 Conditional Controllers Proof of Concept Oct 1, 2020
@kevindelgado
Copy link
Contributor Author

Here’s a major overhaul of the previous approach that hopefully is much less invasive to the existing code that goes along with the design doc at #1192. Gonna bring this up for discussion at the next kubebuilder meeting.

If you can take a look at each commit individually, hopefully it comes across as less chaotic.

The second one is pretty minimal and is the only change strictly necessary to solving our problem (the second and third are optional and less fleshed out, the first is just a direct rebase and copy of #936).

The hope is that we can find some part of this small enough to move forward with, without first having a great story around multi-cluster.

@alvaroaleman @DirectXMan12

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 1, 2020
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Oct 7, 2020
@kevindelgado kevindelgado changed the title [WIP] 🌱 Conditional Controllers Proof of Concept [WIP] 🌱 Controller Restart Minimal Hooks Oct 7, 2020
@kevindelgado
Copy link
Contributor Author

I've distilled this PR into just the minimal changes to C-R necessary to enable better controller lifecycle management.

The POC for conditional controllers is on this branch https://github.com/kevindelgado/controller-runtime/tree/experimental/conditional-runnables, but it was clear that there's plenty to discuss just on allowing informer removal and the minimal hooks needed use informer removal to start/stop/restart controllers cleanly.

The design doc at #1192 has also been updated to reflect this.

shomron and others added 3 commits October 18, 2020 08:07
This change adds support for removing individual informers from the
cache using the new Remove() method. This is allowed before or after the
cache has been started.  Informers are stopped at the time of removal -
once stopped, they will no longer deliver events to registered event
handlers, and registered watched will be aborted.

Also adds non-blocking API for getting informers without waiting for
their
cache to sync - GetInformerNonBlocking().

Signed-off-by: Oren Shomron <shomron@gmail.com>
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 19, 2020
@kevindelgado kevindelgado changed the title [WIP] 🌱 Controller Restart Minimal Hooks [WIP] 🌱 Controller Lifecycle Management Oct 19, 2020
@kevindelgado
Copy link
Contributor Author

Updated to reflect updates to the doc #1192, PTAL anyone interested

@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 17, 2021
@k8s-ci-robot
Copy link
Contributor

@kevindelgado: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 17, 2021
@fejta-bot
Copy link

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 16, 2021
@fejta-bot
Copy link

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot
Copy link
Contributor

@fejta-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants