
✨ Add ability to register runnables as pre-start hooks #2044

Open
wants to merge 1 commit into base: main

Conversation

terinjokes
Contributor

@terinjokes terinjokes commented Nov 14, 2022

When implementing a controller that uses leader election, there may be
work that needs to be done after winning the election but before
processing enqueued requests. For example, a controller may need to
build up an internal mapping of the current state of the cluster before
it can begin reconciling.

This changeset adds support for adding prestart hooks to
controller-runtime's manager implementation. This hook runs after the
manager has been elected leader, immediately before the leader election
controllers are started.

Related #607
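
For orientation, a minimal usage sketch of the pattern this changeset enables, assuming a hypothetical registration method on the manager (the AddPrestartHook name and signature below are illustrative, not the final API of this PR):

package main

import (
	"context"
	"log"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

func main() {
	mgr, err := manager.New(ctrl.GetConfigOrDie(), manager.Options{
		LeaderElection:   true,
		LeaderElectionID: "example.my-operator.io",
	})
	if err != nil {
		log.Fatal(err)
	}

	// Hypothetical: runs after leader election is won, immediately before
	// the leader-election runnables (controllers) are started. An error
	// returned here aborts manager startup.
	if err := mgr.AddPrestartHook(func(ctx context.Context) error {
		// e.g. build an internal mapping of the current cluster state
		return nil
	}); err != nil {
		log.Fatal(err)
	}

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		log.Fatal(err)
	}
}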

@k8s-ci-robot k8s-ci-robot added do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Nov 14, 2022
@k8s-ci-robot
Contributor

Hi @terinjokes. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Nov 14, 2022
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. labels Nov 14, 2022
@alvaroaleman
Member

For example, a controller may need to build up an internal mapping of the current state of the cluster before it can begin reconciling.

Can you elaborate? Why not use a sync.Once at the beginning of the reconciliation? What you want to do is definitely already possible today, albeit a bit cumbersome. I am not convinced this is a use case common enough to warrant adding first-class support.
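
For context, the suggested pattern looks roughly like the following sketch (initState is a placeholder for the one-time setup); the reply below explains why it falls short:

package example

import (
	"context"
	"sync"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Reconciler sketches the sync.Once-at-the-top-of-Reconcile approach.
type Reconciler struct {
	client.Client

	once    sync.Once
	initErr error
}

func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// One-time warm-up before handling any request; the error is captured
	// so that every worker can observe it.
	r.once.Do(func() {
		r.initErr = r.initState(ctx)
	})
	if r.initErr != nil {
		// There is no way to stop the controller from here; returning the
		// error only requeues this item.
		return ctrl.Result{}, r.initErr
	}

	// ... normal reconciliation ...
	return ctrl.Result{}, nil
}

// initState stands in for whatever cluster-wide warm-up is needed.
func (r *Reconciler) initState(ctx context.Context) error { return nil }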

@terinjokes
Contributor Author

terinjokes commented Nov 15, 2022

Thanks @alvaroaleman for taking a look at this PR. Unfortunately, I've found that using sync.Once inside of a reconciler function is not a great alternative, on both ergonomic and correctness fronts. Attempting to do so will, depending on the controller involved, cause incorrect reconciliations at best, or loss of data at worst. I'll explore these in turn.

Ergonomics

Currently the reconciler function has a clear scope: it processes a single request to reconcile objects in the cluster. This makes testing it very easy. By adding more functionality to the reconciler function, we make it more difficult to test: now we must ensure the once function is safe to run in tests, be able to swap the function for one that is safe, reuse a once that is already done across tests, or split our reconcile function into two. I'm not a fan of any of these options; they all complicate the testing of the controller.

Correctness

Attempting to use sync.Once to perform work before starting reconciliation is difficult to do correctly, especially when coupled with the internal controller logic.

The sync.Once API does not have provisions for returning an error. The do function must be carefully written to properly lift the error for all workers. Even if we do lift the error out of the do function, there is no provision for the reconcile function to stop the controller: returning an error only requeues the item.

We can attempt to work around this by using panic in our do function. This will eventually stop the entire program by default, but now we're using panics for flow control. If we don't recover on the main goroutine, we lose the ability to handle these failures, such as doing any cleanup work or releasing any locks.

If, on the other hand, one sets the controller option RecoverPanic, the panic is recovered by the internal controller and turned into an error, and the item is requeued.

This becomes even harder to do correctly if you run multiple workers with MaxConcurrentReconciles. As the Godoc for sync.Once explains:

If f panics, Do considers it to have returned; future calls of Do return without calling f.

This allows other workers to continue past the once while the panic is still being handled on the panicking goroutine, potentially taking destructive actions based on the once function's incomplete work.
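
The quoted behavior is easy to demonstrate in isolation. A minimal, self-contained example (plain Go, outside controller-runtime) showing that a panic inside Do marks the Once as done:

package main

import (
	"fmt"
	"sync"
)

func main() {
	var once sync.Once

	// First call: f panics. Per the docs, Do considers f to have returned.
	func() {
		defer func() { recover() }() // swallow the panic for the demo
		once.Do(func() { panic("setup failed") })
	}()

	// Second call: f is never invoked again, so a worker reaching this
	// point proceeds even though the setup never completed.
	ran := false
	once.Do(func() { ran = true })
	fmt.Println("second Do ran f:", ran) // prints: second Do ran f: false
}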


This pull request, which is roughly 15 lines, runs the prestart hooks before starting workers, so there are no synchronization issues across workers to deal with. A prestart function that returns an error stops the controller, which returns an error from its Start function; the manager can use that to stop other Runnables and release resources before returning the error to the main package.

This follows the normal flow control for errors users of controller-runtime expect, and eliminates the possibility of handling the error case incorrectly.


For example, a controller may need to build up an internal mapping of the current state of the cluster before it can begin reconciling.

Can you elaborate?

I've seen this problem often when we need to compute things across the entire cluster before being able to reconcile individual resources. For the current project I'm working on, I need to compute address allocation maps so I don't allocate the same address twice and cause havoc in networking. I've had use cases where I would have used a hook, and due to their absence I usually end up using client-go instead, but that's been a miserable experience.

As people continue to build on Kubernetes, especially when combined with Crossplane and kcp, I expect even more people needing to do things between winning a leader election and starting reconcilers.
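
As a concrete sketch of that kind of warm-up (types and fields below are hypothetical, and it assumes a cache-backed client.Reader is available when the hook runs), list the relevant objects once and record which addresses are already in use:

package example

import (
	"context"
	"sync"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Allocator tracks addresses already handed out in the cluster. The choice
// of Services and ExternalIPs is purely illustrative.
type Allocator struct {
	reader client.Reader

	mu        sync.Mutex
	allocated map[string]struct{}
}

// Warm builds the allocation map once, before any reconciliation runs.
func (a *Allocator) Warm(ctx context.Context) error {
	var svcs corev1.ServiceList
	if err := a.reader.List(ctx, &svcs); err != nil {
		return err
	}

	a.mu.Lock()
	defer a.mu.Unlock()
	a.allocated = make(map[string]struct{}, len(svcs.Items))
	for _, svc := range svcs.Items {
		for _, ip := range svc.Spec.ExternalIPs {
			a.allocated[ip] = struct{}{} // already in use; never hand out again
		}
	}
	return nil
}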

@alvaroaleman
Member

Right, a sync.Once doesn't work but you can make it work with a lock.

I am open to the idea of running code after leader election was won and before a controller was started but I dislike the idea of tying that to the controller.

I'd suggest that you implement something like

// LeaderElectionRunnable knows if a Runnable needs to be run in the leader election mode.
type LeaderElectionRunnable interface {
	// NeedLeaderElection returns true if the Runnable needs to be run in the leader election mode.
	// e.g. controllers need to be run in leader election mode, while webhook server doesn't.
	NeedLeaderElection() bool
}
in the manager, as the manager is the entity responsible for managing the lifecycle of components.

Note that depending on your usecase a source.SyncingSource might also be useful.
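
By analogy, such an opt-in interface on the manager side might look like the following sketch (the name and method are hypothetical, not an existing controller-runtime type):

import "context"

// PrestartRunnable is a hypothetical opt-in interface: a Runnable that
// implements it would have Prestart called by the manager after leader
// election is won, before the leader-election runnable group is started.
// Returning an error would abort manager startup.
type PrestartRunnable interface {
	Prestart(ctx context.Context) error
}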

@alvaroaleman
Member

The easiest way to achieve what you are doing here (running code before a given controller starts) would btw be to construct one through NewUnmanaged and wrap it with something that executes your code in Start() before calling the controller's Start().
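
A sketch of that wrapping approach: controller.NewUnmanaged and Manager.Add are existing APIs, while the wrapper type and its warm-up function are illustrative.

package example

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// preStartController wraps an unmanaged controller and runs warmUp inside
// Start() before delegating to the wrapped controller's own Start().
type preStartController struct {
	controller.Controller
	warmUp func(context.Context) error
}

func (p *preStartController) Start(ctx context.Context) error {
	if err := p.warmUp(ctx); err != nil {
		return err
	}
	return p.Controller.Start(ctx)
}

// setup sketches the wiring: build the controller with NewUnmanaged so the
// manager does not start it on its own, then add the wrapper as a Runnable.
func setup(mgr manager.Manager, opts controller.Options, warm func(context.Context) error) error {
	c, err := controller.NewUnmanaged("example", mgr, opts)
	if err != nil {
		return err
	}
	// c.Watch(...) would be configured here as usual.
	return mgr.Add(&preStartController{Controller: c, warmUp: warm})
}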

@terinjokes
Contributor Author

I put it in the controller because the content of a prestart hook, in my experience, is tied to the content of the reconciler. The controller is responsible for starting and maintaining watches, and gives us a spot to run between the caches being synced and the reconciler function being called.

Putting it in the manager, which I also considered, means it is running separately from a controller setting up watches and waiting for caches to synchronize, and would have to do everything by directly calling the API server. This seems like a waste if we're also trying to cache the very same resources. It would also be a waste if the controller fails to synchronize the caches: it will exit with an error.

@alvaroaleman
Member

Putting it in the manager, which I also considered, means it is running separately from a controller setting up watches and waiting for caches to synchronize, and would have to do everything by directly calling the API server.

No, it will start the caches first.

I put it in the controller because the content of a prestart hook, in my experience, is tied to the content of the reconciler. The controller is responsible for starting and maintaining watches, and gives us a spot to run between the caches being synced and the reconciler function being called.

Be that as it may for your use case, but then the next person is going to come along and want it independent of a specific controller, and then we have two options, people will ask which to use when, and I don't want that. Controller-Runtime is supposed to make the standard use case easy and the advanced use case possible. Running code in leader election before any controllers start is not possible today; doing it before a specific controller starts is, by either wrapping it or using a SyncingSource, hence no need for this.

@terinjokes
Contributor Author

No, it will start the caches first.

My understanding is that a controller's watches are started and caches are synced in the controller's Start function. Before this point no informers have been set up.

// NB(directxman12): launch the sources *before* trying to wait for the
// caches to sync so that they have a chance to register their intendeded
// caches.
for _, watch := range c.startWatches {
	c.LogConstructor(nil).Info("Starting EventSource", "source", fmt.Sprintf("%s", watch.src))
	if err := watch.src.Start(ctx, watch.handler, c.Queue, watch.predicates...); err != nil {
		return err
	}
}

// Start the SharedIndexInformer factories to begin populating the SharedIndexInformer caches
c.LogConstructor(nil).Info("Starting Controller")
for _, watch := range c.startWatches {
	syncingSource, ok := watch.src.(source.SyncingSource)
	if !ok {
		continue
	}

	if err := func() error {
		// use a context with timeout for launching sources and syncing caches.
		sourceStartCtx, cancel := context.WithTimeout(ctx, c.CacheSyncTimeout)
		defer cancel()

		// WaitForSync waits for a definitive timeout, and returns if there
		// is an error or a timeout
		if err := syncingSource.WaitForSync(sourceStartCtx); err != nil {
			err := fmt.Errorf("failed to wait for %s caches to sync: %w", c.Name, err)
			c.LogConstructor(nil).Error(err, "Could not wait for Cache to sync")
			return err
		}

		return nil
	}(); err != nil {
		return err
	}
}

While the manager has a "Caches" runnable group, which would get started before leader election controllers, I can't find anything in controller-runtime that uses it. The only two structs that implement the hasCache interface are cluster and controllerManager. In all my controllers the Caches runnable group always has an empty startQueue.

Running code in leader election before any controllers start is not possible today; doing it before a specific controller starts is, by either wrapping it or using a SyncingSource, hence no need for this.

Since the watches are started and the caches synchronized by the controller, I do not understand your proposed solution of wrapping the controller. From my understanding, it's not currently possible to run code between the watches being started and synchronized and the reconcile function being called. This pull request adds that possibility.

If I'm misunderstanding something, do you have pseudo-code to point me in the right direction as to what you mean?

we have two options and ppl will ask what to use when and I don't want that

I'm happy to rename the function and expand the Godocs to reduce confusion.

@alvaroaleman
Member

My understanding is that a controller's watches are started and caches are synced in the controller's Start function. Before this point no informers have been set up.

After the cache got started, getting any informer from it will result in an informer for that given resource being created if it doesn't exist yet. The client you get from the manager uses the cache for reading and autocreates any missing informers, meaning you can just use it after the cache was started and everything will work.

While the manager has a "Caches" runnable group, which would get started before leader election controllers, I can't find anything in controller-runtime that uses it

It is used by the cluster.Cluster, which among other things contains a cache. It gets started before leader election but won't contain any informers yet, as described above.
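
In other words, a Runnable along these lines (a sketch; the warmup type is illustrative) can simply read through the manager's cache-backed client, and the first read of a type lazily creates and syncs the informer for it:

package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// warmup is an illustrative Runnable whose reader is the manager's
// cache-backed client (e.g. mgr.GetClient()), used after the cache starts.
type warmup struct {
	reader client.Reader
}

func (w *warmup) Start(ctx context.Context) error {
	// The first List for Pods creates the Pod informer on demand and waits
	// for it to sync before answering from the cache.
	var pods corev1.PodList
	return w.reader.List(ctx, &pods)
}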

@terinjokes
Contributor Author

Okay, now I understand what you were trying to explain. I can take this pull request in two different directions now:

  1. Move this hook to the manager so that it can run before the LeaderElection runnable group, or
  2. Extend package builder to be able to construct unmanaged controllers.

Do you have a preference?

@alvaroaleman
Member

Move this hook to the manager so that it can run before the LeaderElection runnable group

That sounds fine to me

Extend package builder to be able to construct unmanaged controllers.

No. The builder is a helper that should make the standard cases easy. Making all cases possible through it would contradict that goal. You can construct an unmanaged controller without the builder.

@terinjokes
Contributor Author

I disagree that this would be a radical change to builder: the use case is already there for its tests. But I'll table it and open a proposal at a future time. It's not needed for me right now since you've selected option one.

@terinjokes
Contributor Author

terinjokes commented Nov 16, 2022

I pushed one version of this change for review. Would you prefer a new runnable group be made?

Edit: I'm not entirely happy with the current way these hooks work, so I'm going to take another go at it.

@terinjokes terinjokes force-pushed the prestart-hook branch 3 times, most recently from de6797a to 3b513d7 Compare November 17, 2022 13:20
@terinjokes
Contributor Author

This is ready for another review, when you have a moment.

pkg/manager/manager.go Outdated Show resolved Hide resolved
pkg/manager/internal.go Show resolved Hide resolved
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 27, 2022
@k8s-ci-robot k8s-ci-robot changed the title ✨ add prestart hook support ⚠️ Add ability to register runnables as pre-start hooks Apr 20, 2023
@terinjokes
Contributor Author

Taking a look at those test failures.

@terinjokes
Contributor Author

/retest

@terinjokes
Contributor Author

The test failure doesn't seem relevant to my changes, and I can't reproduce it locally. It looks to me like it's timing-based flakiness. Can I get some assistance?

✓  pkg/internal/controller (4.745s)

@erikgb
Contributor

erikgb commented May 16, 2023

I think the "test" failure is just stating that this PR contains an incompatible change:

sigs.k8s.io/controller-runtime/pkg/manager
  Incompatible changes:
  - Manager.Hook: added
  Compatible changes:
  - HookPrestartType: added
  - HookType: added
  - Options.HookTimeout: added

The Hook function is added to the Manager interface in this PR, and that is not a compatible change. If we want to merge with this change, I think this failure must be ignored.
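
For readers unfamiliar with the apidiff rule: adding a method to an exported interface breaks every implementation of that interface outside the module. A minimal, self-contained illustration (the names are hypothetical, not the actual controller-runtime types):

package example

// Manager stands in for an exported interface such as manager.Manager.
type Manager interface {
	Start() error
	// Hook(...) error // <- adding this method would break the fake below
}

// fake is an external implementation, e.g. a hand-written test double.
type fake struct{}

func (fake) Start() error { return nil }

// Compiles today; fails to compile the moment Manager gains a new method.
var _ Manager = fake{}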

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: terinjokes
Once this PR has been reviewed and has the lgtm label, please ask for approval from vincepri. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@terinjokes
Contributor Author

Rebased onto 0.15, moved Hook into a separate interface.
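
The separate-interface pattern avoids that break because the new capability is optional and discovered with a type assertion; a sketch with hypothetical names:

package example

import (
	"context"
	"errors"

	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// HookManager is a hypothetical optional interface: existing Manager
// implementations remain valid, and callers probe for the capability.
type HookManager interface {
	AddPrestartHook(hook func(context.Context) error) error
}

func registerHook(mgr manager.Manager, hook func(context.Context) error) error {
	hm, ok := mgr.(HookManager)
	if !ok {
		return errors.New("manager does not support pre-start hooks")
	}
	return hm.AddPrestartHook(hook)
}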

@sbueringer sbueringer removed this from the v0.15.x milestone Aug 4, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 20, 2024
@terinjokes
Contributor Author

Rebased changes on top of 0.17 and the tests are passing locally.

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 22, 2024
@terinjokes
Contributor Author

/retest

pkg/manager/internal.go Outdated Show resolved Hide resolved
pkg/manager/manager.go Show resolved Hide resolved
pkg/manager/manager.go Outdated Show resolved Hide resolved
pkg/manager/manager.go Outdated Show resolved Hide resolved
@terinjokes terinjokes force-pushed the prestart-hook branch 2 times, most recently from e48fcc2 to 00fb43a Compare January 29, 2024 23:49
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 29, 2024
@terinjokes
Contributor Author

I've rebased again onto the main branch.

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 6, 2024
@terinjokes terinjokes changed the title ⚠️ Add ability to register runnables as pre-start hooks ✨ Add ability to register runnables as pre-start hooks May 6, 2024
@k8s-ci-robot
Contributor

@terinjokes: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-controller-runtime-apidiff
Commit: 3ea050d
Details: link
Required: false
Rerun command: /test pull-controller-runtime-apidiff

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
