Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use a single cache for all dynamic controllers (i.e. XRs and claims) #5651

Merged
merged 10 commits into from May 21, 2024

Conversation

negz
Copy link
Member

@negz negz commented May 7, 2024

Description of your changes

Fixes #5338
Fixes #2645
Closes #5463
Closes #5468
Closes #5422

It's possible this will fix #5400, #5151, #5533, and #5228 given that it makes large changes to how realtime compositions work. I'd rather not assume though - we can see if folks can reproduce if/when this PR is merged.

Crossplane uses a "controller engine" to dynamically start claim and XR controllers when a new XRD is installed, and stop them when the XRD is uninstalled.

Before this PR, each controller gets at least one cache. This is because when I built the controller engine, you couldn't stop a single informer within a cache (a cache is basically a map of informers by GVK). Instead, we stop entire caches.

When realtime composition is enabled, there are even more caches. One per composed resource GVK. A GVK routed cache routes cache lookups to these various delegate caches.

Meanwhile, controller-runtime recently made it possible to stop an informer within a cache. It's also been possible to remove an event handler from an informer for some time (since Kubernetes 1.26):

This PR uses a single client, backed by a single cache, across all dynamic controllers (specifically the definition, offered, claim, and XR controllers). Each controller tracks what watches ("sources" in controller-runtime) it has active and stops them when it doesn't need them anymore. The controller engine also garbage collects custom resource informers when their CRDs are deleted.

Compared to the current implementation, this:

  • Takes fewer global locks when realtime compositions are enabled. Locking is now mostly at the controller scope.
  • Works with the breaking changes to source.Source introduced in controller-runtime v0.18. :)

I also think this makes the realtime composition code a little easier to follow by consolodating it into the ControllerEngine (and into one repo), but that's pretty subjective.

Here's a questionable Mermaid diagram of how it works:

graph TD
	subgraph ITC[engine.InformerTrackingCache]
		I1[XFoo Informer]
		I2[XBar Informer]
		I3[ComposedResource Informer]
	end
	
	subgraph CE[engine.ControllerEngine]
		subgraph C1[XFoo Controller]
			R1[XFoo Reconciler] -- Get --> I1
			R1 -- Get --> I3
			S1[XFoo Source] -- Watch --> I1
			S1 -- Enqueue --> R1
			SC1[ComposedResource Source] -- Watch --> I3
			SC1 -- Enqueue --> R1
		end

		subgraph C2[XBar Controller]
			R2[XBar Reconciler] -- Get --> I2
			R2 -- Get --> I2
			S2[XFoo Source] -- Watch --> I2
			S2 -- Enqueue --> R2
			SC2[ComposedResource Source] -- Watch --> I3
			SC2 -- Enqueue --> R2
		end
	end

	subgraph C[cache.Cache]
	  I4[XRD Informer]
	end

	subgraph M[manager.Manager]
		subgraph C3[XRD Controller]
			R3[XRD Reconciler] -- Start --> C1
			R3 -- Start --> C2
			R3 -- Get --> I4
			S3[XRD Source] -- Watch --> I4
			S3 -- Enqueue --> R3
		end
	end

It's conceptually pretty similar to a typical set of controller-runtime controllers run by a controller.Manager, but with the ability to stop watches and informers.

Reviewer Note

This PR depends on crossplane/crossplane-runtime#689.

Unfortunately, this required bumping a bunch of dependencies, mostly to get the latest controller-runtime. Those dependency bumps in turn required a bunch of other little fixes (CRD generation issues, new linter warnings, etc).

I also opted to move the ControllerEngine implementation from c/cr to c/c. I think that will make changes like this easier in future.

While this all makes this PR pretty huge, it's all broken out into in separate commits. The last commit in the PR contains the bulk of the changes - reviewers can focus on that.

I'm opening this as draft since it's a big change. I still need to add more tests for the new code (e.g. the new engine implementation).

I have:

Need help with this checklist? See the cheat sheet.

@negz negz force-pushed the engine-swap branch 4 times, most recently from f2f7122 to ab57920 Compare May 7, 2024 01:14
@negz
Copy link
Member Author

negz commented May 7, 2024

CC @sttts - this makes big changes to the realtime compositions implementation.

internal/controller/apiextensions/composite/reconciler.go Outdated Show resolved Hide resolved
internal/controller/apiextensions/composite/watch/watch.go Outdated Show resolved Hide resolved
internal/controller/engine/source.go Outdated Show resolved Hide resolved
go.mod Outdated Show resolved Hide resolved
cmd/crossplane/core/core.go Outdated Show resolved Hide resolved
internal/controller/engine/engine.go Outdated Show resolved Hide resolved
internal/controller/engine/engine.go Outdated Show resolved Hide resolved
Makefile Outdated Show resolved Hide resolved
@negz negz force-pushed the engine-swap branch 2 times, most recently from c590da9 to 0ba7db9 Compare May 8, 2024 05:35
@negz
Copy link
Member Author

negz commented May 8, 2024

Looks like the E2E tests are passing pretty reliably now. The only thing blocking moving this out of draft is writing some more unit tests.

@turkenh
Copy link
Member

turkenh commented May 8, 2024

I also opted to move the ControllerEngine implementation from c/cr to c/c. I think that will make changes like this easier in future.

I am working on implementing watches in provider-kubernetes, and the PR is based on the previous real-time composition implementation, which is being refactored here. While the provider-kubernetes use case is more straightforward (no controller start/stop at runtime), I believe the new structure is more usable with the unified/shared cache then the previous implementation. When I check the functionality introduced here, I have a feeling like I could leverage things like TrackingInformers/InformerTrackingCache and StoppableSource, and maybe even Controller Engine with a single controller (TBD) 🤔 So, I would need them to be importable, at least not under internal if not back to crossplane-runtime 🙏

go.mod Outdated Show resolved Hide resolved
@negz
Copy link
Member Author

negz commented May 8, 2024

So, I would need them to be importable

@turkenh How would you feel about duplicating them to start with? We could then consolidate later.

I'm hopeful we can get the tracking cache and stoppable source functionality added upstream to controller-runtime. If that happens I could imagine a world where:

  • ControllerEngine is only used by c/c, lives in c/c (internal)
  • Provider-kubernetes also uses tracking cache and stoppable sources, which live in controller-runtime

@negz negz requested review from phisco and a team as code owners May 14, 2024 01:21
@negz negz requested a review from bobh66 May 14, 2024 01:21
@negz
Copy link
Member Author

negz commented May 14, 2024

I've moved this out of draft. It should be ready for final review.

@sttts Do you want to do another pass?
@turkenh Would you be interested in reviewing/approving on behalf of maintainers?

The history is very detailed - I'm leaving it as is to aid with review. I'll clean it up and squash it before merging.

@negz negz requested review from turkenh and sttts May 15, 2024 18:00
Makefile Outdated Show resolved Hide resolved
internal/engine/source.go Outdated Show resolved Hide resolved
cmd/crossplane/core/core.go Show resolved Hide resolved
internal/controller/apiextensions/composite/reconciler.go Outdated Show resolved Hide resolved
internal/controller/apiextensions/composite/watch/watch.go Outdated Show resolved Hide resolved
cmd/crossplane/core/core.go Show resolved Hide resolved
@negz
Copy link
Member Author

negz commented May 17, 2024

@turkenh Thanks for approving! I can't proceed until crossplane/crossplane-runtime#689 is also approved. Could you take a look? It's pretty much just deleting code.

@negz negz force-pushed the engine-swap branch 2 times, most recently from a8da7ab to 24b03c1 Compare May 20, 2024 21:06
negz added 10 commits May 20, 2024 15:37
This just moved the files, unedited, as of the below commit.

crossplane/crossplane-runtime@8641eb2

Signed-off-by: Nic Cope <nicc@rk0n.org>
Signed-off-by: Nic Cope <nicc@rk0n.org>
This bumps crossplane-runtime, controller-runtime, and k8s.io
dependencies to latest.

Per the below PR, the latest crossplane-runtime doesn't have the
controller engine anymore. It moved into c/c.

crossplane/crossplane-runtime#689

Signed-off-by: Nic Cope <nicc@rk0n.org>
Updating our controller-runtime and Kubernetes dependencies bumped our
minimum Go version to v1.22. That in turn enables some new linters,
since we no longer need to copy range vars in Go v1.22.

Signed-off-by: Nic Cope <nicc@rk0n.org>
Signed-off-by: Nic Cope <nicc@rk0n.org>
It started as a function, but now we pass several arguments that are all
fields of the Reconciler. It's only called once, by the Reconciler.
Making it a method shortens the function signature, and makes it clear
which things change on each reconcile and which are fixed.

Signed-off-by: Nic Cope <nicc@rk0n.org>
Signed-off-by: Nic Cope <nicc@rk0n.org>
Signed-off-by: Nic Cope <nicc@rk0n.org>
Crossplane uses a controller engine to dynamically start claim and XR
controllers when a new XRD is installed.

Before this commit, each controller gets at least one cache. This is
because when I built this functionality, you couldn't stop a single
informer within a cache (a cache is basically a map of informers by
GVK).

When realtime composition is enabled, there are even more caches. One
per composed resource GVK. A GVK routed cache routes cache lookups to
these various delegate caches.

Meanwhile, controller-runtime recently made it possible to stop an
informer within a cache. It's also been possible to remove an event
handler from an informer for some time (since Kubernetes 1.26).

kubernetes-sigs/controller-runtime#2285
kubernetes-sigs/controller-runtime#2046

This commit uses a single client, backed by a single cache, across all
dynamic controllers (specifically the definition, offered, claim, and
XR controllers).

Compared to the current implementation, this commit:

* Takes fewer global locks when realtime compositions are enabled.
  Locking is now mostly at the controller scope.
* Works with the breaking changes to source.Source introduced in
  controller-runtime v0.18. :)

I think this makes the realtime composition code a little easier to
follow by consolodating it into the ControllerEngine, but that's
pretty subjective.

Signed-off-by: Nic Cope <nicc@rk0n.org>
// RemoveInformer removes an informer entry and stops it if it was running.
//
// Removing an informer marks the informer as inactive.
func (c *InformerTrackingCache) RemoveInformer(ctx context.Context, obj client.Object) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: a smoke (unit) test might help to detect when the cache.Cache interface changes. The overridden methods are must be exhaustive for a given gvk. So there is risk that this breaks and will be hard to debug.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another approach that could work would be to stop embedding cache.Cache. Instead wrap one and implement all of its methods. That way we'll stop satisfying the interface if it changes. Will follow up in another PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants