
Karpenter NodeClaim Disruption spec.nodeClassRef Drift Detection #337

Closed
njtran opened this issue May 11, 2023 · 11 comments · Fixed by #1147
Labels
kind/feature Categorizes issue or PR as related to a new feature. performance Issues relating to performance (memory usage, cpu usage, timing) v1.x Issues prioritized for post-1.0

Comments

@njtran
Contributor

njtran commented May 11, 2023

Tell us about your request

Currently, the node controller, which is responsible for annotating nodes as expired, drifted, or empty, watches node, provisioner, and pod events and enqueues reconcile requests for the linked nodes.

https://github.com/aws/karpenter-core/blob/main/pkg/controllers/node/controller.go#L110-L140

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

Since karpenter-core isn't aware of provider-specific CRDs, it doesn't track watch events for these CRDs.

We may be able to just watch v1beta1.NodeClaimReference objects.

Are you currently working around this issue?

This only affects the drift annotation. If someone changes their AWSNodeTemplate, they'll have to rely on the node heartbeat or another object event to get the node annotated. Since the heartbeat is 5 minutes, this is still happening at a good pace, but it could definitely be faster.

Additional Context

No response

Attachments

No response

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@njtran njtran added the kind/feature Categorizes issue or PR as related to a new feature. label May 11, 2023
@engedaam engedaam self-assigned this May 15, 2023
@ellistarn
Contributor

Knative has a "duck hunter" solution for this exact use case: https://github.com/knative/pkg/blob/main/apis/duck/ABOUT.md. I don't think we should build in this complexity, though.

@njtran njtran added performance Issues relating to performance (memory usage, cpu usage, timing) v1 Issues requiring resolution by the v1 milestone labels May 16, 2023
@engedaam engedaam removed their assignment Aug 9, 2023
@billrayburn billrayburn added v1.x Issues prioritized for post-1.0 and removed v1 Issues requiring resolution by the v1 milestone labels Sep 6, 2023
@jonathan-innis jonathan-innis changed the title Karpenter Node Controller ProviderRef Detection Karpenter NodeClaim Disruption spec.nodeClaimRef Detection Nov 25, 2023
@jonathan-innis jonathan-innis changed the title Karpenter NodeClaim Disruption spec.nodeClaimRef Detection Karpenter NodeClaim Disruption spec.nodeClaimRef Drift Detection Nov 25, 2023
@jonathan-innis
Member

jonathan-innis commented Nov 25, 2023

This could be interesting. It looks like the latest versions of controller-runtime support calling Build(), which returns the controller back to you so that you can dynamically add and remove structured watchers at runtime. It could be interesting to discover a NodeClaimReference and dynamically add a watcher for it.
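A minimal sketch of that shape, assuming a controller-runtime release where Controller.Watch still takes a source plus handler (the exact signatures have shifted across versions) and assuming the sigs.k8s.io/karpenter v1beta1 import path; this is illustrative, not the actual Karpenter wiring:

```go
package gvkwatch

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/manager"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
	"sigs.k8s.io/controller-runtime/pkg/source"

	corev1beta1 "sigs.k8s.io/karpenter/pkg/apis/v1beta1" // assumed import path
)

// NewNodePoolController uses Build() instead of Complete() so the caller keeps a
// handle on the controller and can register additional watches later.
func NewNodePoolController(mgr manager.Manager, r reconcile.Reconciler) (controller.Controller, error) {
	return builder.ControllerManagedBy(mgr).
		For(&corev1beta1.NodePool{}).
		Build(r)
}

// AddNodeClassWatch registers a watch for whatever GVK a NodePool's
// spec.nodeClassRef points at, using an unstructured object since karpenter-core
// doesn't know the provider's concrete type. Older two-argument Watch form shown.
func AddNodeClassWatch(c controller.Controller, gvk schema.GroupVersionKind) error {
	u := &unstructured.Unstructured{}
	u.SetGroupVersionKind(gvk)
	return c.Watch(&source.Kind{Type: u}, &handler.EnqueueRequestForObject{})
}
```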

@sadath-12
Contributor

Referencing #825 due to the similarity of the problem.

@jonathan-innis
Member

jonathan-innis commented Dec 5, 2023

Some ideas around how we might be able to handle this:

  1. We can look and see if there is a way to enable the controller to dynamically register and unregister nodeclasses with the cluster, so that anytime we get a new NodeClassRef we can just add a watcher for it to the controller. When no NodePool references a GVK anymore, we can remove the watcher for that GVK.
  2. We discover nodeclasses that exist on the cluster through CRD discovery. You could make CloudProviders put some well-known annotation on their CRDs, like karpenter.sh/nodeclass: true. When Karpenter starts up, you check whether any CRDs carry this annotation and watch all CRDs that have it (a minimal discovery sketch follows this list). Likewise, you could use the same mechanism as option 1, where you put a watcher on CRDs and then dynamically update the watch of the NodePool controller when you see another NodeClass that carries the annotation.
  3. We make CloudProviders register their GVKs through the apis directory, so you do something like apis.WellKnownNodeClasses and then watch across all of those.
  4. We can do a polling mechanism (this is definitely the easiest way to go), where you continually re-poll the EC2NodeClass to get its readiness. The only catch is that there will be some delay in propagating the EC2NodeClass readiness if it ever changes to NotReady. This might cause our scheduler to schedule to a NodeClaim that is actually going to fail when launched. That's probably fine, since we could just delete the NodeClaim and try again, but it's worth mentioning that this option might result in more failures than the others due to the delay.
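A minimal sketch of the annotation-based discovery in option 2, assuming the apiextensions.k8s.io/v1 types are registered on the client's scheme; the annotation key and function names are placeholders, not an existing Karpenter API:

```go
package nodeclass

import (
	"context"

	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// WellKnownNodeClassAnnotation is the hypothetical marker a CloudProvider would
// set on its NodeClass CRD, e.g. karpenter.sh/nodeclass: "true".
const WellKnownNodeClassAnnotation = "karpenter.sh/nodeclass"

// DiscoverNodeClassGVKs lists CRDs and returns the GVKs of every served version
// of CRDs annotated as NodeClasses, so the caller can register a watch for each.
func DiscoverNodeClassGVKs(ctx context.Context, c client.Client) ([]schema.GroupVersionKind, error) {
	crds := &apiextensionsv1.CustomResourceDefinitionList{}
	if err := c.List(ctx, crds); err != nil {
		return nil, err
	}
	var gvks []schema.GroupVersionKind
	for _, crd := range crds.Items {
		if crd.Annotations[WellKnownNodeClassAnnotation] != "true" {
			continue
		}
		for _, version := range crd.Spec.Versions {
			if !version.Served {
				continue
			}
			gvks = append(gvks, schema.GroupVersionKind{
				Group:   crd.Spec.Group,
				Version: version.Name,
				Kind:    crd.Spec.Names.Kind,
			})
		}
	}
	return gvks, nil
}
```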

@ellistarn
Contributor

I like #1, to use the refs to dynamically drive watchers.

@jonathan-innis jonathan-innis changed the title Karpenter NodeClaim Disruption spec.nodeClaimRef Drift Detection Karpenter NodeClaim Disruption spec.nodeClassRef Drift Detection Dec 5, 2023
@jonathan-innis
Member

jonathan-innis commented Dec 5, 2023

to use the refs to dynamically drive watchers

It's definitely the lowest CloudProvider burden. Would be interesting to see how we could handle this with the new dynamic watch construct that controller-runtime has put out. Basically, whenever you get a NodePool creation, you check whether the watch already exists and, if not, add it to the controller; when you get a Delete, you do the opposite. You shouldn't miss a delete while the controller is up, and when the controller is down you don't care.
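A rough sketch of that registration flow, under the same assumptions as above (older Watch signature, sigs.k8s.io/karpenter v1beta1 import path, and a spec.template.spec.nodeClassRef field carrying apiVersion/kind); the handler choice is illustrative only:

```go
package nodepool

import (
	"context"
	"sync"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
	"sigs.k8s.io/controller-runtime/pkg/source"

	corev1beta1 "sigs.k8s.io/karpenter/pkg/apis/v1beta1" // assumed import path
)

// Reconciler adds a watch the first time a NodePool referencing a given NodeClass
// GVK is reconciled. Watches are never removed, matching the limitation discussed
// below (controller-runtime has no unregister), so a stale GVK simply goes quiet.
type Reconciler struct {
	kubeClient client.Client
	controller controller.Controller

	mu      sync.Mutex
	watched map[schema.GroupVersionKind]struct{}
}

func (r *Reconciler) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
	nodePool := &corev1beta1.NodePool{}
	if err := r.kubeClient.Get(ctx, req.NamespacedName, nodePool); err != nil {
		return reconcile.Result{}, client.IgnoreNotFound(err)
	}
	ref := nodePool.Spec.Template.Spec.NodeClassRef // assumed field path
	gvk := schema.FromAPIVersionAndKind(ref.APIVersion, ref.Kind)

	r.mu.Lock()
	defer r.mu.Unlock()
	if _, ok := r.watched[gvk]; ok {
		return reconcile.Result{}, nil
	}
	u := &unstructured.Unstructured{}
	u.SetGroupVersionKind(gvk)
	// In a real controller the handler would map NodeClass events back to the
	// NodePools/NodeClaims referencing them rather than enqueue the NodeClass itself.
	if err := r.controller.Watch(&source.Kind{Type: u}, &handler.EnqueueRequestForObject{}); err != nil {
		return reconcile.Result{}, err
	}
	r.watched[gvk] = struct{}{}
	return reconcile.Result{}, nil
}
```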

@jonathan-innis
Member

Would be interesting to see how we could handle this with the new dynamic watch construct that controller-runtime has put out

Looking into this a little deeper, it doesn't look like they offer a way to unregister the watcher, so if we went this route we would be adding a watch as soon as the kind showed up, and it would exist for the lifetime of the run.

Alternatively, we could consider using a channel as a source for events that come from another controller; however, in that case we would probably end up maintaining our own register/deregister logic for informers, driven by the dynamic client, to stay aware of updates.
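For the channel approach, the consumer side could look roughly like this, again assuming an older builder API where Watches accepts a source.Source directly (newer releases use WatchesRawSource) and the karpenter v1beta1 import path; the mapper is a placeholder:

```go
package disruption

import (
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/manager"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
	"sigs.k8s.io/controller-runtime/pkg/source"

	corev1beta1 "sigs.k8s.io/karpenter/pkg/apis/v1beta1" // assumed import path
)

// Register wires a channel of NodeClass events, produced elsewhere by whatever
// component owns the dynamic informers, into the NodeClaim disruption controller.
func Register(mgr manager.Manager, r reconcile.Reconciler, nodeClassEvents chan event.GenericEvent) error {
	return builder.ControllerManagedBy(mgr).
		For(&corev1beta1.NodeClaim{}).
		Watches(&source.Channel{Source: nodeClassEvents}, handler.EnqueueRequestsFromMapFunc(mapNodeClassToNodeClaims)).
		Complete(r)
}

// mapNodeClassToNodeClaims is a placeholder: a real implementation would list the
// NodeClaims whose spec.nodeClassRef points at the changed NodeClass and enqueue them.
func mapNodeClassToNodeClaims(o client.Object) []reconcile.Request {
	return nil
}
```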

@jonathan-innis
Member

jonathan-innis commented Dec 22, 2023

More context for removing watches on the informers dynamically can be found in this proposed PR: kubernetes-sigs/controller-runtime#2159 with some of the original context for the PR found in this issue: kubernetes-sigs/controller-runtime#1884

@jonathan-innis
Member

I wrote a little POC of trying this out with a controller here: https://github.com/jonathan-innis/karpenter/blob/requeue-on-gvk/pkg/controllers/nodepool/gvk/controller.go. It surfaces a watch channel which can be consumed by other controllers, as seen here: https://github.com/jonathan-innis/karpenter/blob/requeue-on-gvk/pkg/controllers/nodeclaim/disruption/controller.go#L135
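This is not the linked POC, but a rough illustration of the producer side under the same assumptions: a dynamic informer per discovered NodeClass GVR pushing GenericEvents into the channel that the consuming controller watches:

```go
package gvk

import (
	"time"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
	"k8s.io/client-go/tools/cache"
	"sigs.k8s.io/controller-runtime/pkg/event"
)

// StartNodeClassWatch runs a dynamic informer for one NodeClass GVR and forwards
// every add/update/delete as a GenericEvent on the shared channel.
func StartNodeClassWatch(dyn dynamic.Interface, gvr schema.GroupVersionResource, events chan event.GenericEvent, stop <-chan struct{}) {
	factory := dynamicinformer.NewDynamicSharedInformerFactory(dyn, 10*time.Minute)
	informer := factory.ForResource(gvr).Informer()
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { push(events, obj) },
		UpdateFunc: func(_, obj interface{}) { push(events, obj) },
		DeleteFunc: func(obj interface{}) { push(events, obj) },
	})
	go informer.Run(stop)
}

func push(events chan event.GenericEvent, obj interface{}) {
	// Dynamic informers hand back unstructured objects; deletion tombstones are dropped here.
	if u, ok := obj.(*unstructured.Unstructured); ok {
		events <- event.GenericEvent{Object: u}
	}
}
```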

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 21, 2024
@engedaam
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 21, 2024