
When NodePool is updated, Karpenter fails to remove Failing NodeClaim #6098

Open
Nuru opened this issue Apr 25, 2024 · 6 comments
Assignees
Labels
feature New feature or request

Comments

@Nuru

Nuru commented Apr 25, 2024

Description

Observed Behavior:

  1. Deploy a Service Control Policy (SCP) limiting what kinds of instances can be launched
  2. Deploy Karpenter with a single NodePool allowing all instances
  3. Launch Pod compatible with NodePool but not compatible with any running nodes
  4. Karpenter creates a claim which specifies an instance type not allowed by SCP. This is not guaranteed, and not by itself an error. However, when it happens, Karpenter creates a NodeClaim that will forever fail.
  5. Get error from Karpenter: launching nodeclaim, creating instance, with fleet error(s), UnauthorizedOperation
  6. Update NodePool to align with SCP.
  7. Karpenter continues to try to launch the failing NodeClaim, and the pod that triggered it remains unschedulable.

When I ran into this behavior, it actually went like this:

  • NodePool requires arch amd64
  • Pod requires arch arm64
  • Update NodePool to allow arm64
  • Karpenter creates NodeClaim for t4.small
  • Karpenter reconciler errors out trying to launch NodeClaim: operation not allowed due to explicit deny in SCP
  • Update NodePool again with explicit list of allowed instance families
  • Karpenter reconciler continues error loop, pod continues to be unschedulable
  • I manually delete failing NodeClaim
  • Karpenter creates new NodeClaim, successfully launches allowed instance type, pod is scheduled and deployed

Expected Behavior:

When NodePool is updated, Karpenter removes any NodeClaims that are disallowed by the new NodePool requirements, and creates new NodeClaims if needed to schedule pods.

Reproduction Steps (Please include YAML):

See above

Versions:

  • Chart Version: 0.36.0
  • Kubernetes Version (kubectl version): 1.26+
  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@Nuru Nuru added bug Something isn't working needs-triage Issues that need to be triaged labels Apr 25, 2024
@engedaam
Contributor

engedaam commented Apr 30, 2024

Currently, Karpenter does not remove instance types that receive UnauthorizedOperation from its list of instance types. What we would expect in your case is that Karpenter would delete the NodeClaim after 15 minutes of failing to launch a node, and then a new NodeClaim would be created that falls within the NodePool's configured set of instance types. Did you see the NodeClaim get removed after 15 minutes? Also, since your NodeClaim never launched a node, drift would not have reacted.
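
To illustrate the behavior described here, below is a minimal Go sketch, using simplified hypothetical types rather than Karpenter's actual API, of a liveness-style check: a NodeClaim that has not launched a node within roughly 15 minutes is deleted, and the provisioning loop would then create a replacement from the NodePool's current requirements.

package main

import (
	"fmt"
	"time"
)

// NodeClaim is a simplified, hypothetical stand-in for Karpenter's NodeClaim type.
type NodeClaim struct {
	Name      string
	CreatedAt time.Time
	Launched  bool // true once a cloud instance has actually been created
}

// launchTTL approximates the ~15 minute window mentioned above.
const launchTTL = 15 * time.Minute

// reconcileLiveness deletes claims that have failed to launch within the TTL,
// returning the claims that survive. Replacement claims would then be created
// from the NodePool's current requirements by the provisioning loop (not shown).
func reconcileLiveness(claims []NodeClaim, now time.Time) []NodeClaim {
	var kept []NodeClaim
	for _, c := range claims {
		if !c.Launched && now.Sub(c.CreatedAt) > launchTTL {
			fmt.Printf("deleting NodeClaim %q: no launch after %s\n", c.Name, launchTTL)
			continue // dropped, i.e. "deleted"
		}
		kept = append(kept, c)
	}
	return kept
}

func main() {
	now := time.Now()
	claims := []NodeClaim{
		{Name: "stuck-claim", CreatedAt: now.Add(-20 * time.Minute), Launched: false},
		{Name: "healthy-claim", CreatedAt: now.Add(-20 * time.Minute), Launched: true},
	}
	fmt.Println("remaining:", reconcileLiveness(claims, now))
}

The point of the sketch is that the removal is purely time-based, which is why updating the NodePool alone does not clear a stuck claim sooner.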

@engedaam engedaam removed the needs-triage Issues that need to be triaged label May 2, 2024
@rtnpro

rtnpro commented May 3, 2024

Hey @engedaam I want to work on this issue.

@Nuru
Author

Nuru commented May 3, 2024

@engedaam No, I did not wait 15 minutes to see if the NodeClaim would be deleted as failing.

I expected that when I updated the list of acceptable instance types, any NodeClaims, failing or not, that specified instance types that were no longer allowed would be deleted immediately.

@engedaam
Contributor

engedaam commented May 3, 2024

@Nuru Karpenter currently has a check that validates a node has launched before the NodeClaim can be deleted for drift. We wait for the launch because we don't know the exact node we will get until after the instance is launched. https://github.com/kubernetes-sigs/karpenter/blob/e4abe9387198cacfece0eac53f241a3e7dd2ac76/pkg/controllers/nodeclaim/disruption/drift.go#L62
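
Paraphrased as a minimal Go sketch (simplified hypothetical types, not the linked controller code), the gate looks like this: drift is only evaluated for NodeClaims that have already launched, so a claim stuck in a launch-error loop is never marked drifted when the NodePool requirements change.

package main

import "fmt"

// Simplified, hypothetical stand-ins for Karpenter's types.
type NodeClaim struct {
	Name     string
	Launched bool              // set once an instance exists
	Labels   map[string]string // e.g. resolved instance family
}

type NodePool struct {
	AllowedFamilies map[string]bool // derived from spec requirements
}

// isDrifted mirrors the gate described above: no launch, no drift evaluation.
func isDrifted(np NodePool, nc NodeClaim) bool {
	if !nc.Launched {
		// The check returns early here, so requirement changes on the NodePool
		// never disrupt a claim that failed to launch.
		return false
	}
	return !np.AllowedFamilies[nc.Labels["karpenter.k8s.aws/instance-family"]]
}

func main() {
	np := NodePool{AllowedFamilies: map[string]bool{"c5a": true, "c6a": true, "c7a": true}}
	stuck := NodeClaim{Name: "stuck", Launched: false, Labels: map[string]string{"karpenter.k8s.aws/instance-family": "t4"}}
	fmt.Println(isDrifted(np, stuck)) // false: never launched, so never disrupted
}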

@engedaam engedaam added feature New feature or request and removed bug Something isn't working labels May 3, 2024
@jonathan-innis
Contributor

I expected that when I updated the list of acceptable instance types, any NodeClaims, failing or not, that specified instance types that were no longer allowed would be deleted immediately

@Nuru Do you know whether, if the controller role is unauthorized to launch a given instance type, that instance type still appears in DescribeInstanceTypes? I don't think there's a way for us to discover this without launching the instance, and deleting the instance could lead to us just retrying to launch the same instance over and over again in an infinite loop.

Does it work for you to update your NodePool instance types that you are selecting to match the IAM policy that you are specifying?

@Nuru
Copy link
Author

Nuru commented May 13, 2024

I expected that when I updated the list of acceptable instance types, any NodeClaims, failing or not, that specified instance types that were no longer allowed would be deleted immediately

@Nuru Do you know whether, if the controller role is unauthorized to launch a given instance type, that instance type still appears in DescribeInstanceTypes? I don't think there's a way for us to discover this without launching the instance, and deleting the instance could lead to us just retrying to launch the same instance over and over again in an infinite loop.

Does it work for you to update your NodePool instance types that you are selecting to match the IAM policy that you are specifying?

@jonathan-innis I'm not sure we are understanding each other.

When I talk about configuring the NodePool, I am talking specifically about updating the NodePool requirements:

requirements:
  - key: "karpenter.k8s.aws/instance-family"
    operator: "In"
    values:
      - c5a
      - c6a
      - c7a
      ...

What I expected to happen when I changed the NodePool configuration to require an instance family from a list, and t4 was not in the list, was that the NodeClaim for a t4.small would be marked for disruption, any nodes related to the NodeClaim shut down in an orderly fashion, and the NodeClaim deleted in under 3 minutes. Since that NodeClaim never successfully launched an instance due to the SCP, it should have just been deleted immediately in response to the requirements changing. What I am complaining about is that this did not happen; instead, the NodeClaim continued to try to launch a t4.small instance even after the change in requirements.
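
To make the expectation concrete, here is a rough Go sketch of the requested behavior (hypothetical simplified types; this is not how Karpenter currently works): on a NodePool update, any never-launched NodeClaim whose resolved instance type no longer satisfies the new requirements would be deleted immediately instead of waiting out the launch-failure timeout.

package main

import (
	"fmt"
	"strings"
)

// Simplified, hypothetical stand-ins for the real types.
type NodeClaim struct {
	Name         string
	InstanceType string // e.g. "t4.small", resolved when the claim was created
	Launched     bool
}

type NodePool struct {
	AllowedFamilies []string // from the updated requirements, e.g. c5a, c6a, c7a
}

func (np NodePool) allows(instanceType string) bool {
	family := strings.SplitN(instanceType, ".", 2)[0]
	for _, f := range np.AllowedFamilies {
		if f == family {
			return true
		}
	}
	return false
}

// reconcileOnNodePoolUpdate returns the claims that should be deleted right away:
// those that never launched and are no longer permitted by the new requirements.
func reconcileOnNodePoolUpdate(np NodePool, claims []NodeClaim) []NodeClaim {
	var toDelete []NodeClaim
	for _, c := range claims {
		if !c.Launched && !np.allows(c.InstanceType) {
			toDelete = append(toDelete, c)
		}
	}
	return toDelete
}

func main() {
	np := NodePool{AllowedFamilies: []string{"c5a", "c6a", "c7a"}}
	claims := []NodeClaim{{Name: "stuck", InstanceType: "t4.small", Launched: false}}
	for _, c := range reconcileOnNodePoolUpdate(np, claims) {
		fmt.Printf("would delete %s (%s no longer allowed)\n", c.Name, c.InstanceType)
	}
}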

To answer your first question, yes, DescribeInstanceTypes will return instance types the controller is not allowed to launch.

I believe Karpenter currently removes from its list of available instance types any instance types for which AWS reports no availability, and caches that result for 45 minutes. As a secondary feature request, I suggest you similarly cache a ban on launching any instance type for which you get a permission-denied error when trying to launch it. I would cache this ban for an hour, but only in memory, so the cache would be flushed by killing the controller pod.
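
As a rough sketch of that secondary request (hypothetical names, in-memory only, analogous to how unavailable offerings are cached; not an existing Karpenter API): an instance type that returns UnauthorizedOperation would be skipped for an hour, and restarting the controller pod would clear the ban.

package main

import (
	"fmt"
	"sync"
	"time"
)

// unauthorizedCache remembers instance types that recently failed to launch with
// UnauthorizedOperation. Purely in-memory, so restarting the controller pod flushes it.
type unauthorizedCache struct {
	mu  sync.Mutex
	ttl time.Duration
	m   map[string]time.Time // instance type -> ban expiry
}

func newUnauthorizedCache(ttl time.Duration) *unauthorizedCache {
	return &unauthorizedCache{ttl: ttl, m: map[string]time.Time{}}
}

// MarkUnauthorized records a launch denied by the SCP/IAM policy.
func (c *unauthorizedCache) MarkUnauthorized(instanceType string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[instanceType] = time.Now().Add(c.ttl)
}

// Allowed reports whether the instance type should still be offered to the scheduler.
func (c *unauthorizedCache) Allowed(instanceType string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	expiry, banned := c.m[instanceType]
	if banned && time.Now().Before(expiry) {
		return false
	}
	delete(c.m, instanceType) // ban expired or never existed
	return true
}

func main() {
	cache := newUnauthorizedCache(time.Hour)
	cache.MarkUnauthorized("t4.small")      // e.g. after a fleet UnauthorizedOperation error
	fmt.Println(cache.Allowed("t4.small"))  // false: skipped for the next hour
	fmt.Println(cache.Allowed("c6a.large")) // true
}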

@jonathan-innis I am not sure I understand what you mean by saying

I don't think there's a way for us to discover this without launching the instance and deleting the instance could lead to us just retrying to launch the same instance over and over again in an infinite loop.

The observed behavior is that the NodeClaim attempts to launch a t4.small over and over in a loop (apparently finite only due to a 15-minute timeout) and gets a "permission denied" error each time due to the SCP.

You only need to launch an instance in order to fulfill a need, and if the launch succeeds, there is no immediate need to delete the instance.

Does it work for you to update your NodePool instance types that you are selecting to match the IAM policy that you are specifying?

Yes, of course, if we got that right in the first place, everything works well. My particular concern for the future is that I would like it to suffice that I include in the requirements:

  - key: "karpenter.k8s.aws/instance-encryption-in-transit-supported"
    operator: "In"
    values: ["true"]

However, I cannot write an exactly equivalent SCP. The best I can do is write an SCP that limits the instances that can be launched by instance family, and generate that list for a given point in time and region. As new instance families are added, there will be a lag between when those new instances would be allowed by the karpenter.k8s.aws/instance-encryption-in-transit-supported label and when the SCP is updated to allow for the new instance families. It would be nice if Karpenter would be more efficient in dealing with the discrepancies in the transition period. As it is now, it blocks scale-up indefinitely if Karpenter happens to choose to launch an instance type from a family allowed by the NodePool spec but not the SCP.
