
When NodePool is updated, Karpenter fails to remove Failing NodeClaim #6098

Open
Nuru opened this issue Apr 25, 2024 · 6 comments
Assignees
Labels
feature New feature or request

Comments

@Nuru

Nuru commented Apr 25, 2024

Description

Observed Behavior:

  1. Deploy a Service Control Policy (SCP) limiting what kinds of instances can be launched
  2. Deploy Karpenter with a single NodePool allowing all instances
  3. Launch Pod compatible with NodePool but not compatible with any running nodes
  4. Karpenter creates a claim which specifies an instance type not allowed by SCP. This is not guaranteed, and not by itself an error. However, when it happens, Karpenter creates a NodeClaim that will forever fail.
  5. Get error from Karpenter: launching nodeclaim, creating instance, with fleet error(s), UnauthorizedOperation
  6. Update NodePool to align with SCP.
  7. Karpenter continues to try to launch the failing NodeClaim, and the pod that triggered it remains unschedulable.

When I ran into this behavior, it actually went like this:

  • NodePool requires arch amd64
  • Pod requires arch arm64
  • Update NodePool to allow arm64
  • Karpenter creates NodeClaim for t4.small
  • Karpenter reconciler errors out trying to launch NodeClaim: operation not allowed due to explicit deny in SCP
  • Update NodePool again with explicit list of allowed instance families
  • Karpenter reconciler continues error loop, pod continues to be unschedulable
  • I manually delete failing NodeClaim
  • Karpenter creates new NodeClaim, successfully launches allowed instance type, pod is scheduled and deployed

Expected Behavior:

When NodePool is updated, Karpenter removes any NodeClaims that are disallowed by the new NodePool requirements, and creates new NodeClaims if needed to schedule pods.

Reproduction Steps (Please include YAML):

See above

Versions:

  • Chart Version: 0.36.0
  • Kubernetes Version (kubectl version): 1.26+
  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@Nuru Nuru added bug Something isn't working needs-triage Issues that need to be triaged labels Apr 25, 2024
@engedaam
Contributor

engedaam commented Apr 30, 2024

Currently, Karpenter does not remove instance types that receive UnauthorizedOperation from its list of instance types. What we would expect in your case is that Karpenter would delete the NodeClaim after 15 minutes of failing to launch a node, and then a new NodeClaim would be created that falls within the NodePool's configured set of instance types. Did you see the NodeClaim get removed after 15 minutes? Also, since your NodeClaim never launched a node, drift would not have reacted.
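
To illustrate the behavior described here, below is a minimal Go sketch, using simplified hypothetical types rather than Karpenter's actual API, of a liveness-style check: a NodeClaim that has not launched a node within roughly 15 minutes is deleted, and the provisioning loop would then create a replacement from the NodePool's current requirements.

package main

import (
	"fmt"
	"time"
)

// NodeClaim is a simplified, hypothetical stand-in for Karpenter's NodeClaim type.
type NodeClaim struct {
	Name      string
	CreatedAt time.Time
	Launched  bool // true once a cloud instance has actually been created
}

// launchTTL approximates the ~15 minute window mentioned above.
const launchTTL = 15 * time.Minute

// reconcileLiveness deletes claims that have failed to launch within the TTL,
// returning the claims that survive. Replacement claims would then be created
// from the NodePool's current requirements by the provisioning loop (not shown).
func reconcileLiveness(claims []NodeClaim, now time.Time) []NodeClaim {
	var kept []NodeClaim
	for _, c := range claims {
		if !c.Launched && now.Sub(c.CreatedAt) > launchTTL {
			fmt.Printf("deleting NodeClaim %q: no launch after %s\n", c.Name, launchTTL)
			continue // dropped, i.e. "deleted"
		}
		kept = append(kept, c)
	}
	return kept
}

func main() {
	now := time.Now()
	claims := []NodeClaim{
		{Name: "stuck-claim", CreatedAt: now.Add(-20 * time.Minute), Launched: false},
		{Name: "healthy-claim", CreatedAt: now.Add(-20 * time.Minute), Launched: true},
	}
	fmt.Println("remaining:", reconcileLiveness(claims, now))
}

The point of the sketch is that the removal is purely time-based, which is why updating the NodePool alone does not clear a stuck claim sooner.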

@engedaam engedaam removed the needs-triage Issues that need to be triaged label May 2, 2024
@rtnpro

rtnpro commented May 3, 2024

Hey @engedaam I want to work on this issue.

@Nuru
Author

Nuru commented May 3, 2024

@engedaam No, I did not wait 15 minutes to see if the NodeClaim would be deleted as failing.

I expected that when I updated the list of acceptable instance types, any NodeClaims, failing or not, that specified instance types that were no longer allowed would be deleted immediately.

@engedaam
Contributor

engedaam commented May 3, 2024

@Nuru Karpenter currently has a check that validates a node has launched before the NodeClaim can be deleted for drift. We wait for the launch because we don't know the exact node we will get until after the instance is launched. https://github.com/kubernetes-sigs/karpenter/blob/e4abe9387198cacfece0eac53f241a3e7dd2ac76/pkg/controllers/nodeclaim/disruption/drift.go#L62
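
Paraphrased as a minimal Go sketch (simplified hypothetical types, not the linked controller code), the gate looks like this: drift is only evaluated for NodeClaims that have already launched, so a claim stuck in a launch-error loop is never marked drifted when the NodePool requirements change.

package main

import "fmt"

// Simplified, hypothetical stand-ins for Karpenter's types.
type NodeClaim struct {
	Name     string
	Launched bool              // set once an instance exists
	Labels   map[string]string // e.g. resolved instance family
}

type NodePool struct {
	AllowedFamilies map[string]bool // derived from spec requirements
}

// isDrifted mirrors the gate described above: no launch, no drift evaluation.
func isDrifted(np NodePool, nc NodeClaim) bool {
	if !nc.Launched {
		// The check returns early here, so requirement changes on the NodePool
		// never disrupt a claim that failed to launch.
		return false
	}
	return !np.AllowedFamilies[nc.Labels["karpenter.k8s.aws/instance-family"]]
}

func main() {
	np := NodePool{AllowedFamilies: map[string]bool{"c5a": true, "c6a": true, "c7a": true}}
	stuck := NodeClaim{Name: "stuck", Launched: false, Labels: map[string]string{"karpenter.k8s.aws/instance-family": "t4"}}
	fmt.Println(isDrifted(np, stuck)) // false: never launched, so never disrupted
}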

@engedaam engedaam added feature New feature or request and removed bug Something isn't working labels May 3, 2024
@jonathan-innis
Contributor

I expected that when I updated the list of acceptable instance types, any NodeClaims, failing or not, that specified instance types that were no longer allowed would be deleted immediately

@Nuru Do you know whether, if the controller role is unauthorized to launch a given instance type, that instance type still appears in DescribeInstanceTypes? I don't think there's a way for us to discover this without launching the instance, and deleting the instance could lead to us just retrying to launch the same instance over and over again in an infinite loop.

Does it work for you to update your NodePool instance types that you are selecting to match the IAM policy that you are specifying?

@Nuru
Copy link
Author

Nuru commented May 13, 2024

I expected that when I updated the list of acceptable instance types, any NodeClaims, failing or not, that specified instance types that were no longer allowed would be deleted immediately

@Nuru Do you know whether, if the controller role is unauthorized to launch a given instance type, that instance type still appears in DescribeInstanceTypes? I don't think there's a way for us to discover this without launching the instance, and deleting the instance could lead to us just retrying to launch the same instance over and over again in an infinite loop.

Does it work for you to update your NodePool instance types that you are selecting to match the IAM policy that you are specifying?

@jonathan-innis I'm not sure we are understanding each other.

When I talk about configuring the NodePool, I am talking specifically about updating the NodePool requirements:

requirements:
  - key: "karpenter.k8s.aws/instance-family"
    operator: "In"
    values:
      - c5a
      - c6a
      - c7a
      ...

What I expected to happen when I changed the NodePool configuration to require an instance family from a list, and t4 was not in the list, was that the NodeClaim for a t4.small would be marked for disruption, any nodes related to the NodeClaim shut down in an orderly fashion, and the NodeClaim deleted in under 3 minutes. Since that NodeClaim never successfully launched an instance due to the SCP, it should have just been deleted immediately in response to the requirements changing. What I am complaining about is that this did not happen; instead, the NodeClaim continued to try to launch a t4.small instance even after the change in requirements.
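
To make the expectation concrete, here is a rough Go sketch of the requested behavior (hypothetical simplified types; this is not how Karpenter currently works): on a NodePool update, any never-launched NodeClaim whose resolved instance type no longer satisfies the new requirements would be deleted immediately instead of waiting out the launch-failure timeout.

package main

import (
	"fmt"
	"strings"
)

// Simplified, hypothetical stand-ins for the real types.
type NodeClaim struct {
	Name         string
	InstanceType string // e.g. "t4.small", resolved when the claim was created
	Launched     bool
}

type NodePool struct {
	AllowedFamilies []string // from the updated requirements, e.g. c5a, c6a, c7a
}

func (np NodePool) allows(instanceType string) bool {
	family := strings.SplitN(instanceType, ".", 2)[0]
	for _, f := range np.AllowedFamilies {
		if f == family {
			return true
		}
	}
	return false
}

// reconcileOnNodePoolUpdate returns the claims that should be deleted right away:
// those that never launched and are no longer permitted by the new requirements.
func reconcileOnNodePoolUpdate(np NodePool, claims []NodeClaim) []NodeClaim {
	var toDelete []NodeClaim
	for _, c := range claims {
		if !c.Launched && !np.allows(c.InstanceType) {
			toDelete = append(toDelete, c)
		}
	}
	return toDelete
}

func main() {
	np := NodePool{AllowedFamilies: []string{"c5a", "c6a", "c7a"}}
	claims := []NodeClaim{{Name: "stuck", InstanceType: "t4.small", Launched: false}}
	for _, c := range reconcileOnNodePoolUpdate(np, claims) {
		fmt.Printf("would delete %s (%s no longer allowed)\n", c.Name, c.InstanceType)
	}
}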

To answer your first question, yes, DescribeInstanceTypes will return instance types the controller is not allowed to launch.

I believe Karpenter currently removes from its list of available instance types any instance types for which AWS reports no availability, and caches that result for 45 minutes. As a secondary feature request, I suggest you similarly cache a ban on launching any instance type for which you get a permission-denied error when trying to launch it. I would cache this ban for an hour, but only in memory, so the cache would be flushed by killing the controller pod.
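
As a rough sketch of that secondary request (hypothetical names, in-memory only, analogous to how unavailable offerings are cached; not an existing Karpenter API): an instance type that returns UnauthorizedOperation would be skipped for an hour, and restarting the controller pod would clear the ban.

package main

import (
	"fmt"
	"sync"
	"time"
)

// unauthorizedCache remembers instance types that recently failed to launch with
// UnauthorizedOperation. Purely in-memory, so restarting the controller pod flushes it.
type unauthorizedCache struct {
	mu  sync.Mutex
	ttl time.Duration
	m   map[string]time.Time // instance type -> ban expiry
}

func newUnauthorizedCache(ttl time.Duration) *unauthorizedCache {
	return &unauthorizedCache{ttl: ttl, m: map[string]time.Time{}}
}

// MarkUnauthorized records a launch denied by the SCP/IAM policy.
func (c *unauthorizedCache) MarkUnauthorized(instanceType string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[instanceType] = time.Now().Add(c.ttl)
}

// Allowed reports whether the instance type should still be offered to the scheduler.
func (c *unauthorizedCache) Allowed(instanceType string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	expiry, banned := c.m[instanceType]
	if banned && time.Now().Before(expiry) {
		return false
	}
	delete(c.m, instanceType) // ban expired or never existed
	return true
}

func main() {
	cache := newUnauthorizedCache(time.Hour)
	cache.MarkUnauthorized("t4.small")      // e.g. after a fleet UnauthorizedOperation error
	fmt.Println(cache.Allowed("t4.small"))  // false: skipped for the next hour
	fmt.Println(cache.Allowed("c6a.large")) // true
}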

@jonathan-innis I am not sure I understand what you mean by saying

I don't think there's a way for us to discover this without launching the instance and deleting the instance could lead to us just retrying to launch the same instance over and over again in an infinite loop.

The observed behavior is that the NodeClaim attempts to launch a t4.small over and over in a loop (apparently finite only due to a 15-minute timeout) and gets a "permission denied" error each time due to the SCP.

You only need to launch an instance in order to fulfill a need, and if the launch succeeds, there is no immediate need to delete the instance.

Does it work for you to update your NodePool instance types that you are selecting to match the IAM policy that you are specifying?

Yes, of course, if we got that right in the first place, everything works well. My particular concern for the future is that I would like it to suffice that I include in the requirements:

  - key: "karpenter.k8s.aws/instance-encryption-in-transit-supported"
    operator: "In"
    values: ["true"]

However, I cannot write an exactly equivalent SCP. The best I can do is write an SCP that limits the instances that can be launched by instance family, and generate that list for a given point in time and region. As new instance families are added, there will be a lag between when those new instances would be allowed by the karpenter.k8s.aws/instance-encryption-in-transit-supported label and when the SCP is updated to allow for the new instance families. It would be nice if Karpenter would be more efficient in dealing with the discrepancies in the transition period. As it is now, it blocks scale-up indefinitely if Karpenter happens to choose to launch an instance type from a family allowed by the NodePool spec but not the SCP.
