When NodePool is updated, Karpenter fails to remove Failing NodeClaim #6098
Comments
Currently, Karpenter does not remove instance types that receive UnauthorizedOperation errors at launch.
Hey @engedaam I want to work on this issue.
@engedaam No, I did not wait 15 minutes to see if the NodeClaim would be deleted as failing. I expected that when I updated the list of acceptable instance types, any NodeClaims, failing or not, that specified instance types that were no longer allowed would be deleted immediately.
@Nuru Karpenter currently has a check to validate that a node has first launched before deleting the NodeClaim. We wait for the launch because we don't know the exact node we will get; that is only known after the instance is launched. https://github.com/kubernetes-sigs/karpenter/blob/e4abe9387198cacfece0eac53f241a3e7dd2ac76/pkg/controllers/nodeclaim/disruption/drift.go#L62
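A minimal Go sketch of the gate described above, for readers following along: drift handling that skips any NodeClaim that has not yet launched. The type and function names here are hypothetical illustrations, not Karpenter's actual code.

```go
package main

import "fmt"

// NodeClaim is an illustrative stand-in for Karpenter's NodeClaim.
type NodeClaim struct {
	Launched bool // set once the cloud provider returns a concrete instance
	Drifted  bool
}

// markDriftedIfLaunched mirrors the check referenced in drift.go: only
// claims that have already launched are eligible for drift-based removal,
// because the exact instance is unknown before launch.
func markDriftedIfLaunched(nc *NodeClaim, requirementsStillMatch bool) {
	if !nc.Launched {
		return // never launched, so never marked drifted
	}
	if !requirementsStillMatch {
		nc.Drifted = true
	}
}

func main() {
	nc := &NodeClaim{Launched: false}
	markDriftedIfLaunched(nc, false) // requirements no longer match
	fmt.Println(nc.Drifted)          // false: the claim never launched
}
```

Under this model, a NodeClaim that repeatedly fails to launch is never marked drifted, which is exactly the behavior the issue reports.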
@Nuru Do you know whether, if the controller role were unauthorized to launch a given instance type, that instance type would still appear in the DescribeInstanceTypes response? I don't think there's a way for us to discover this without launching the instance, and deleting the instance could lead to us retrying the same launch over and over in an infinite loop. Does it work for you to update the instance types your NodePool selects to match the IAM policy you are specifying?
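As an illustration of that suggestion, keeping the NodePool's selectable types to the intersection of what DescribeInstanceTypes returns and what the IAM policy permits, here is a minimal hypothetical Go sketch; the function and sample data are assumptions, not Karpenter code.

```go
package main

import "fmt"

// intersect keeps only the instance types that appear in both lists,
// since DescribeInstanceTypes alone cannot reveal launch authorization.
func intersect(described, permittedByIAM []string) []string {
	permitted := map[string]bool{}
	for _, t := range permittedByIAM {
		permitted[t] = true
	}
	var out []string
	for _, t := range described {
		if permitted[t] {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	described := []string{"c5a.large", "c6a.large", "t4g.small"}
	iam := []string{"c5a.large", "c6a.large"} // what the policy allows
	fmt.Println(intersect(described, iam))    // [c5a.large c6a.large]
}
```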
@jonathan-innis I'm not sure we are understanding each other. When I talk about configuring the NodePool, I am talking specifically about updating the NodePool requirements:

```yaml
requirements:
  - key: "karpenter.k8s.aws/instance-family"
    operator: "In"
    values:
      - c5a
      - c6a
      - c7a
  # ...
```
What I expected to happen when I changed the NodePool configuration to require an instance family from a list, and an existing NodeClaim no longer satisfied that requirement, was that the NodeClaim would be removed.

To answer your first question: yes, DescribeInstanceTypes will return instance types the controller is not allowed to launch. I believe Karpenter currently removes from its list of available instance types any for which AWS reports no availability, and caches this result for 45 minutes. As a secondary feature request, I suggest you similarly cache a ban on launching any instance type for which you get a permission-denied error when trying to launch it. I would cache this ban for an hour, but only in memory, so the cache would be flushed by killing the controller pod (see the sketch after this exchange).

@jonathan-innis I am not sure I understand what you mean by saying "The observed behavior is that the NodeClaim attempts to launch a …". You only need to launch an instance in order to fulfill a need, and if the launch succeeds, there is no immediate need to delete the instance.
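A minimal in-memory TTL ban cache along the lines suggested above: instance types that fail with a permission error are banned for an hour, and the ban evaporates when the controller pod restarts. Everything here is a hypothetical sketch, not Karpenter's implementation.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type banCache struct {
	mu    sync.Mutex
	until map[string]time.Time // instance type -> ban expiry
	ttl   time.Duration
}

func newBanCache(ttl time.Duration) *banCache {
	return &banCache{until: map[string]time.Time{}, ttl: ttl}
}

// Ban records a launch failure (e.g. UnauthorizedOperation) for a type.
func (c *banCache) Ban(instanceType string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.until[instanceType] = time.Now().Add(c.ttl)
}

// Banned reports whether the type is still within its ban window.
func (c *banCache) Banned(instanceType string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	deadline, ok := c.until[instanceType]
	if !ok {
		return false
	}
	if time.Now().After(deadline) {
		delete(c.until, instanceType) // expired; allow retries again
		return false
	}
	return true
}

func main() {
	cache := newBanCache(time.Hour) // one-hour ban, memory only
	cache.Ban("t4g.small")
	fmt.Println(cache.Banned("t4g.small")) // true until the hour elapses
}
```

Because the state lives only in process memory, restarting the controller pod flushes all bans, matching the flush behavior the commenter asks for.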
Yes, of course, if we got that right in the first place, everything works well. My particular concern for the future is that I would like it to suffice to include in the requirements:

```yaml
- key: "karpenter.k8s.aws/instance-encryption-in-transit-supported"
  operator: "In"
  values: ["true"]
```

However, I cannot write an exactly equivalent SCP. The best I can do is write an SCP that limits the instances that can be launched by instance family, and generate that family list for a given point in time and region. As new instance families are added, there will be a lag between when those new instances would be allowed by the NodePool requirements and when the SCP is updated to permit them.
Description
Observed Behavior:
launching nodeclaim, creating instance, with fleet error(s), UnauthorizedOperation
When I ran into this behavior, it actually went like this:
- amd64
- arm64
- arm64
- t4.small
Expected Behavior:
When NodePool is updated, Karpenter removes any NodeClaims that are disallowed by the new NodePool requirements, and creates new NodeClaims if needed to schedule pods.
Reproduction Steps (Please include YAML):
See above
Versions:
- Kubernetes Version (kubectl version): 1.26+