Karpenter won't use fallback nodePool on insufficient instance capacity #6168
Comments
Can you share the NodePool and EC2NodeClass that you are using here? Can you also share the entire set of Karpenter controller logs from the FIS simulation? Can you also share what exactly you are doing/executing during the FIS simulation (it seems like you are just ICE-ing all instance types across the single AZ)?
Hey Jonathan, thank you for looking into this. I've shared the NodePools definition in the issue description, and here are the EC2NodeClass and controller logs requested. And yes, in the FIS simulation we are terminating the Karpenter-managed nodes and ICE-ing all instance types across the eu-west-1a AZ. Also, we use a custom AMI in the EC2NodeClass, but I don't think that is what causes the issue. Looking into the logs again, it seems to me that what happens is:
When we do the same tests with
This issue has been inactive for 14 days. StaleBot will close this stale issue after 14 more days of inactivity.
Description
Observed Behavior:
We have a `default` nodePool in one availability zone (`a`) with higher weight and a second `fallback` nodePool in another availability zone (`b`) with lower weight. The scenario we test, using AWS Fault Injection Simulator, is one where all instances in the `a` AZ are terminated and new instance launches are paused.

What we observe is new nodeClaims being created that stay in a `Non-Ready` state, while nodes fail to launch with an error message like:

The concerning thing here is that Karpenter won't try to launch instances from the `fallback` nodePool even after several minutes of waiting. The messages in the log are mostly:

If we increase the `fallback` nodePool weight and bump the workload so that new nodeClaims are created, they start in the `b` AZ and get provisioned successfully, but the `a` AZ nodeClaims stay pending, as do the pods meant to start on them.

Expected Behavior:
Unsuccessful nodeClaims should time out after a couple of seconds, and new ones should be created in an alternative nodePool. More verbose log messages would also be appreciated.
Reproduction Steps (Please include YAML):
NodePools
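A minimal sketch of how such a weighted pair can be expressed in the v1beta1 API used by Karpenter 0.35.x (the exact weights, requirements, and `nodeClassRef` name here are assumptions, not the precise manifests used):

```yaml
# Sketch only: two NodePools sharing one EC2NodeClass. Karpenter tries the
# higher-weight "default" pool first and should fall back to "fallback".
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  weight: 100            # higher weight: considered first
  template:
    spec:
      nodeClassRef:
        name: default    # assumed EC2NodeClass name
      requirements:
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["eu-west-1a"]
---
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: fallback
spec:
  weight: 10             # lower weight: only used when "default" cannot satisfy pods
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["eu-west-1b"]
```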
Workload
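The workload can be any deployment large enough to force a scale-up; a typical stand-in (the name, replica count, image, and requests are assumptions) is the stock `inflate`-style deployment:

```yaml
# Sketch only: pods with CPU requests that cannot fit on existing nodes,
# forcing Karpenter to create new nodeClaims.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  replicas: 20
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
          resources:
            requests:
              cpu: "1"
```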
FIS template
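For reference, an FIS experiment that injects `InsufficientInstanceCapacity` errors for the Karpenter controller's IAM role can be sketched roughly as below, in `aws fis create-experiment-template --cli-input-yaml` form. The action ID, parameter names, and ARNs are assumptions and should be verified against the current FIS documentation:

```yaml
# Rough sketch of an FIS experiment template (not the exact one used).
description: Inject InsufficientInstanceCapacity errors in eu-west-1a
roleArn: arn:aws:iam::111122223333:role/fis-experiment-role   # placeholder
stopConditions:
  - source: none
targets:
  karpenter-role:
    resourceType: aws:iam:role
    resourceArns:
      - arn:aws:iam::111122223333:role/karpenter-controller   # placeholder
    selectionMode: ALL
actions:
  ice-all-launches:
    # Assumed FIS action for EC2 API error injection; check the FIS docs.
    actionId: aws:ec2:api-insufficient-instance-capacity-error
    parameters:
      duration: PT30M
      availabilityZoneIdentifiers: eu-west-1a
      percentage: "100"
    targets:
      Roles: karpenter-role
```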
Versions:
- Chart Version: 0.35.0
- Kubernetes Version (`kubectl version`): 1.22.17