The default anti-affinity rule could cause all Karpenter pods to get stuck in a Pending state in some edge cases #6160
Comments
Helm allows you to override specific keys in the values map. So you should (for instance) be able to override the default affinity values.
Yes, we're pretty cautious about allowing users to create a circular dependency on themselves today. There are definitely people that are bootstrapping Karpenter and then allowing Karpenter to provision capacity for itself and then remove the underlying, original capacity, but it can definitely get dangerous. Karpenter is going to disrupt its own nodes and if both of those nodes go bad at once, you're out of luck because Karpenter is not going to be able to restart to spin up more capacity -- and now you are taking an outage.
Yeah, I think the right way to solve this is to ensure the MNG that you are using here is fully isolated so that you don't get into this state. High priority, its own taints and nodeSelectors, etc. to ensure that Karpenter is always running.
We do have a PodDisruptionBudget set for Karpenter, you can find it here. Did you not see it work when the Karpenter pods got evicted?
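For context, a PodDisruptionBudget of the kind being referred to would look roughly like this (a sketch, not the chart's exact manifest; the name and label are assumptions):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: karpenter                          # assumed release name
spec:
  maxUnavailable: 1                        # allow at most one pod to be voluntarily evicted at a time
  selector:
    matchLabels:
      app.kubernetes.io/name: karpenter    # assumed pod label
```

Note that a PDB only guards against voluntary evictions (drains, descheduling); it does not protect against scheduler preemption by higher-priority pods, which is relevant to the scenario discussed below.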
@jonathan-innis for whatever reason, the PDB doesn't help in that case. As for the default nodeAffinity, remember that it's a map, and map values can't be merged with a different config via Helm overrides. We used cluster-autoscaler before; it also ran on nodes it managed, and we didn't see any issues. Does that mean that for Karpenter we need a group of nodes that is not managed by it and is dedicated to running Karpenter? Sure, that can be done by adding taints, but it increases the maintenance cost, and we'd always have to babysit this separate node group (keep it updated, etc.). Since we have a PDB, allowing Karpenter to run on the nodes it manages shouldn't cause the issue you mentioned.
I'm able to get it to work. Just as an example, here's the set of `--set` flags:

```shell
--set affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution=null \
--set affinity.nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].weight=100 \
--set affinity.nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].preference.matchExpressions[0].key="karpenter.sh/nodepool" \
--set affinity.nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].preference.matchExpressions[0].operator="DoesNotExist" \
```

That produces:

```yaml
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - preference:
        matchExpressions:
        - key: karpenter.sh/nodepool
          operator: DoesNotExist
      weight: 100
```

Now, I'm still not sure I would recommend this configuration, but this is definitely something that's achievable by overriding the default values with Helm.
I'm not saying it's guaranteed to cause it, but I am saying that it can be risky. Without any back-up, if the kubelet happens to go down on a Karpenter-managed node while another one is affected by a PDB, you are out of luck.
You can avoid this by running the Karpenter pods on EKS Fargate.
Yeah, I have no clue about that. If the PDB didn't work in that case (and it was there), how are you expecting it to work when you are running Karpenter pods on Karpenter-managed capacity?
Description
Observed Behavior:
In the current Karpenter Helm chart, a default node anti-affinity rule is configured.
Given it's a map, it makes things even worse: the value can't be overridden with a different config, except by setting it to null to disable it entirely.
With this setting, basically, any node brought up by Karpenter can't be used to run the Karpenter pods themselves, and there's also no proper PDB setting to prevent that.
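For reference, the default rule in question looks roughly like the following (reconstructed from the override flags discussed above; the exact chart value may differ slightly):

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        # hard requirement: Karpenter pods may only schedule onto nodes
        # that were NOT provisioned by Karpenter (no nodepool label)
        - key: karpenter.sh/nodepool
          operator: DoesNotExist
```

Because this is a `required` rule, the scheduler will leave Karpenter pods Pending forever if the only available capacity is Karpenter-managed.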
What happened to us: some (wrongly configured) high-priority pods were originally running on the same node groups that were managed by Karpenter. Over time, as those Karpenter nodes got recycled due to TTL, those pods were rescheduled onto the ASG-managed nodes we had brought up during cluster creation. Due to the wrongly configured priority setting, all Karpenter pods were evicted from those nodes to make room for these pods. Since the ASG was not controlled by cluster-autoscaler, no Karpenter pod was able to run anywhere because of the anti-affinity setting above.
In our previous experience with cluster-autoscaler, it ran fine on any nodes regardless of who managed them, so I don't see a need for this rule here either. I'd suggest removing it, and also setting a proper PDB for Karpenter.
Expected Behavior:
Karpenter can always run
Reproduction Steps (Please include YAML):
Schedule some pods with a higher priority onto the nodes running Karpenter, to force the Karpenter pods to get kicked out.
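A minimal sketch of such a preempting workload (the names, label, priority value, and resource request are all illustrative assumptions, not from the original report):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: too-high-priority        # hypothetical name
value: 1000000000                # higher than Karpenter's priority class
globalDefault: false
---
apiVersion: v1
kind: Pod
metadata:
  name: preempting-pod           # hypothetical name
spec:
  priorityClassName: too-high-priority
  nodeSelector:
    # assumption: a label selecting the node group where Karpenter runs
    role: karpenter-system
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.9
    resources:
      requests:
        cpu: "3"                 # large enough that the scheduler must preempt Karpenter pods
```

Once the Karpenter pods are preempted, the required anti-affinity prevents them from landing on any Karpenter-provisioned node, and they stay Pending.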
Versions:
Kubernetes Version (`kubectl version`): doesn't matter