The default anti-affinity rule could cause all Karpenter pods to end up in a Pending state in some edge cases #6160

Closed
fivesheep opened this issue May 6, 2024 · 4 comments
Labels
lifecycle/closed lifecycle/stale question Further information is requested

Comments

@fivesheep

Description

Observed Behavior:
In the current Karpenter Helm chart, the affinity is configured with this value:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: karpenter.sh/nodepool
              operator: DoesNotExist
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: "kubernetes.io/hostname"

Given that it's a map, this makes it even worse: the value can't be overridden with a different config, except by setting it to null to disable it entirely.

With the above setting, any node brought up by Karpenter can't be used by the Karpenter pods, and there's also no proper PDB setting to prevent that.

What happened to us: some (wrongly configured) high-priority pods were originally running on the node groups managed by Karpenter. Over time, as those Karpenter nodes got recycled due to TTL, all those pods were scheduled onto the ASG-managed nodes we had brought up during cluster creation. Due to the wrongly configured priority setting, all Karpenter pods were evicted from the nodes they were running on to make room for these misconfigured pods. Since the ASG wasn't controlled by Cluster Autoscaler, we ended up with no Karpenter pod able to run anywhere because of the above anti-affinity setting.

In our previous experience with Cluster Autoscaler, its pods ran fine on any nodes regardless of who managed them, so I don't see a need for this rule here either. I'd suggest removing it and also setting a proper PDB for Karpenter.

Expected Behavior:
Karpenter can always run

Reproduction Steps (Please include YAML):
Just put some pods with a higher priority on the nodes running Karpenter to force the Karpenter pods to get evicted.
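
A minimal sketch of such a preempting workload, assuming Karpenter runs on an EKS managed node group labeled eks.amazonaws.com/nodegroup=karpenter-system; the PriorityClass name, priority value, node label value, and resource requests are all illustrative:

# Illustrative only: a high PriorityClass plus a pod pinned to the nodes where Karpenter runs
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: very-high-priority
value: 1000000000
preemptionPolicy: PreemptLowerPriority
globalDefault: false
---
apiVersion: v1
kind: Pod
metadata:
  name: preempting-pod
spec:
  priorityClassName: very-high-priority
  nodeSelector:
    eks.amazonaws.com/nodegroup: karpenter-system   # assumed label of the node group running Karpenter
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.9
      resources:
        requests:
          cpu: "3"        # sized so the scheduler has to preempt lower-priority pods on that node
          memory: 4Gi

With the default anti-affinity in place, the evicted Karpenter pods then have nowhere to go except nodes without the karpenter.sh/nodepool label.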

Versions:

  • Chart Version: current
  • Kubernetes Version (kubectl version): doesn't matter
@fivesheep fivesheep added bug Something isn't working needs-triage Issues that need to be triaged labels May 6, 2024
@jonathan-innis
Contributor

Given that it's a map, this makes it even worse: the value can't be overridden with a different config, except by setting it to null to disable it entirely

Helm allows you to override specific keys in the values map. So you should (for instance) be able to --set affinity.nodeAffinity={...} when you are deploying Karpenter.

any node brought up by Karpenter can't be used by the Karpenter pods, and there's also no proper PDB setting to prevent that

Yes, we're pretty cautious about allowing users to create a circular dependency on themselves today. There are definitely people that are bootstrapping Karpenter and then allowing Karpenter to provision capacity for itself and then remove the underlying, original capacity, but it can definitely get dangerous. Karpenter is going to disrupt its own nodes and if both of those nodes go bad at once, you're out of luck because Karpenter is not going to be able to restart to spin up more capacity -- and now you are taking an outage.

all Karpenter pods were evicted from the nodes they were running on to make room for these misconfigured pods

Yeah, I think the right way to solve this is to ensure that the MNG you are using here is fully isolated so that you don't get into this state: high priority, its own taints and nodeSelectors, etc., to ensure that Karpenter is always running.
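
For example (a sketch, not an official recommendation): taint the dedicated node group and give only the Karpenter pods a matching toleration, nodeSelector, and high priority through the chart values. The taint key/value, node group label, and the assumption that the chart exposes top-level nodeSelector, tolerations, and priorityClassName values are illustrative:

# Taint applied to the managed node group (via eksctl, the console, or IaC):
#   key: dedicated, value: karpenter, effect: NoSchedule
# Karpenter chart values (illustrative):
nodeSelector:
  eks.amazonaws.com/nodegroup: karpenter-system
tolerations:
  - key: dedicated
    operator: Equal
    value: karpenter
    effect: NoSchedule
priorityClassName: system-cluster-critical

With that in place, workloads without the toleration can neither schedule onto nor preempt anything on the Karpenter node group.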

I'd suggest removing it and also setting a proper PDB for Karpenter

We do have a PodDisruptionBudget set for Karpenter; you can find it here. Did you not see it work when the Karpenter pods got evicted?
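
For reference, the chart's PDB is roughly of this shape (the exact name, labels, and maxUnavailable value come from the chart templates, so treat the details below as approximate):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: karpenter
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/instance: karpenter
      app.kubernetes.io/name: karpenter

Note that a PDB only guards against voluntary, eviction-API disruptions; scheduler preemption honors PDBs on a best-effort basis only, which may be why it didn't protect the pods in this case.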

@jonathan-innis jonathan-innis added question Further information is requested and removed bug Something isn't working needs-triage Issues that need to be triaged labels May 9, 2024
@jonathan-innis jonathan-innis self-assigned this May 9, 2024
@fivesheep
Author

@jonathan-innis For whatever reason, the PDB didn't help in that case.

Regarding the default nodeAffinity: remember, that's a map, and for maps the Helm override operation is a merge. Say I want to set a different affinity rule: it won't let me do that, because requiredDuringScheduling... and preferredDuringScheduling... are different keys, so after the merge I end up with both rather than just the one I want. The only way to override it is to set it to null, but then we can't have any affinity setting at all.

We used Cluster Autoscaler before; it also ran on the nodes it managed, and we didn't see any issue. Does that mean that for Karpenter we already need a group of nodes that isn't managed by it and is dedicated to Karpenter? Sure, that can be done by adding some taints, but it increases the maintenance cost, and we'd always need to babysit this separate node group (updating it, etc.).

Since we have a PDB, allowing Karpenter to run on the nodes it manages shouldn't cause the issue you mentioned.

@jonathan-innis
Contributor

jonathan-innis commented May 10, 2024

it won't let me do that

I'm able to get it to work. Just as an example, here are the --set commands I'm running to make Helm ignore the required affinity and set a preferred one.

--set affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution=null \
--set affinity.nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].weight=100 \
--set affinity.nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].preference.matchExpressions[0].key="karpenter.sh/nodepool" \
--set affinity.nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].preference.matchExpressions[0].operator="DoesNotExist" \

That produces

affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - preference:
        matchExpressions:
        - key: karpenter.sh/nodepool
          operator: DoesNotExist
      weight: 100

Now, I'm still not sure I would recommend this configuration, but this is definitely something that's achievable by overriding the default values with helm.

allowing Karpenter to run on the nodes it manages shouldn't cause the issue you mentioned

I'm not saying it's guaranteed to cause it, but I am saying that it can be risky. Without any back-up, if the kubelet happens to go down on a Karpenter-managed node while another one is affected by a PDB, you are out of luck.

that can be done by adding some taints, but it increases the maintenance cost

You can avoid this by running the Karpenter pods on EKS Fargate.
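
A sketch of what that could look like with an eksctl Fargate profile matching the Karpenter namespace; the cluster name, region, and namespace are placeholders:

# eksctl ClusterConfig excerpt (illustrative): run the karpenter namespace on Fargate
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster        # placeholder
  region: us-west-2       # placeholder
fargateProfiles:
  - name: karpenter
    selectors:
      - namespace: karpenter   # namespace the Karpenter controller is installed into

That way the controller pods never depend on EC2 capacity at all, so the circular-dependency concern above goes away.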

for whatever reason, the PDB didn't help in that case

Yeah, I have no clue about that. If it didn't work in that case (and it was there), how are you expecting it to work when you are running Karpenter pods on Karpenter-managed capacity?

@github-actions
Contributor

This issue has been inactive for 14 days. StaleBot will close this stale issue after 14 more days of inactivity.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jun 8, 2024