The default anti-affinity rule could cause all Karpenter pods to end up in a Pending state in some edge cases #6160

Closed
fivesheep opened this issue May 6, 2024 · 4 comments
Labels
lifecycle/closed lifecycle/stale question Further information is requested

Comments

@fivesheep

Description

Observed Behavior:
In the current Karpenter Helm chart, the affinity is configured with this value:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: karpenter.sh/nodepool
              operator: DoesNotExist
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: "kubernetes.io/hostname"

Given that it's a map, this makes it even worse: the value can't be overridden with a different config, except by setting it to null to disable it entirely.

With the above setting, any node brought up by Karpenter can't be used by the Karpenter pods, and there's also no proper PDB setting to prevent that.

What happened to us: some (wrongly configured) high-priority pods were originally running on the node groups managed by Karpenter. Over time, as those Karpenter nodes got recycled due to TTL, all those pods were scheduled onto the ASG-managed nodes we had brought up during cluster creation. Due to the wrongly configured priority setting, all Karpenter pods were evicted from the nodes they were running on to make room for these misconfigured pods. Since the ASG wasn't controlled by Cluster Autoscaler, we ended up with no Karpenter pod able to run anywhere because of the above anti-affinity setting.

In our previous experience with Cluster Autoscaler, its pods ran fine on any nodes regardless of who managed them, so I don't see a need for this rule here either. I'd suggest removing it and also setting a proper PDB for Karpenter.

Expected Behavior:
Karpenter can always run

Reproduction Steps (Please include YAML):
Just put some pods with a higher priority on the nodes running Karpenter to force the Karpenter pods to get evicted.
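
A minimal sketch of such a preempting workload, assuming Karpenter runs on an EKS managed node group labeled eks.amazonaws.com/nodegroup=karpenter-system; the PriorityClass name, priority value, node label value, and resource requests are all illustrative:

# Illustrative only: a high PriorityClass plus a pod pinned to the nodes where Karpenter runs
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: very-high-priority
value: 1000000000
preemptionPolicy: PreemptLowerPriority
globalDefault: false
---
apiVersion: v1
kind: Pod
metadata:
  name: preempting-pod
spec:
  priorityClassName: very-high-priority
  nodeSelector:
    eks.amazonaws.com/nodegroup: karpenter-system   # assumed label of the node group running Karpenter
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.9
      resources:
        requests:
          cpu: "3"        # sized so the scheduler has to preempt lower-priority pods on that node
          memory: 4Gi

With the default anti-affinity in place, the evicted Karpenter pods then have nowhere to go except nodes without the karpenter.sh/nodepool label.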

Versions:

  • Chart Version: current
  • Kubernetes Version (kubectl version): doesn't matter
@fivesheep fivesheep added bug Something isn't working needs-triage Issues that need to be triaged labels May 6, 2024
@jonathan-innis
Contributor

Given that it's a map, this makes it even worse: the value can't be overridden with a different config, except by setting it to null to disable it entirely

Helm allows you to override specific keys in the values map. So you should (for instance) be able to --set affinity.nodeAffinity={...} when you are deploying Karpenter.

any node brought up by Karpenter can't be used by the Karpenter pods, and there's also no proper PDB setting to prevent that

Yes, we're pretty cautious about allowing users to create a circular dependency on themselves today. There are definitely people that are bootstrapping Karpenter and then allowing Karpenter to provision capacity for itself and then remove the underlying, original capacity, but it can definitely get dangerous. Karpenter is going to disrupt its own nodes and if both of those nodes go bad at once, you're out of luck because Karpenter is not going to be able to restart to spin up more capacity -- and now you are taking an outage.

all Karpenter pods were evicted from the nodes they were running on to make room for these misconfigured pods

Yeah, I think the right way to solve this is to ensure that the MNG you are using here is fully isolated so that you don't get into this state: high priority, its own taints and nodeSelectors, etc., to ensure that Karpenter is always running.
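
For example (a sketch, not an official recommendation): taint the dedicated node group and give only the Karpenter pods a matching toleration, nodeSelector, and high priority through the chart values. The taint key/value, node group label, and the assumption that the chart exposes top-level nodeSelector, tolerations, and priorityClassName values are illustrative:

# Taint applied to the managed node group (via eksctl, the console, or IaC):
#   key: dedicated, value: karpenter, effect: NoSchedule
# Karpenter chart values (illustrative):
nodeSelector:
  eks.amazonaws.com/nodegroup: karpenter-system
tolerations:
  - key: dedicated
    operator: Equal
    value: karpenter
    effect: NoSchedule
priorityClassName: system-cluster-critical

With that in place, workloads without the toleration can neither schedule onto nor preempt anything on the Karpenter node group.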

I'd suggest removing it and also setting a proper PDB for Karpenter

We do have a PodDisruptionBudget set for Karpenter; you can find it here. Did you not see it work when the Karpenter pods got evicted?
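
For reference, the chart's PDB is roughly of this shape (the exact name, labels, and maxUnavailable value come from the chart templates, so treat the details below as approximate):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: karpenter
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/instance: karpenter
      app.kubernetes.io/name: karpenter

Note that a PDB only guards against voluntary, eviction-API disruptions; scheduler preemption honors PDBs on a best-effort basis only, which may be why it didn't protect the pods in this case.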

@jonathan-innis jonathan-innis added question Further information is requested and removed bug Something isn't working needs-triage Issues that need to be triaged labels May 9, 2024
@jonathan-innis jonathan-innis self-assigned this May 9, 2024
@fivesheep
Author

@jonathan-innis For whatever reason, the PDB didn't help in that case.

Regarding the default nodeAffinity: remember, that's a map, and for maps the Helm override operation is a merge. Say I want to set a different affinity rule: it won't let me do that, because requiredDuringScheduling... and preferredDuringScheduling... are different keys, so after the merge I end up with both rather than just the one I want. The only way to override it is to set it to null, but then we can't have any affinity setting at all.

We used Cluster Autoscaler before; it also ran on the nodes it managed, and we didn't see any issue. Does that mean that for Karpenter we already need a group of nodes that isn't managed by it and is dedicated to Karpenter? Sure, that can be done by adding some taints, but it increases the maintenance cost, and we'd always need to babysit this separate node group (updating it, etc.).

Since we have a PDB, allowing Karpenter to run on the nodes it manages shouldn't cause the issue you mentioned.

@jonathan-innis
Contributor

jonathan-innis commented May 10, 2024

it won't let me do that

I'm able to get it to work. Just as an example, here are the --set commands I'm running to make Helm ignore the required affinity and set a preferred one.

--set affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution=null \
--set affinity.nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].weight=100 \
--set affinity.nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].preference.matchExpressions[0].key="karpenter.sh/nodepool" \
--set affinity.nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].preference.matchExpressions[0].operator="DoesNotExist" \

That produces

affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - preference:
        matchExpressions:
        - key: karpenter.sh/nodepool
          operator: DoesNotExist
      weight: 100

Now, I'm still not sure I would recommend this configuration, but this is definitely something that's achievable by overriding the default values with helm.

allowing Karpenter to run on the nodes it manages shouldn't cause the issue you mentioned

I'm not saying it's guaranteed to cause it, but I am saying that it can be risky. Without any back-up, if the kubelet happens to go down on a Karpenter-managed node while another one is affected by a PDB, you are out of luck.

that can be done by adding some taints, but it increases the maintenance cost

You can avoid this by running the Karpenter pods on EKS Fargate.
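
A sketch of what that could look like with an eksctl Fargate profile matching the Karpenter namespace; the cluster name, region, and namespace are placeholders:

# eksctl ClusterConfig excerpt (illustrative): run the karpenter namespace on Fargate
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster        # placeholder
  region: us-west-2       # placeholder
fargateProfiles:
  - name: karpenter
    selectors:
      - namespace: karpenter   # namespace the Karpenter controller is installed into

That way the controller pods never depend on EC2 capacity at all, so the circular-dependency concern above goes away.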

for whatever reason, the PDB didn't help in that case

Yeah, I have no clue about that. If it didn't work in that case (and it was there), how are you expecting it to work when you are running Karpenter pods on Karpenter-managed capacity?

@github-actions
Contributor

This issue has been inactive for 14 days. StaleBot will close this stale issue after 14 more days of inactivity.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jun 8, 2024