Karpenter fails to schedule a pending pod with a preferred affinity #1204

Open · wmgroot opened this issue Apr 23, 2024 · 4 comments

Labels
kind/bug Categorizes issue or PR as related to a bug.

wmgroot commented Apr 23, 2024

Description

Observed Behavior:
We have a pod stuck in pending indefinitely and Karpenter does not take action to add a new node to allow the pod to schedule.

$ kubectl get pod -n capa-system
NAME                                      READY   STATUS    RESTARTS   AGE
capa-controller-manager-7c6f4fbf6-2wxxr   0/1     Pending   0          4d5h
capa-controller-manager-7c6f4fbf6-9lsd6   1/1     Running   0          4d5h

The pod has a soft node affinity that prefers control plane nodes. Since this is an EKS cluster, the pod can never actually schedule on a control plane node.

    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - preference:
              matchExpressions:
              - key: node-role.kubernetes.io/control-plane
                operator: Exists
            weight: 10

Expected Behavior:
Karpenter creates a node so the pod can schedule, even though the pod has a soft affinity preference that cannot be satisfied. Leaving the pod unscheduled can result in prolonged outages, blocked PDBs, and other undesirable behavior that requires manual intervention, all of which is worse than an unsatisfied soft affinity.

Reproduction Steps (Please include YAML):
I believe this should be reproducible with a pod that uses a nodeSelector/toleration for an isolated NodePool for easier testing; a sketch manifest follows the list below.

  • Any unsatisfiable preferred constraint in an affinity should trigger the observed behavior (such as a label that will never exist on nodes in the NodePool).
  • Do note that the pod must not have room to schedule without Karpenter taking action; otherwise Kubernetes will schedule it successfully without satisfying the soft affinity constraint. Using a NodePool that has to scale up from 0 is an effective way to test this.
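A minimal pod along these lines should reproduce it (a sketch only; the NodePool name, toleration key, and image are hypothetical and need to be adapted to your cluster):

    apiVersion: v1
    kind: Pod
    metadata:
      name: soft-affinity-repro
    spec:
      # Pin the pod to an isolated NodePool that has to scale up from 0.
      # The NodePool name and toleration below are hypothetical.
      nodeSelector:
        karpenter.sh/nodepool: isolated-test
      tolerations:
      - key: isolated-test
        operator: Exists
        effect: NoSchedule
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - preference:
              matchExpressions:
              # Restricted-domain label that will never exist on nodes in the NodePool.
              - key: node-role.kubernetes.io/control-plane
                operator: Exists
            weight: 10
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9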

Upon removing the affinity spec from the example above, Karpenter added a node immediately to allow the pod to schedule.

$ kubectl get pod -n capa-system
NAME                                      READY   STATUS    RESTARTS   AGE
capa-controller-manager-7c6f4fbf6-4wc6b   1/1     Running   0          12m
capa-controller-manager-7c6f4fbf6-hqtjb   1/1     Running   0          12m

Versions:

  • Chart Version: 0.35.0
  • Kubernetes Version (kubectl version):
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.1", GitCommit:"e4d4e1ab7cf1bf15273ef97303551b279f0920a9", GitTreeState:"clean", BuildDate:"2022-09-14T19:49:27Z", GoVersion:"go1.19.1", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"26+", GitVersion:"v1.26.14-eks-b9c9ed7", GitCommit:"7c3f2be51edd9fa5727b6ecc2c3fc3c578aa02ca", GitTreeState:"clean", BuildDate:"2024-03-02T03:46:35Z", GoVersion:"go1.21.7", Compiler:"gc", Platform:"linux/amd64"}

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments; they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@wmgroot wmgroot added kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 23, 2024
tzneal (Contributor) commented Apr 23, 2024

It's not just an unsatisfiable preferred node affinity that causes this. It's due to the label belonging to a restricted domain (node-role.kubernetes.io/control-plane). If you modify the label to be something else, Karpenter will launch capacity for the pod.

karpenter-5bb56f6d9b-l8v4x controller {"level":"DEBUG","time":"2024-04-23T20:03:55.814Z","logger":"controller.disruption","message":"ignoring pod, label node-role.kubernetes.io/control-plane is restricted; specify a well known label: [karpenter.k8s.aws/instance-accelerator-count karpenter.k8s.aws/instance-accelerator-manufacturer karpenter.k8s.aws/instance-accelerator-name karpenter.k8s.aws/instance-category karpenter.k8s.aws/instance-cpu karpenter.k8s.aws/instance-encryption-in-transit-supported karpenter.k8s.aws/instance-family karpenter.k8s.aws/instance-generation karpenter.k8s.aws/instance-gpu-count karpenter.k8s.aws/instance-gpu-manufacturer karpenter.k8s.aws/instance-gpu-memory karpenter.k8s.aws/instance-gpu-name karpenter.k8s.aws/instance-hypervisor karpenter.k8s.aws/instance-local-nvme karpenter.k8s.aws/instance-memory karpenter.k8s.aws/instance-network-bandwidth karpenter.k8s.aws/instance-size karpenter.sh/capacity-type karpenter.sh/nodepool kubernetes.io/arch kubernetes.io/os node.kubernetes.io/instance-type node.kubernetes.io/windows-build topology.kubernetes.io/region topology.kubernetes.io/zone], or a custom label that does not use a restricted domain: [k8s.io karpenter.k8s.aws karpenter.sh kubernetes.io]","commit":"8b2d1d7","pod":"default/test-pod"}

It may not be necessary to validate preferred terms, however, since they can be ignored.
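For example, the same preference keyed on a custom label outside the restricted domains (the example.com key below is hypothetical) is accepted, and Karpenter launches capacity for the pod:

    affinity:
      nodeAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - preference:
            matchExpressions:
            # Hypothetical custom label; not under k8s.io, kubernetes.io,
            # karpenter.sh, or karpenter.k8s.aws, so it is not restricted.
            - key: example.com/node-role
              operator: Exists
          weight: 10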

@tzneal tzneal removed the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Apr 23, 2024
wmgroot (Author) commented Apr 23, 2024

Would you mind linking me to the code for this special handling? I'm searching for any mention of a "node-role" or "control-plane" label and not finding anything. I don't think it's safe to just ignore any kubernetes.io label, since some of those are reasonable to use for node selection.

billrayburn commented

/assign @jmdeal

cnmcavoy commented May 10, 2024

@wmgroot I believe what is happening is that the affinity causes Karpenter to compute a NodeClaim with the restricted domain as one of its labels. Karpenter then validates the NodeClaim, detects the restricted label, and determines the NodeClaim is unsatisfiable and cannot be created, so Karpenter does not scale up a node.
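If that reading is right, the computed NodeClaim would carry a requirement roughly like the one sketched below (illustrative only, not actual Karpenter output; the NodePool name is made up), and it fails validation because node-role.kubernetes.io falls under the restricted kubernetes.io domain listed in the log above:

    apiVersion: karpenter.sh/v1beta1
    kind: NodeClaim
    spec:
      requirements:
      - key: karpenter.sh/nodepool
        operator: In
        values: ["default"]
      # Requirement derived from the pod's preferred term; its label domain is restricted.
      - key: node-role.kubernetes.io/control-plane
        operator: Exists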
