Karpenter does not disrupt #6086
Comments
Can you show the events from
I have the same issue and I see this event
Can you check the pods on the node? Our scheduling simulation thinks it would need more nodes if it were removed, so there is likely a preferred topology spread or anti-affinity on the pods on the node.
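For context, even a soft (preferred) anti-affinity is enough to have this effect, because the consolidation simulation still tries to honor it when it plans where the evicted pods would go. A minimal illustrative pod-spec fragment (the label and weight below are made up, not taken from this issue); the same reasoning applies to topology spread constraints with whenUnsatisfiable: ScheduleAnyway:

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100                              # soft preference, but still simulated
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: my-app                        # illustrative label
        topologyKey: kubernetes.io/hostname    # spread pods across hosts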
@haim-bp any updates here?
I encountered a similar issue.

Karpenter Disruption Issue

I have been testing Karpenter and encountered an issue where disruption suddenly stops working.

Situation

The NodePool is as follows:

spec:
template:
spec:
requirements:
- key: "node.kubernetes.io/instance-type"
operator: In
values: ["m6a.4xlarge"]
minValues: 1
- key: "karpenter.sh/capacity-type"
operator: In
values: ["on-demand"]
disruption:
budgets:
- nodes: 20%
- schedule: "0,30 * * * *"
duration: 5m
nodes: "0"
limits:
cpu: "80000"
memory: 320Gi
For instance, after disrupting from 5 to 4 to 3 to 2 nodes, the remaining 2 nodes never get disrupted.
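For context on the budgets above: if I read the disruption-budget semantics correctly (worth double-checking against the disruption docs), the two entries combine as annotated below, and the most restrictive active budget wins, so during the 5-minute windows at :00 and :30 no nodes can be voluntarily disrupted at all. It may also be worth checking how the 20% figure rounds once only two nodes remain.

disruption:
  budgets:
  - nodes: 20%                 # at most 20% of the NodePool's nodes disrupted at once
  - schedule: "0,30 * * * *"   # starting at minute 0 and minute 30 of every hour...
    duration: 5m               # ...for five minutes...
    nodes: "0"                 # ...zero nodes may be voluntarily disrupted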
Events

The disruption events are as follows:

0s Normal RemovingNode node/ip-10-80-97-82.ap-northeast-1.compute.internal Node ip-10-80-97-82.ap-northeast-1.compute.internal event: Removing Node ip-10-80-97-82.ap-northeast-1.compute.internal from Controller
0s Normal Completed job/my-project-transfer-job-health-detector-28598705 Job completed
0s Normal SuccessfulDelete cronjob/my-project-transfer-job-health-detector Deleted job my-project-transfer-job-health-detector-28598695
NodeClaim

The NodeClaim:
apiVersion: karpenter.sh/v1beta1
kind: NodeClaim
metadata:
annotations:
karpenter.k8s.aws/ec2nodeclass-hash: "4618831760887766303"
karpenter.k8s.aws/ec2nodeclass-hash-version: v2
karpenter.k8s.aws/tagged: "true"
karpenter.sh/nodepool-hash: "16961008295110681836"
karpenter.sh/nodepool-hash-version: v2
kubectl.kubernetes.io/last-applied-configuration: |
{"apiVersion":"karpenter.k8s.aws/v1beta1","kind":"EC2NodeClass","metadata":{"annotations":{"kubernetes.io/description":"EC2NodeClass for worker node with custom userdata"},"name":"node"},"spec":{"amiFamily":"AL2","amiSelectorTerms":[{"id":"ami-0594c768bd89780c7"}],"associatePublicIPAddress":false,"blockDeviceMappings":[{"deviceName":"/dev/xvda","ebs":{"deleteOnTermination":true,"iops":3000,"throughput":125,"volumeSize":"100Gi","volumeType":"gp3"}},{"deviceName":"/dev/xvdba","ebs":{"deleteOnTermination":true,"volumeSize":"2000Gi","volumeType":"st1"}}],"instanceProfile":"台-eks-cluster-staging","metadataOptions":{"httpEndpoint":"enabled","httpPutResponseHopLimit":2,"httpTokens":"required"},"securityGroupSelectorTerms":[{"tags":{"karpenter.sh/discovery":"my-project-eks-cluster-staging-node"}}],"subnetSelectorTerms":[{"tags":{"karpenter.sh/discovery":"my-project-eks-cluster-staging"}}],"tags":{"Name":"my-project-eks-cluster-staging-1_28"},"userData":"export AWS_MAX_ATTEMPTS=6\nNODE_TYPE=node\nS3_BUCKET_NAME=my-project-eks-configs-staging\nSLACK_WEBHOOK_URL_SSM_KEY=\"/my-project_ops/SLACK_WEBHOOK_URL/my-project_alerts\"\n\npost_slack () {\n local message=\"$1\"\n SLACK_WEBHOOK_URL=$(aws ssm get-parameter --name \"$SLACK_WEBHOOK_URL_SSM_KEY\" --with-decryption | jq -r '.Parameter.Value')\n TOKEN=$(curl -s -X PUT -H \"X-aws-ec2-metadata-token-ttl-seconds: 300\" \"http://169.254.169.254/latest/api/token\")\n INSTANCE_ID=$(curl -s -H \"X-aws-ec2-metadata-token: $TOKEN\
" \"http://169.254.169.254/latest/meta-data/instance-id\")\n curl -X POST -H 'Content-type: application/json' -d \"{\\\"text\\\":\\\"
$INSTANCE_ID: $message\\\"}\" ${SLACK_WEBHOOK_URL}\n echo \"$message\"\n}\n\naws s3 cp s3://\"$S3_BUCKET_NAME\"/userdata/\"$NODE_TYPE
\"-enc.sh /var/lib/cloud/\nif [ $? -ne 0 ]; then\n post_slack \"Error: Failed to download user_data (s3://${S3_BUCKET_NAME}/userdata/
${NODE_TYPE}-enc.sh)\"\n exit 1\nfi\nbase64 -d /var/lib/cloud/\"$NODE_TYPE\"-enc.sh \u003e /var/lib/cloud/userdata-\"$NODE_TYPE\".sh\
nchmod 755 /var/lib/cloud/userdata-\"$NODE_TYPE\".sh\n/var/lib/cloud/userdata-\"$NODE_TYPE\".sh\n"}}
kubernetes.io/description: EC2NodeClass for worker node with custom userdata
creationTimestamp: "2024-05-17T04:25:28Z"
finalizers:
- karpenter.sh/termination
generateName: node-
generation: 1
labels:
karpenter.k8s.aws/instance-cpu: "16"
karpenter.k8s.aws/instance-cpu-manufacturer: amd
karpenter.k8s.aws/instance-encryption-in-transit-supported: "true"
karpenter.k8s.aws/instance-family: m6a
karpenter.k8s.aws/instance-generation: "6"
karpenter.k8s.aws/instance-hypervisor: nitro
karpenter.k8s.aws/instance-memory: "65536"
karpenter.k8s.aws/instance-network-bandwidth: "6250"
karpenter.k8s.aws/instance-size: 4xlarge
karpenter.sh/capacity-type: on-demand
karpenter.sh/nodepool: node
kubernetes.io/arch: amd64
kubernetes.io/os: linux
node.kubernetes.io/instance-type: m6a.4xlarge
topology.kubernetes.io/region: ap-northeast-1
topology.kubernetes.io/zone: ap-northeast-1a
my-project.io/node: "true"
name: node-zl8z4
ownerReferences:
- apiVersion: karpenter.sh/v1beta1
blockOwnerDeletion: true
kind: NodePool
name: node
resourceVersion: "110875561"
uid: b618e73b-4f6d-45a0-89a4-4061eeae1131
spec:
nodeClassRef:
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
name: node
requirements:
- key: karpenter.sh/nodepool
operator: In
values:
- node
- key: node.kubernetes.io/instance-type
minValues: 1
operator: In
values:
- m6a.4xlarge
- key: karpenter.sh/capacity-type
operator: In
values:
- on-demand
- key: my-project.io/node
operator: In
values:
- "true"
resources:
requests:
cpu: 14935m
memory: "15799001856"
pods: "16"
status:
allocatable:
cpu: 15890m
ephemeral-storage: 89Gi
memory: 57691Mi
pods: "234"
vpc.amazonaws.com/pod-eni: "54"
capacity:
cpu: "16"
ephemeral-storage: 100Gi
memory: 60620Mi
pods: "234"
vpc.amazonaws.com/pod-eni: "54"
conditions:
- lastTransitionTime: "2024-05-17T04:28:46Z"
status: "True"
type: Initialized
- lastTransitionTime: "2024-05-17T04:25:31Z"
status: "True"
type: Launched
- lastTransitionTime: "2024-05-17T04:28:46Z"
status: "True"
type: Ready
- lastTransitionTime: "2024-05-17T04:27:53Z"
status: "True"
type: Registered
imageID: ami-0594c768bd89780c7
nodeName: ip-10-80-87-240.ap-northeast-1.compute.internal
providerID: aws:///ap-northeast-1a/i-0011a40e85bd3e97a

Node

The Node:
apiVersion: v1
kind: Node
metadata:
annotations:
alpha.kubernetes.io/provided-node-ip: 10.80.87.240
csi.volume.kubernetes.io/nodeid: '{"csi.tigera.io":"ip-10-80-87-240.ap-northeast-1.compute.internal","ebs.csi.aws.com":"i-0011a40e85bd3e97a"}'
karpenter.k8s.aws/ec2nodeclass-hash: "4618831760887766303"
karpenter.k8s.aws/ec2nodeclass-hash-version: v2
karpenter.sh/nodepool-hash: "16961008295110681836"
karpenter.sh/nodepool-hash-version: v2
kubectl.kubernetes.io/last-applied-configuration: |
{"apiVersion":"karpenter.k8s.aws/v1beta1","kind":"EC2NodeClass","metadata":{"annotations":{"kubernetes.io/description":"EC2NodeClass for worker node with custom userdata"},"name":"node"},"spec":{"amiFamily":"AL2","amiSelectorTerms":[{"id":"ami-0594c768bd89780c7"}],"associatePublicIPAddress":false,"blockDeviceMappings":[{"deviceName":"/dev/xvda","ebs":{"deleteOnTermination":true,"iops":3000,"throughput":125,"volumeSize":"100Gi","volumeType":"gp3"}},{"deviceName":"/dev/xvdba","ebs":{"deleteOnTermination":true,"volumeSize":"2000Gi","volumeType":"st1"}}],"instanceProfile":"my-project-eks-cluster-staging","metadataOptions":{"httpEndpoint":"enabled","httpPutResponseHopLimit":2,"httpTokens":"required"},"securityGroupSelectorTerms":[{"tags":{"karpenter.sh/discovery":"my-project-eks-cluster-staging-node"}}],"subnetSelectorTerms":[{"tags":{"karpenter.sh/discovery":"my-project-eks-cluster-staging"}}],"tags":{"Name":"my-project-eks-cluster-staging-1_28"},"userData":"export AWS_MAX_ATTEMPTS=6\nNODE_TYPE=node\nS3_BUCKET_NAME=my-project-eks-configs-staging\nSLACK_WEBHOOK_URL_SSM_KEY=\"/my-project_ops/SLACK_WEBHOOK_URL/my-project_alerts\"\n\npost_slack () {\n local message=\"$1\"\n SLACK_WEBHOOK_URL=$(aws ssm get-parameter --name \"$SLACK_WEBHOOK_URL_SSM_KEY\" --with-decryption | jq -r '.Parameter.Value')\n TOKEN=$(curl -s -X PUT -H \"X-aws-ec2-metadata-token-ttl-seconds: 300\" \"http://169.254.169.254/latest/api/token\")\n INSTANCE_ID=$(curl -s -H \"X-aws-ec2-metadata-token: $TOKEN\" \"http://169.254.169.254/latest/meta-data/instance-id\")\n curl -X POST -H 'Content-type: application/json' -d \"{\\\"text\\\":\\\"$INSTANCE_ID: $message\\\"}\" ${SLACK_WEBHOOK_URL}\n echo \"$message\"\n}\n\naws s3 cp s3://\"$S3_BUCKET_NAME\"/userdata/\"$NODE_TYPE\"-enc.sh /var/lib/cloud/\nif [ $? -ne 0 ]; then\n post_slack \"Error: Failed to download user_data (s3://${S3_BUCKET_NAME}/userdata/${NODE_TYPE}-enc.sh)\"\n exit 1\nfi\nbase64 -d /var/lib/cloud/\"$NODE_TYPE\"-enc.sh \u003e /var/lib/cloud/userdata-\"$NODE_TYPE\".sh\nchmod 755 /var/lib/cloud/userdata-\"$NODE_TYPE\".sh\n/var/lib/cloud/userdata-\"$NODE_TYPE\".sh\n"}}
kubernetes.io/description: EC2NodeClass for worker node with custom userdata
node.alpha.kubernetes.io/ttl: "0"
volumes.kubernetes.io/controller-managed-attach-detach: "true"
creationTimestamp: "2024-05-17T04:27:52Z"
finalizers:
- karpenter.sh/termination
labels:
beta.kubernetes.io/arch: amd64
beta.kubernetes.io/instance-type: m6a.4xlarge
beta.kubernetes.io/os: linux
failure-domain.beta.kubernetes.io/region: ap-northeast-1
failure-domain.beta.kubernetes.io/zone: ap-northeast-1a
k8s.io/cloud-provider-aws: 0554ca41e8c22b1fa65ef916112e8e20
karpenter.k8s.aws/instance-category: m
karpenter.k8s.aws/instance-cpu: "16"
karpenter.k8s.aws/instance-cpu-manufacturer: amd
karpenter.k8s.aws/instance-encryption-in-transit-supported: "true"
karpenter.k8s.aws/instance-family: m6a
karpenter.k8s.aws/instance-generation: "6"
karpenter.k8s.aws/instance-hypervisor: nitro
karpenter.k8s.aws/instance-memory: "65536"
karpenter.k8s.aws/instance-network-bandwidth: "6250"
karpenter.k8s.aws/instance-size: 4xlarge
karpenter.sh/capacity-type: on-demand
karpenter.sh/initialized: "true"
karpenter.sh/nodepool: node
karpenter.sh/registered: "true"
kubernetes.io/arch: amd64
kubernetes.io/hostname: ip-10-80-87-240.ap-northeast-1.compute.internal
kubernetes.io/os: linux
node.kubernetes.io/instance-type: m6a.4xlarge
topology.ebs.csi.aws.com/zone: ap-northeast-1a
topology.kubernetes.io/region: ap-northeast-1
topology.kubernetes.io/zone: ap-northeast-1a
my-project.io/node: "true"
name: ip-10-80-87-240.ap-northeast-1.compute.internal
ownerReferences:
- apiVersion: karpenter.sh/v1beta1
blockOwnerDeletion: true
- names:
- 602401143452.dkr.ecr.ap-northeast-1.amazonaws.com/eks/pause@sha256:529cf6b1b6e5b76e901abc43aee825badbd93f9c5ee5f1e316d46a83abbce5a2
- 602401143452.dkr.ecr.ap-northeast-1.amazonaws.com/eks/pause:3.5
sizeBytes: 298689
nodeInfo:
architecture: amd64
bootID: a8d24e9e-953f-4624-aecd-158eea384646
containerRuntimeVersion: containerd://1.7.11
kernelVersion: 5.10.210-201.852.amzn2.x86_64
kubeProxyVersion: v1.28.5-eks-5e0fdde
kubeletVersion: v1.28.5-eks-5e0fdde
machineID: ec2a843502ebf3f28b33297833ab212b
operatingSystem: linux
osImage: Amazon Linux 2
systemUUID: ec2a8435-02eb-f3f2-8b33-297833ab212b
@e-koma Can you share the pods that are running on the nodes that you believe should be consolidated?
@engedaam Thank you for the reply!

kubectl get pod -o yaml
apiVersion: v1
kind: Pod
metadata:
annotations:
cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
karpenter.sh/do-not-disrupt: "true"
creationTimestamp: "2024-05-17T07:02:37Z"
finalizers:
- batch.kubernetes.io/job-tracking
generateName: transfer-job-6276425-1715929357522-
labels:
app.kubernetes.io/component: transfer-job
app.kubernetes.io/instance: transfer-job-6276425-1715929357522
app.kubernetes.io/managed-by: my-project-manager
app.kubernetes.io/name: transfer-job-6276425
app.kubernetes.io/part-of: transfer
batch.kubernetes.io/controller-uid: ae784a5b-dc5b-4ec0-a3b2-b4dbc370b632
batch.kubernetes.io/job-name: transfer-job-6276425-1715929357522
controller-uid: ae784a5b-dc5b-4ec0-a3b2-b4dbc370b632
job-name: transfer-job-6276425-1715929357522
my-project.io/etl_valid: "false"
name: transfer-job-6276425-1715929357522-c5vdf
namespace: default
ownerReferences:
- apiVersion: batch/v1
blockOwnerDeletion: true
controller: true
kind: Job
name: transfer-job-6276425-1715929357522
uid: ae784a5b-dc5b-4ec0-a3b2-b4dbc370b632
resourceVersion: "111027613"
uid: 5fdd88b1-635a-4bc1-8af0-a63142a241a7
spec:
affinity:
podAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- podAffinityTerm:
labelSelector:
matchExpressions:
- key: job-name
operator: Exists
topologyKey: kubernetes.io/hostname
weight: 10
containers:
- args:
- transfer:run[6276425,false]
envFrom:
- configMapRef:
name: my-project-config
- secretRef:
name: my-project-secret
image: ****.dkr.ecr.ap-northeast-1.amazonaws.com/worker.my-project.io:06b18dd453233106abf4e4690efda2aabc067f84
imagePullPolicy: IfNotPresent
lifecycle:
preStop:
exec:
command:
- sh
- /work/scripts/terminate-worker.sh
name: worker
resources:
limits:
cpu: 3500m
memory: 15Gi
requests:
cpu: 3500m
memory: 15Gi
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /tmp
name: tmp-volume
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access-tfx7l
restartCount: 0
started: false
state:
terminated:
containerID: containerd://87f3903744c817d80103abde3e82012a7aacbc672fcbc21c40aff0af886f68ef
exitCode: 0
finishedAt: "2024-05-17T07:16:57Z"
reason: Completed
startedAt: "2024-05-17T07:16:40Z"
hostIP: 10.80.96.71
phase: Running
podIP: 10.80.106.135
podIPs:
- ip: 10.80.106.135
qosClass: Guaranteed
startTime: "2024-05-17T07:16:39Z"

kubectl describe pod
Name: transfer-job-6276425-1715929357522-c5vdf
Namespace: default
Priority: 0
Service Account: my-project-transfer-job
Node: ip-10-80-96-71.ap-northeast-1.compute.internal/10.80.96.71
Start Time: Fri, 17 May 2024 07:16:39 +0000
Labels: app.kubernetes.io/component=transfer-job
app.kubernetes.io/instance=transfer-job-6276425-1715929357522
app.kubernetes.io/managed-by=my-project-manager
app.kubernetes.io/name=transfer-job-6276425
app.kubernetes.io/part-of=transfer
batch.kubernetes.io/controller-uid=ae784a5b-dc5b-4ec0-a3b2-b4dbc370b632
batch.kubernetes.io/job-name=transfer-job-6276425-1715929357522
controller-uid=ae784a5b-dc5b-4ec0-a3b2-b4dbc370b632
job-name=transfer-job-6276425-1715929357522
my-project.io/etl_valid=false
Annotations: cluster-autoscaler.kubernetes.io/safe-to-evict: false
karpenter.sh/do-not-disrupt: true
Status: Succeeded
IP:
IPs: <none>
Controlled By: Job/transfer-job-6276425-1715929357522
Containers:
worker:
Container ID: containerd://87f3903744c817d80103abde3e82012a7aacbc672fcbc21c40aff0af886f68ef
Image: ****.dkr.ecr.ap-northeast-1.amazonaws.com/worker.my-project.io:06b18dd453233106abf4e4690efda2aabc067f84
Image ID: sha256:de357122189ee012f710a36a4f864ad45b292dff53a6af61da63d003b86238aa
Port: <none>
Host Port: <none>
Args:
transfer:run[6276425,false]
State: Terminated
Reason: Completed
Exit Code: 0
Started: Fri, 17 May 2024 07:16:40 +0000
Finished: Fri, 17 May 2024 07:16:57 +0000
Ready: False
Restart Count: 0
Limits:
cpu: 3500m
memory: 15Gi
Requests:
cpu: 3500m
memory: 15Gi
Environment Variables from:
my-project-config ConfigMap Optional: false
my-project-secret Secret Optional: false
Environment: <none>
Mounts:
/tmp from tmp-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-tfx7l (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
tmp-volume:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: 200Gi
kube-api-access-tfx7l:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Guaranteed
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 14m default-scheduler 0/3 nodes are available: 1 Insufficient cpu, 1 Insufficient memory, 1 node(s) had untolerated taint {ToBeDeletedByClusterAutoscaler: 1715929322}, 1 node(s) had untolerated taint {my-project.io/control-plane: true}. preemption: 0/3 nodes are available: 1 No preemption victims found for incoming pod, 2 Preemption is not helpful for scheduling..
Warning FailedScheduling 13m (x7 over 14m) default-scheduler 0/2 nodes are available: 1 Insufficient cpu, 1 node(s) had untolerated taint {my-project.io/control-plane: true}. preemption: 0/2 nodes are available: 1 No preemption victims found for incoming pod, 1 Preemption is not helpful for scheduling..
Warning FailedScheduling 12m default-scheduler 0/4 nodes are available: 1 Insufficient cpu, 1 node(s) had untolerated taint {my-project.io/control-plane: true}, 2 node(s) had untolerated taint {node.kubernetes.io/not-ready: }. preemption: 0/4 nodes are available: 1 No preemption victims found for incoming pod, 3 Preemption is not helpful for scheduling..
Warning FailedScheduling 11m (x2 over 12m) default-scheduler 0/5 nodes are available: 1 Insufficient cpu, 1 node(s) had untolerated taint {my-project.io/control-plane: true}, 3 node(s) had untolerated taint {node.kubernetes.io/not-ready: }. preemption: 0/5 nodes are available: 1 No preemption victims found for incoming pod, 4 Preemption is not helpful for scheduling..
Warning FailedScheduling 11m default-scheduler 0/5 nodes are available: 1 node(s) had untolerated taint {my-project.io/control-plane: true}, 2 Insufficient cpu, 2 node(s) had untolerated taint {node.kubernetes.io/not-ready: }. preemption: 0/5 nodes are available: 2 No preemption victims found for incoming pod, 3 Preemption is not helpful for scheduling..
Warning FailedScheduling 10m (x3 over 11m) default-scheduler 0/5 nodes are available: 1 Insufficient memory, 1 node(s) had untolerated taint {my-project.io/control-plane: true}, 4 Insufficient cpu. preemption: 0/5 nodes are available: 1 Preemption is not helpful for scheduling, 4 No preemption victims found for incoming pod..
Warning FailedScheduling 10m (x6 over 11m) default-scheduler 0/5 nodes are available: 1 node(s) had untolerated taint {my-project.io/control-plane: true}, 4 Insufficient cpu. preemption: 0/5 nodes are available: 1 Preemption is not helpful for scheduling, 4 No preemption victims found for incoming pod..
Warning FailedScheduling 4m35s (x11 over 9m49s) default-scheduler 0/5 nodes are available: 1 node(s) had untolerated taint {my-project.io/control-plane: true}, 2 Insufficient memory, 4 Insufficient cpu. preemption: 0/5 nodes are available: 1 Preemption is not helpful for scheduling, 4 No preemption victims found for incoming pod..
Normal NotTriggerScaleUp 4m9s (x55 over 14m) cluster-autoscaler pod didn't trigger scale-up: 1 node(s) had untolerated taint {my-project.io/control-plane: true}, 1 max node group size reached
Normal Nominated 114s karpenter Pod should schedule on: nodeclaim/node-95dlj
Normal Pulled 36s kubelet Container image "****.dkr.ecr.ap-northeast-1.amazonaws.com/worker.my-project.io:06b18dd453233106abf4e4690efda2aabc067f84" already present on machine
Normal Created 36s kubelet Created container worker
Normal Started 36s kubelet Started container worker
After these pods stop, some nodes are disrupted as expected, while others are not.
@e-koma From the looks of it, the pod seems to have a podAffinity. That is most likely the reason why Karpenter decided not to consolidate the node. This is called out in our docs here: https://karpenter.sh/docs/concepts/disruption/#consolidation
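If I am reading the pod YAML above correctly, two settings on it interact with consolidation, recapped below: the karpenter.sh/do-not-disrupt annotation, which prevents voluntary disruption of the node while the pod is running, and the preferred podAffinity, which the consolidation simulation still tries to satisfy when it plans where the pods would go.

# Recap of the relevant fields from the pod above (not new configuration)
metadata:
  annotations:
    karpenter.sh/do-not-disrupt: "true"            # blocks voluntary disruption while the pod runs
spec:
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 10
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: job-name
              operator: Exists
          topologyKey: kubernetes.io/hostname      # prefer packing job pods onto the same host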
@engedaam Details
I would appreciate your help in investigating what is happening with these two nodes.
@e-koma Can you share the details on the nodes that were not disrupted?
Sure, I'll try to recreate it around this Friday!
@engedaam

Node Info
kubectl describe nodeclaim node-68kbk
kubectl describe node
Now, from the results of reproducing the issue, I found the problematic part.
It seems that the … However, if this is the reason, then everyone using the ebs-csi-driver addon should encounter the same issue.
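In case it helps anyone narrow this down, the commands below list exactly what is running on a stuck node and what Karpenter has recorded about it; DaemonSet pods (such as the ebs-csi node driver pods) are normally not a consolidation blocker on their own, since they are expected on every node. Substitute the Node and NodeClaim names from your own dump — the ones here are reused from the dumps above as placeholders.

# Pods Karpenter's simulation has to find a new home for:
kubectl get pods --all-namespaces -o wide \
  --field-selector spec.nodeName=ip-10-80-87-240.ap-northeast-1.compute.internal

# Events recorded against the NodeClaim and the Node often say why disruption is blocked:
kubectl describe nodeclaim node-68kbk
kubectl get events -A --field-selector involvedObject.kind=Node,involvedObject.name=ip-10-80-87-240.ap-northeast-1.compute.internal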
Description
Observed Behavior:
I have an underutilized node provisioned by Karpenter and it never gets disrupted.
Note: I have spotToSpotConsolidation enabled
Expected Behavior:
I expect this node to be disrupted
Reproduction Steps (Please include YAML):
Provision a node to roughly 60% utilization, for example. Then scale down the resources/pods to about 10% and observe whether the node gets disrupted.
Versions:
Kubernetes Version (kubectl version): 1.27.11
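On the spot-to-spot case specifically: as far as I understand, spot-to-spot consolidation sits behind a feature gate, and Karpenter only performs it when the NodePool leaves enough instance-type flexibility for the replacement (the docs mention a minimum number of instance-type options, 15 if I remember correctly), so a tightly constrained NodePool may simply never qualify. A sketch of the Helm values I believe enable the gate — treat the exact key names as an assumption and verify them against the chart version in use:

# Helm values sketch for the Karpenter chart (key names to be verified)
settings:
  featureGates:
    spotToSpotConsolidation: true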