
upgrade-health-check Job fails on a single control plane node cluster after drain #3050

Closed
ilia1243 opened this issue Apr 23, 2024 · 12 comments
Assignees
Labels
area/upgrades · kind/bug · kind/regression · priority/important-soon
Milestone

Comments

@ilia1243

Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

kubeadm version (use kubeadm version): v1.30.0

Environment:

  • Kubernetes version (use kubectl version): v1.30.0
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release): Ubuntu 22.04.1 LTS
  • Kernel (e.g. uname -a): 5.15.0-50-generic
  • Container runtime (CRI) (e.g. containerd, cri-o): containerd=1.6.12-0ubuntu1~22.04.3
  • Container networking plugin (CNI) (e.g. Calico, Cilium): calico
  • Others:

What happened?

  1. Install a single control plane node cluster v1.29.1
  2. Drain the only node
  3. kubeadm upgrade apply v1.30.0 fails with
[ERROR CreateJob]: Job "upgrade-health-check-lvr8s" in the namespace "kube-system" did not complete in 15s: client rate limiter Wait returned an error: context deadline exceeded

It seems that previously the pod was Pending as well, but this was ignored, because the job was successfully deleted in the defer and the return value was overridden with nil.

defer func() {
    lastError = deleteHealthCheckJob(client, ns, jobName)
}()

https://github.com/kubernetes/kubernetes/blob/v1.29.1/cmd/kubeadm/app/phases/upgrade/health.go#L151
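
For illustration, a minimal standalone Go sketch (the function names are simplified stand-ins, not the actual kubeadm code) of how a deferred assignment to a named return value masks the earlier error:

package main

import (
	"errors"
	"fmt"
)

// createHealthCheckJob stands in for the real check: on a drained
// single-node cluster the job pod stays Pending, so an error is returned.
func createHealthCheckJob() error {
	return errors.New("job pod is Pending")
}

// deleteHealthCheckJob stands in for the deferred cleanup, which succeeds.
func deleteHealthCheckJob() error {
	return nil
}

// runHealthCheck mirrors the v1.29 pattern: the deferred assignment to the
// named return value lastError unconditionally overwrites the Pending error
// with the (nil) result of the cleanup.
func runHealthCheck() (lastError error) {
	defer func() {
		lastError = deleteHealthCheckJob()
	}()
	return createHealthCheckJob()
}

func main() {
	fmt.Println(runHealthCheck()) // prints "<nil>": the original error is masked
}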

Similar issue #2035

What you expected to happen?

There might be no need to create the job.

How to reproduce it (as minimally and precisely as possible)?

See What happened?

@ilia1243 ilia1243 changed the title upgrade-health-check Job fails on a single control plane node cluster with drain upgrade-health-check Job fails on a single control plane node cluster after drain Apr 23, 2024
@neolit123 neolit123 added the kind/bug and priority/important-soon labels Apr 23, 2024
@neolit123 neolit123 added this to the v1.31 milestone Apr 23, 2024
@neolit123
Member

thanks for testing @ilia1243

this is a tricky problem. but either way, there is a regression in 1.30 that we need to fix.

the only problem here seems to be with the CreateJob logic.

k drain node
sudo kubeadm upgrade apply -f v1.30.0 --ignore-preflight-errors=CreateJob

^ this completes the upgrade of a single node CP and addons are applied correctly.
but the CreateJob check will always fail.

one option is to skip this check if there is a single CP node in the cluster.
WDYT?

cc @SataQiu @pacoxu @carlory

@neolit123
Member

neolit123 commented Apr 23, 2024

one option is to skip this check if there is a single CP node in the cluster.

another option (for me less preferred) is to make the CreateJob health check return a warning instead of an error.
it will always show a warning on a single node CP cluster.

@pacoxu
Member

pacoxu commented Apr 23, 2024

+1 for skip

I need to check it again. I think I failed for another reason: I did not install a CNI on the control plane, and the pod failed because there was no CNI (not sure if this is a general use case; in that case the job should be run with hostNetwork). I will do some tests tomorrow.
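
For reference, opting the job's pod into host networking would amount to setting the corev1 PodSpec field below; the helper name is made up for illustration, and whether kubeadm should do this is left open above.

package sketch

import corev1 "k8s.io/api/core/v1"

// useHostNetwork sketches the aside above: if the health check pod should not
// depend on a CNI being installed yet, its pod spec could run on the host network.
func useHostNetwork(spec *corev1.PodSpec) {
	spec.HostNetwork = true
}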

@SataQiu
Member

SataQiu commented Apr 24, 2024

My suggestion is to print a warning and skip Job creation when there are no nodes to schedule. WDYT?

@neolit123
Member

My suggestion is to print a warning and skip Job creation when there are no nodes to schedule. WDYT?

IIUC, the only way to test whether a job pod can be scheduled somewhere is to try to create such a job? the problem is that this preflight check's purpose is exactly that - to check whether the cluster accepts workloads.

i don't even remember why we added it, but now we need to fix it right away. perhaps later we can discuss removing it.

@neolit123
Member


we could look at Unschedulable taints on nodes, which means they were cordoned.

but listing all nodes on every kubeadm upgrade command would be very expensive in large clusters with many nodes.

so i am starting to think we should just convert this check to a preflight warning.
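
A rough client-go sketch of the "look at Unschedulable" idea (the function is illustrative, not what was eventually merged), which shows why it needs a full node List:

package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// countCordonedNodes lists all nodes and counts the cordoned ones. As noted
// above, a full List on every kubeadm upgrade command is expensive in large
// clusters, which is why this approach was not preferred.
func countCordonedNodes(ctx context.Context, client kubernetes.Interface) (int, error) {
	nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return 0, err
	}
	cordoned := 0
	for _, node := range nodes.Items {
		// kubectl cordon/drain sets spec.unschedulable, which also puts the
		// node.kubernetes.io/unschedulable:NoSchedule taint on the node.
		if node.Spec.Unschedulable {
			cordoned++
		}
	}
	return cordoned, nil
}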

@neolit123 neolit123 modified the milestones: v1.31, v1.30 Apr 24, 2024
@neolit123 neolit123 added the kind/regression label Apr 24, 2024
@neolit123 neolit123 assigned neolit123 and unassigned carlory Apr 24, 2024
@carlory
Member

carlory commented Apr 24, 2024

I'm not sure whether this is the expected patch for this issue?

i.e. add a new toleration to the job, like the following (see the sketch below)

{key=node.kubernetes.io/unschedulable, effect:NoSchedule} 

FYI:

Or just convert this check to a preflight warning?
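
For concreteness, a sketch of the toleration suggestion above, using the batchv1/corev1 types (the helper name is made up; this is not the patch that was merged):

package sketch

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// tolerateUnschedulable adds a toleration for the taint that cordoning puts on
// a node, so the health check pod could still be scheduled on a drained node.
// Whether that is desirable is exactly the open question in this thread.
func tolerateUnschedulable(job *batchv1.Job) {
	job.Spec.Template.Spec.Tolerations = append(job.Spec.Template.Spec.Tolerations,
		corev1.Toleration{
			Key:      "node.kubernetes.io/unschedulable",
			Operator: corev1.TolerationOpExists,
			Effect:   corev1.TaintEffectNoSchedule,
		})
}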

@neolit123
Member

Or just convert this check to a preflight warning?

i have a WIP PR for this.

I'm not sure whether this is the expected patch for this issue?

i don't know... ideally a node should be drained before upgrading kubelet.
so if we allow pods to schedule after the node is drained with the {key=node.kubernetes.io/unschedulable, effect:NoSchedule} hack, we are breaking this rule. i don't even know if it will work.

we do upgrade coredns and kube-proxy for a single node cluster while the node is drained with kubeadm upgrade apply, but we ignore daemon sets anyway, and the coredns pods will remain Pending if the node is not schedulable. so technically, for the addons, we don't schedule new pods IIUC.

@neolit123
Member

i have a WIP PR for this.

please see kubernetes/kubernetes#124503
and my comments there.

@neolit123
Member

@carlory came up with a good idea how to catch the scenario.
kubernetes/kubernetes#124503 (comment)
the PR is updated.

more reviews are appreciated.

@neolit123
Member

fix will be added to 1.30.1
kubernetes/kubernetes#124570

@neolit123
Member

1.30.1 is out with the fix
