Kured says reboot not required even though there is a reboot-require file present on the kubernetes cluster linux node #787

Open
deepaknani007 opened this issue Jun 20, 2023 · 15 comments
Labels
keep (This won't be closed by the stale bot.), question

Comments

@deepaknani007

I deployed the latest release of kured (1.13.1) to an Azure Kubernetes cluster running Kubernetes v1.26.3 almost a month ago. I don't see any reboot-required file being created on the nodes, so I created a dummy "reboot-required" file in "/var/run" on all nodes of the cluster. Unfortunately, the nodes are not rebooting, and the kured pod logs say that a reboot is not required.

Creating the dummy /var/run/reboot-required file:
image

Kured pod logs:
image

@deepaknani007
Author

Do we need to have cluster auto-upgrade enabled with node-image for kured to work?

@ckotzbauer
Member

Hi @deepaknani007,
thanks for the bug report. Does this behaviour still occur? It's hard to evaluate why the file was not detected. No further configuration is required on the infrastructure to make kured work.

@jorgelon

jorgelon commented Aug 11, 2023

Same problem in a kubeadm-deployed cluster with Flatcar stable.
The file is there on all nodes but there are no reboots. It is a vanilla installation of kured on Kubernetes v1.27.2.

time="2023-08-11T10:01:27Z" level=info msg="Binding node-id command flag to environment variable: KURED_NODE_ID"
time="2023-08-11T10:01:27Z" level=info msg="Kubernetes Reboot Daemon: 1.13.2"
time="2023-08-11T10:01:27Z" level=info msg="Node ID: XXX"
time="2023-08-11T10:01:27Z" level=info msg="Lock Annotation: kube-system/kured:weave.works/kured-node-lock"
time="2023-08-11T10:01:27Z" level=info msg="Lock TTL not set, lock will remain until being released"
time="2023-08-11T10:01:27Z" level=info msg="Lock release delay not set, lock will be released immediately after rebooting"
time="2023-08-11T10:01:27Z" level=info msg="PreferNoSchedule taint: "
time="2023-08-11T10:01:27Z" level=info msg="Blocking Pod Selectors: []"
time="2023-08-11T10:01:27Z" level=info msg="Reboot schedule: SunMonTueWedThuFriSat between 00:00 and 23:59 UTC"
time="2023-08-11T10:01:27Z" level=info msg="Reboot check command: [test -f /var/run/reboot-required] every 1h0m0s"
time="2023-08-11T10:01:27Z" level=info msg="Reboot command: [/bin/systemctl reboot]"

Later I can see "reboot not required".

@ckotzbauer
Member

Okay, thanks for this information. We will do a release later this month, once Kubernetes has released its next minor version. #806 will be included there; it adds a warning log for non-1 exit codes of the sentinel-check command. Maybe something is crashing here on your hosts, which would cause kured to skip reboots.

@jorgelon

In my case, the reboot finally started without me changing anything. I do not know the reason. If I find something, I will tell you.

@jorgelon

jorgelon commented Sep 13, 2023

A new Flatcar release and the same problem. The /var/run/reboot-required file exists but there are no reboots.
I am working with the 1.14.0 release and I cannot find a way to debug this.
How does kured check whether that file exists?

The Grafana dashboard shows that the nodes need to be rebooted.

@ckotzbauer
Member

@jorgelon Do you see the following warn-log in the kured-pod-logs?

sentinel command ended with unexpected exit code

This was added in 1.14.0. The problem with host commands is that we don't know what happens on the host. When the command crashes with an unexpected error or is blocked by a security tool (e.g. aquasec, ...), this warning log is the only indicator. Maybe you can check your host logs for abnormalities around the times the check runs.

@jorgelon

Nope @ckotzbauer
I do not see that log.
I only see the same output as in my Aug 11 comment.

@ckotzbauer
Member

Okay, that's sad. Then it will be very hard to figure out why the file is not detected. Kured logs the output of the "test -f" command and logs a warning when the exit code is unexpected. So it seems that the command either crashes silently (maybe something is logged in the syslog) or simply exits with the exit code that indicates no reboot is required (even though the file exists).
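
To make the exit-code handling described above concrete, here is a minimal sketch, not kured's actual code. It assumes the convention discussed in this thread: exit code 0 from "test -f" means the sentinel exists (reboot required), 1 means it does not, and anything else is unexpected. The function name classify_sentinel is hypothetical.

```shell
#!/bin/sh
# Sketch only: classify the exit code of a "test -f" sentinel check.
# Assumed convention: 0 = file present (reboot required),
# 1 = file absent (reboot not required), anything else = unexpected.
classify_sentinel() {
    sentinel_path="${1:-/var/run/reboot-required}"
    # Capture the exit code without aborting under "set -e".
    if test -f "$sentinel_path"; then
        sentinel_rc=0
    else
        sentinel_rc=$?
    fi
    case "$sentinel_rc" in
        0) echo "reboot required" ;;
        1) echo "reboot not required" ;;
        *) echo "unexpected exit code $sentinel_rc" ;;
    esac
}
```

Running classify_sentinel directly on the node (not through nsenter) and comparing the result with the kured pod log can help narrow down whether the check itself or the way it is invoked is at fault.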

We will land some bigger security improvements in 1.15.0. From then on we will mount the directory of the reboot file as a host-mount and do a "normal" existence check without "nsenter", which should work more smoothly. But 1.15.0 will be released after Kubernetes 1.29.0 (so in December).


This issue was automatically considered stale due to lack of activity. Please update it and/or join our slack channels to promote it, before it automatically closes (in 7 days).

@ckotzbauer added the keep label (This won't be closed by the stale bot.) and removed the no-issue-activity label on Nov 16, 2023

@jorgelon

Updated to 1.15.0. No changes.

Inside a kured pod on a node with /var/run/reboot-required present:

/tmp # /usr/bin/nsenter -m/proc/1/ns/mnt -- test -f /var/run/reboot-required
/tmp # echo $?
0
/tmp # wget -qO- 127.0.0.1:8080/metrics | grep ^kured
kured_reboot_required{node="XXXXXX"} 0
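
The mismatch above (the manual check exits 0, yet the metric reports 0) can be checked in one step. The following is a rough sketch under the assumption that the metrics endpoint returns Prometheus text format with a single kured_reboot_required line, as in the output above. The helper names reboot_required_metric and compare_check are hypothetical.

```shell
#!/bin/sh
# Sketch only: read Prometheus text-format metrics on stdin and print the
# value of kured_reboot_required (the last field on the matching line).
reboot_required_metric() {
    awk '/^kured_reboot_required/ { print $NF }'
}

# Compare the sentinel file state with a metric value.
# $1 = sentinel path, $2 = metric value (0 or 1).
compare_check() {
    if test -f "$1"; then
        file_state=1
    else
        file_state=0
    fi
    if [ "$file_state" = "$2" ]; then
        echo "consistent"
    else
        echo "mismatch: file=$file_state metric=$2"
    fi
}

# Usage inside a kured pod (port 8080 as used in this thread):
#   value=$(wget -qO- 127.0.0.1:8080/metrics | reboot_required_metric)
#   compare_check /var/run/reboot-required "$value"
```

Note that the sentinel path passed to compare_check has to be one that is visible from where the script runs, so inside the pod this only makes sense if the file is reachable there.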

@jorgelon

jorgelon commented Jan 23, 2024

I have tried using
https://raw.githubusercontent.com/kubereboot/kured/main/kured-rbac.yaml
https://raw.githubusercontent.com/kubereboot/kured/main/kured-ds-signal.yaml

Now I get
wget -qO- 127.0.0.1:8080/metrics | grep ^kured
kured_reboot_required{node="XXX"} 1

/ # /usr/bin/nsenter -m/proc/1/ns/mnt -- test -f /var/run/reboot-required
nsenter: can't open '/proc/1/ns/mnt': Permission denied

The kured-1.15.0-dockerhub.yaml does not mount anything from the host.

Still no reboots

@ckotzbauer
Copy link
Member

ckotzbauer commented Jan 23, 2024

Thanks @jorgelon for coming back to this thread,
with 1.15.0, nsenter is no longer needed to check the host's /var/run/reboot-required.

You have two options now:

  1. Use the kured-ds.yaml with the host-mount (or the kured-1.15.0-dockerhub.yaml - I updated the file just now and added the missing host-mount - thanks for the hint). This should be more stable than nsenter. kured then checks for /sentinel/reboot-required.
  2. Use the new signal method. This also works with the host-mount and does not use nsenter at all. Why did you try the nsenter command inside this pod? It is intended not to work there.

@jorgelon

jorgelon commented Jan 25, 2024

Right now I am using the Helm chart with the default values.yaml to see if I get different results.
I have 4 Flatcar nodes. Only 2 need a reboot, and the pods in the DaemonSet show the correct result with
wget -qO- 127.0.0.1:8080/metrics | grep ^kured
The response on the nodes that need a reboot is 1.
test -f /sentinel/reboot-required returns 0.

But nothing happens. No reboot, no log, no annotations.
I will keep investigating.

My question is how /bin/systemctl reboot is performed if that binary does not exist in the kured pods.

@ckotzbauer
Member

ckotzbauer commented Jan 25, 2024

The binary is not called inside the pod; it's called with nsenter on the host.
Does the problem persist with the "signal" method and the helm-chart?
Does the pod still write "Reboot not required"?
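
For illustration, here is a small sketch of what "called with nsenter on the host" could look like. It assumes the same wrapping as the manual nsenter call earlier in this thread (entering the mount namespace of PID 1) and only prints the resulting command, it never runs it. The helper name host_cmd is hypothetical, and this is not necessarily how kured builds the command internally.

```shell
#!/bin/sh
# Sketch only: print (do not execute) a command wrapped so that it would run
# in the host's mount namespace, mirroring the manual nsenter call above.
host_cmd() {
    printf '%s' "/usr/bin/nsenter -m/proc/1/ns/mnt --"
    for arg in "$@"; do
        printf ' %s' "$arg"
    done
    printf '\n'
}

host_cmd /bin/systemctl reboot
# prints: /usr/bin/nsenter -m/proc/1/ns/mnt -- /bin/systemctl reboot
```

This is why the pod image does not need to ship systemctl itself: the binary is resolved on the host's filesystem once the mount namespace has been entered, which in turn requires a pod that is allowed to open /proc/1/ns/mnt.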
