
AKS Node not rebooted with lock held for not existing node #847

Open
andres32168 opened this issue Nov 6, 2023 · 10 comments

Comments

@andres32168

Hi,

we're facing an issue with the newest version of Kured 1.14.0.

Nodes are not rebooted

The Prometheus metrics say that a reboot is required, but the file /var/run/reboot-required is not present on the node host:

# HELP kured_reboot_required OS requires reboot due to software updates.
# TYPE kured_reboot_required gauge
kured_reboot_required{node="aks-XXX-XXXXXXXXXX-vmss000000"} 1
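
One way to cross-check this on the node host without SSH is an ephemeral debug container (just a sketch; busybox is an arbitrary image, and kubectl debug mounts the host filesystem under /host):

kubectl debug node/aks-XXX-XXXXXXXXXX-vmss000000 -it --image=busybox -- \
  sh -c 'test -f /host/var/run/reboot-required && echo present || echo absent'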

Adding the file manually to the node results in a "Lock already held" warning for a node that no longer exists:

time="2023-11-06T06:37:59Z" level=info msg="Binding node-id command flag to environment variable: KURED_NODE_ID"
time="2023-11-06T06:37:59Z" level=info msg="Kubernetes Reboot Daemon: 1.14.0"
time="2023-11-06T06:37:59Z" level=info msg="Node ID: aks-XXX-XXXXXXXXXX-vmss000000"
time="2023-11-06T06:37:59Z" level=info msg="Lock Annotation: base-mon/kured:weave.works/kured-node-lock"
time="2023-11-06T06:37:59Z" level=info msg="Lock TTL set, lock will expire after: 30m0s"
time="2023-11-06T06:37:59Z" level=info msg="Lock release delay not set, lock will be released immediately after rebooting"
time="2023-11-06T06:37:59Z" level=info msg="PreferNoSchedule taint: "
time="2023-11-06T06:37:59Z" level=info msg="Blocking Pod Selectors: []"
time="2023-11-06T06:37:59Z" level=info msg="Reboot schedule: ---MonTueWedThu------ between 02:00 and 08:00 UTC"
time="2023-11-06T06:37:59Z" level=info msg="Reboot check command: [test -f /var/run/reboot-required] every 1h0m0s"
time="2023-11-06T06:37:59Z" level=info msg="Concurrency: 1"
time="2023-11-06T06:37:59Z" level=info msg="Reboot command: [/bin/systemctl reboot]"
time="2023-11-06T06:37:59Z" level=info msg="Will annotate nodes during kured reboot operations"
time="2023-11-06T07:09:55Z" level=info msg="Reboot required"
time="2023-11-06T07:09:55Z" level=warning msg="Lock already held: aks-XXX-XXXXXXXXXX-vmss000024"

We set the lockTtl flag via the Helm chart:

configuration:
  endTime: "08:00"                           # only reboot before this time of day (default "23:59") time is UTC
  rebootDays: ["mo", "tu", "we", "th"]       # only reboot on these days (default [su,mo,tu,we,th,fr,sa])
  startTime: "02:00"                         # only reboot after this time of day (default "0:00") time is UTC
  concurrency: 1
  lockTtl: "30m"
  annotateNodes: true
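
For completeness, my understanding is that these values render to kured arguments roughly like the following (a sketch based on the flag names reflected in the startup log above, not the exact chart output):

--start-time=02:00
--end-time=08:00
--reboot-days=mo,tu,we,th
--concurrency=1
--lock-ttl=30m
--annotate-nodes=true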

The nodes do get the annotations:

weave.works/kured-most-recent-reboot-needed: 2023-11-06T02:21:56Z
weave.works/kured-reboot-in-progress: 2023-11-06T02:21:56Z

Do you have any idea why this happens?

I know these messages are from today and there has not been much time between the configuration change and the possible reboot window, but the same thing happened throughout last week; these are just the newest logs after redeploying with an increased endTime.

Thank you in advance
André

@ckotzbauer
Member

This seems to be related to #822. The problem can appear when a lock is held by a node that has been removed from the cluster. However, the metric reporting a reboot as required for a node that does not need one is new, though it may be a result of the incorrect lock behaviour.
We'll have a look in the next few days.
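
In the meantime, you can check which node currently holds the lock (and when it acquired it) by reading the lock annotation on the kured daemonset directly; a sketch using the namespace/daemonset name from your log (the annotation value is plain JSON):

kubectl -n base-mon get ds kured \
  -o jsonpath='{.metadata.annotations.weave\.works/kured-node-lock}'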


github-actions bot commented Jan 6, 2024

This issue was automatically considered stale due to lack of activity. Please update it and/or join our slack channels to promote it, before it automatically closes (in 7 days).

@andres32168
Author

not stale

@gyoza

gyoza commented Feb 14, 2024

This seems to happen when karpenter nodes get rebooted and karpenter nukes them before they come back. This is happening to me on EKS as well, with lock-ttl set to 30m.

@jackfrancis
Collaborator

@gyoza that sounds like a scenario where this could happen.

My main question for folks that are experiencing this is: is the TTL configuration not working? At present, kured makes no guarantee that a node will continue to exist after it successfully acquires the lock (an annotation on the kured daemonset); but it does guarantee that, if you configure a lock TTL, the lock will be released after the TTL expires (whether or not the node that acquired the lock still exists at that time).

Are we seeing different behavior than described above?
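
Concretely, with --lock-ttl=30m the expectation is: if a node acquires the lock at 07:00 and is deleted at 07:05, another node should still be able to acquire the lock from 07:30 onwards. One way to check whether that is happening is to compare the lock annotation's created timestamp against the timestamps of the "Lock already held" warnings; a sketch (adjust the namespace to wherever your kured daemonset runs, and note the exact JSON layout can differ between versions):

kubectl -n kube-system get ds kured \
  -o jsonpath='{.metadata.annotations.weave\.works/kured-node-lock}'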

@gyoza

gyoza commented Feb 14, 2024

@jackfrancis Exactly! I figured the lock would expire whether the node was still around or not, but that does not seem to be the case.

Daemonset:

Image:      ghcr.io/kubereboot/kured:1.15.0
Port:       8080/TCP
Host Port:  0/TCP
Command:
  /usr/bin/kured
Args:
  --ds-name=kured
  --ds-namespace=core
  --metrics-port=8080
  --lock-ttl=30m

Logs:

kured-gzmfr kured time="2024-02-14T07:36:15Z" level=warning msg="Lock already held: replaced-name.deadnode.compute.internal"
kured-8kn2v kured time="2024-02-14T17:14:49Z" level=warning msg="Lock already held: replaced-name.deadnode.compute.internal"

The only way I can get things back to work momentarily is to rollout-restart the daemonset in each context.

@gyoza

gyoza commented Feb 14, 2024

It appears that even after a daemonset rollout restart, that specific node's lock shows up again.

@gyoza

gyoza commented Feb 15, 2024

Is there a way to force-clear the lock manually?
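
From what I can tell from the kured docs, overwriting the lock annotation on the daemonset should release it; a sketch with my namespace/daemonset name (please correct me if this is not the supported way):

kubectl -n core annotate ds kured \
  weave.works/kured-node-lock='{"nodeID":"manual"}' --overwrite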

github-actions bot

This issue was automatically considered stale due to lack of activity. Please update it and/or join our slack channels to promote it, before it automatically closes (in 7 days).

@andres32168
Author

not stale
