Lock TTL not being honored #822

Open
j-rust opened this issue Aug 22, 2023 · 5 comments
Labels: aks, bug, keep (this won't be closed by the stale bot)

Comments

@j-rust

j-rust commented Aug 22, 2023

We're seeing instances where the lock-ttl flag is not honored and the lock is held indefinitely, blocking all reboots until the lock is manually removed. Is this a known issue, and is there a workaround available?

kured command (running version 1.13.0):

Command:
      /usr/bin/kured
      --ds-namespace=kured
      --period=1h0m0s
      --lock-release-delay=15m
      --lock-ttl=60m
      --reboot-days=mon,tue,wed,thu
      --start-time=08:00
      --end-time=17:00
      --time-zone=America/Los_Angeles

Annotation showing the lock with an incorrect TTL:

weave.works/kured-node-lock:
                  {"nodeID":"aks-support-38204793-vmss00006n","metadata":{"unschedulable":false},"created":"2023-08-22T15:54:59.276872543Z","TTL":3600000000...

@ckotzbauer
Member

Hi @j-rust,
thanks for reporting this. We currently don't have an open issue for that. Can you please try updating to 1.14.0? We adjusted the lock logic there to allow concurrent reboots; however, the TTL logic itself was untouched. If the problem still persists, we will have to investigate further.

@j-rust
Author

j-rust commented Aug 23, 2023

Thanks @ckotzbauer, I'll test 1.14.0 out on a few of our clusters and let you know if we're still seeing the same issue.

@j-rust
Author

j-rust commented Aug 24, 2023

Hey @ckotzbauer, I deployed version 1.14.0 but I'm still seeing the incorrect TTL. It looks like the node holding the lock no longer exists, so I think the incorrect lock TTL appears when the cluster autoscaler removes a node that is holding the lock.

weave.works/kured-node-lock:
                  {"nodeID":"aks-support-38204793-vmss000070","metadata":{"unschedulable":false},"created":"2023-08-24T15:11:51.003289538Z","TTL":3600000000...

@poochwashere

I had a similar issue where a lock was in place for a node that was removed by the cluster autoscaler before I had the TTL setting enabled. I expected that when I updated the helm chart with the setting lockTtl: 30m it would remove said lock after 30 minutes.

Days later I had to manually delete the lock annotation on the DS.

@github-actions

This issue was automatically considered stale due to lack of activity. Please update it and/or join our slack channels to promote it, before it automatically closes (in 7 days).
