Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kured should verify if node is still drained when delaying reboot #929

Open
sebastiangaiser opened this issue May 8, 2024 · 5 comments · May be fixed by #930
Open

Kured should verify if node is still drained when delaying reboot #929

sebastiangaiser opened this issue May 8, 2024 · 5 comments · May be fixed by #930

Comments

@sebastiangaiser
Copy link

Hey,

we had a problem with two controllers draining/uncordon nodes. To make the reboot more graceful, we configured rebootDelay: 120s. During these 2 mins the other controller uncordened the node. Kured rebooted the not drained node afterwards which is not optimal :D
So Kured should verify after the delay and right before the reboot is issued, if the node is still drained or not.

@jackfrancis
Copy link
Collaborator

I'm not sure there is a perfect solution for this. To a certain extent the "rebootDelay" functionality is unscientific: it makes no guarantees about outcomes after the delay. It seems this is generally well understood and folks use this feature as a "best-effort gesture" to better predict drain outcomes that may trail the immediate end of the drain process (though again, not guaranteed).

If there is another controller running on the system that observes cordoned nodes and uncordons them in response, then adding a follow-up drain gets us back to the original situation, where we may succeed in re-draining, but the workloads that are drained need a delay after in order to reboot gracefully; and now the other controller uncordons during that wait period; and so on and so on...

For this particular use case I think you want to disable that controller's power over kured-drained nodes (perhaps it can observe kured node annotations and skip uncordoning such nodes?), or consider not using this sleep option if there are multiple actors competing to cordon and uncordon nodes.

Are either of those options possible in your environment?

@sebastiangaiser
Copy link
Author

The problem is definitely the other controller, not Kured. But despite the problem, how do you think about having this functionality as feature flag?

@evrardjp
Copy link
Collaborator

evrardjp commented May 13, 2024

I agree to not have something specific in Kured, here the other controller is at fault. I don' t think kured should handle such special cases, this is code to maintain for nothing.

However, I think @sebastiangaiser has some kind of point: this behaviour is suboptimal.

Let's admit the drain/cordon is triggered (let's ignore the delay for the sake of the conversation). Does it make sense for kured to trigger the reboot of the node if (just before calling the reboot command) said node is not cordonned?

For me, there is something "weird" happening, and I would hope that kured would not reboot an uncordonned node. That should trigger a big error log somewhere.

What this means, in other words, is that kured is not responsible for what's outside his control, but expects a certain state before reboot. If you do mess it up (with an operator or a manual operation), that's your decision, but it should not allow kured to continue in that case.

The follow up question is whether this makes sense for everyone or gated behind a feature flag, which I would have a tendency to say feature flag, and do not change default behaviour.

@evrardjp
Copy link
Collaborator

@sebastiangaiser what do you think of the above?

@sebastiangaiser
Copy link
Author

sebastiangaiser commented May 21, 2024

I'm also for the feature flag which can only be enabled when rebootDelay > 0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants