allow setting label to nodes about to be upgraded/restarted #3204

Open
ibotty opened this issue Jun 22, 2022 · 8 comments
Labels
lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@ibotty

ibotty commented Jun 22, 2022

Description

Because there is no agreed-upon way to signal to operators that a node is being drained, operators handle it in different ways.
Rook detects a node drain by observing the pods on the node. This works fine but feels a bit fragile.
The problem is that some operators (e.g. the Zalando PostgreSQL Operator) "detect" drains by watching the node's labels. Whenever an expected label (e.g. "node-ready=true") is no longer set, the operator will (try to) fail over to a DB pod on another node.

This is a feature request to update the node's labels when a reboot is about to happen.
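
For illustration, the label-based detection described above could look roughly like the following. This is a minimal sketch, not the Zalando operator's actual code: the function name is hypothetical, the label key and value are just the example from this post, and the node is modelled as a plain dict shaped like the JSON of a Node object.

```python
# Sketch only: the label comes from the example in this post; the function
# name is hypothetical and not part of any operator's API.
READINESS_LABEL = ("node-ready", "true")


def node_looks_drained(node: dict, readiness_label: tuple = READINESS_LABEL) -> bool:
    """Return True when the readiness label is missing or has another value."""
    key, expected = readiness_label
    labels = node.get("metadata", {}).get("labels") or {}
    return labels.get(key) != expected


# Two hand-written Node-like objects for demonstration.
ready_node = {"metadata": {"name": "worker-0", "labels": {"node-ready": "true"}}}
draining_node = {"metadata": {"name": "worker-1", "labels": {}}}

for node in (ready_node, draining_node):
    print(node["metadata"]["name"], node_looks_drained(node))
```

An operator using this kind of check only reacts if something actually changes the label before the drain starts, which is what this request is about.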

Steps to reproduce the issue:

  1. update some machineconfig,
  2. observe machine-config-daemon trying to drain a node,
  3. the drain fails because there is a pdb on a pod on that node,
  4. meanwhile, some operator does not know that the machine is about to be rebooted and does not update the pdb (directly or indirectly),
  5. the node does not get drained.
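
To make the deadlock concrete, here is a small model of why the drain never completes. It is a simplified sketch under stated assumptions: budgets are reduced to a `minAvailable` count (real PDBs can also use `maxUnavailable` or percentages, and the cluster reports the real figure in the PDB's `status.disruptionsAllowed`), and all names and data structures are made up for the example.

```python
def disruptions_allowed(current_healthy: int, min_available: int) -> int:
    """Simplified budget: how many pods may be evicted right now."""
    return max(0, current_healthy - min_available)


def drain_is_blocked(pods_on_node: list, budgets: dict) -> bool:
    """A drain stalls if any pod on the node is covered by an exhausted budget.

    pods_on_node: names of the budgets covering each pod (None = no budget).
    budgets: budget name -> (current_healthy, min_available).
    """
    for budget_name in pods_on_node:
        if budget_name is None:
            continue
        current_healthy, min_available = budgets[budget_name]
        if disruptions_allowed(current_healthy, min_available) == 0:
            return True
    return False


# One database pod on the node; its budget requires the single healthy pod.
budgets = {"db-pdb": (1, 1)}
print(drain_is_blocked(["db-pdb", None], budgets))  # eviction refused, drain retries

# After an operator fails over and a second replica is healthy elsewhere.
budgets["db-pdb"] = (2, 1)
print(drain_is_blocked(["db-pdb", None], budgets))
```

Nothing in this loop tells the operator to act, so without an external signal the first state can persist indefinitely.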

Describe the results you expected:

  1. update some machineconfig,
  2. machine-config-daemon updates the label machineconfiguration.openshift.io/pending-restart from =false to =true,
    3a. an operator removes active workload from the node, removing/updating pdbs that affect the node,
    3b. machine-config-daemon drains the node,
  4. node reboots successfully,
  5. machine-config-daemon sets label machineconfiguration.openshift.io/pending-restart=false.
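
The expected sequence can be sketched as a small simulation. Note the assumptions: the `machineconfiguration.openshift.io/pending-restart` label is only the proposal made in this issue and is not something machine-config-daemon sets today, and every function name below is hypothetical.

```python
# Proposed label from this issue; NOT an existing machine-config-daemon feature.
PENDING_RESTART = "machineconfiguration.openshift.io/pending-restart"


def simulate_update(labels: dict, operator_reacts) -> list:
    """Walk through the proposed sequence and return the ordered events."""
    events = []
    labels[PENDING_RESTART] = "true"   # daemon announces the upcoming restart
    events.append("label=true")
    if operator_reacts(labels):        # operator moves workload, relaxes pdbs
        events.append("operator moved workload")
    events.append("drain")             # drain can now succeed
    events.append("reboot")
    labels[PENDING_RESTART] = "false"  # daemon clears the announcement
    events.append("label=false")
    return events


def example_operator(labels: dict) -> bool:
    """Hypothetical operator hook: act only when a restart is announced."""
    return labels.get(PENDING_RESTART) == "true"


node_labels = {PENDING_RESTART: "false"}
print(simulate_update(node_labels, example_operator))
```

The point of the ordering is that the label flips before the drain starts, so operators get a chance to react while their pods are still running.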
@cgwalters
Member

Hi, thanks for filing this!

This issue relates to the ongoing topic of reboot handling, for which most information/discussion is (AFAIK) sadly trapped in internal-to-RH proprietary systems, because staying open requires relentless commitment and we aren't consistent about that.

machine-config-daemon updating label machineconfiguration.openshift.io/pending-restart=false to =true

I think we should avoid having OpenShift/MCO-specific labels here; we want to interoperate with the rest of the Kubernetes ecosystem.

Rook detects node drain by observing pods on the node. This works fine but feels a bit fragile.

Note that https://kubernetes.io/docs/concepts/architecture/nodes/#graceful-node-shutdown will make this more reliable and we (OCP) plan to roll that out.

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 21, 2022
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 21, 2022
@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci
Contributor

openshift-ci bot commented Nov 21, 2022

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot closed this as completed Nov 21, 2022
@ibotty
Author

ibotty commented Nov 21, 2022

Still relevant.

And reading https://kubernetes.io/docs/concepts/architecture/nodes/#graceful-node-shutdown another time, I don't see how that will help the use case described above. How can Rook know that the node is about to shut down? The only taint (or annotation) that is described is for **non-**graceful shutdown, which the machine-config-daemon will explicitly not do.

@cgwalters: Do I misunderstand the mechanism?

/remove-lifecycle rotten
/lifecycle frozen

@openshift-ci openshift-ci bot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Nov 21, 2022
@ibotty
Author

ibotty commented Nov 21, 2022

/reopen

@openshift-ci
Contributor

openshift-ci bot commented Nov 21, 2022

@ibotty: Reopened this issue.

In response to this:

/reopen


@openshift-ci openshift-ci bot reopened this Nov 21, 2022