
Allow administrators to guide upgrade ordering #2059

Open
cgwalters opened this issue Sep 3, 2020 · 23 comments
Labels
lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@cgwalters
Member

Today the MCO makes no attempt to order which nodes it updates among the candidates. One problem we're thinking about (particularly in bare metal scenarios, where a node may host a lot of pods, possibly pods that are expensive to reschedule like CNV) is that workloads can quite easily be disrupted multiple times during a single OS upgrade.

When we go to drain a node, its pods are rescheduled across the remaining nodes... and then we upgrade one of those nodes, quite possibly moving the same workload pods again, and so on.

One idea here is to add the minimal hooks such that a separate controller could influence this today.

If, for example, we supported a label like machineconfig.openshift.io/upgrade-weight=42 and the node controller picked the highest-weight node, then a separate controller could also mark the next $number nodes in the upgrade ordering as unschedulable, ensuring that the drain from the current node doesn't land on them.

Without excess capacity, or without changing the scheduler to more strongly prefer packing nodes, it seems hard to avoid multiple disruptions, but the label would allow this baseline integration.
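
A minimal sketch of how an admin or external controller might drive this, assuming the proposed `machineconfig.openshift.io/upgrade-weight` label existed (it is only a proposal in this issue, and the node names are illustrative):

```bash
# Hypothetical: prefer this node for the next update by giving it the highest weight.
oc label node/worker-3 machineconfig.openshift.io/upgrade-weight=42 --overwrite

# Mark the nodes that would be next in the ordering as unschedulable, so the drain
# from the node currently being updated does not land on them.
oc adm cordon worker-4
oc adm cordon worker-5
```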

@cgwalters
Member Author

Also strongly related to this is #2035 - if anyone wants to avoid multiple upgrade disruptions by scaling up new nodes, today those new nodes start at the old config, so they will shortly be upgraded and rebooted themselves. I think other Kubernetes distros tend to do upgrades by creating a new node at the new config and deleting the old node, so they're less subject to this (but that model can't easily apply on bare metal). If we had that PR, we could at least have a phase like:

  • upgrade one node
  • some sort of linkup between MCO and machineAPI says "OK I want some upgrade burst capacity" and scales up the worker machineset, which now has new nodes at the new config
  • old nodes could be targeted for deletion instead of being upgraded

In combination with the above we could allow admins to choose between in-place updates and the burst upgrade strategy at least in cloud.
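
A rough illustration of the burst step on a cloud cluster, assuming a worker MachineSet (the name is a placeholder); the MCO/machineAPI "linkup" itself does not exist today:

```bash
# Add burst capacity; new Machines boot at the pool's current (already updated)
# rendered config.
oc scale machineset/<worker-machineset> --replicas=5 -n openshift-machine-api

# Old nodes would then become candidates for deletion rather than in-place
# upgrade, after which the MachineSet can be scaled back down.
```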

@smarterclayton
Contributor

smarterclayton commented Sep 3, 2020

I will spawn this as a separate issue, but OpenShift should bias the scheduler to prefer nodes with capacity that are “up to date” over nodes that might have more capacity but are “not up to date”, because a not-up-to-date node will eventually need to be drained. That implies there should be a label or other marker on nodes that, by convention, indicates a node that is “not up to date and likely to be drained in the near future” (which the MCO would manage), and the scheduler should avoid such nodes without completely overriding other scheduling criteria. This is a soft criterion, because it is better to disrupt a workload twice than to let the workload dip below the preferred availability of the cluster.
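
There is no such marker today, but the soft preference described here could be approximated per workload with a preferred node affinity against a hypothetical "pending drain" label (the label name below is invented for illustration):

```bash
oc apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: affinity-example
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 50
        preference:
          matchExpressions:
          - key: machineconfiguration.openshift.io/pending-drain   # made-up label
            operator: DoesNotExist
  containers:
  - name: app
    image: registry.example.com/app:latest
EOF
```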

@cgwalters
Member Author

cgwalters commented Sep 10, 2020

There's a spectrum here - some organizations might want to use this capability to guide node upgrades, e.g. upgrading nodes hosting less critical services first.

I think, though, that as long as the "upgrade ordering controller" ensures that at least one control plane machine or one worker is upgradable, the MCO can generally work fine. We would also have a requirement that, for major version upgrades, the control plane is upgraded first. (It seems simpler to apply the constraint across the board that both a control plane machine and a worker are always upgradable.)

The precise definition of "upgradable" today comes down to this:

```go
// capacity. It is the responsibility of the caller to choose a subset of the nodes given the capacity.
```

There's ongoing work/design on having the MCO actually manage its Upgradeable flag - we should set it to false if we detect the control plane or worker pool drifting too far, which would help here.
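
If/when the MCO reports that condition, admins could watch it on the operator's ClusterOperator object; a sketch (the condition may not be populated on current releases):

```bash
# Print the Upgradeable condition on the machine-config ClusterOperator, if set.
oc get clusteroperator/machine-config \
  -o jsonpath='{.status.conditions[?(@.type=="Upgradeable")]}'
```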

@cgwalters
Member Author

Another important detail here is not just the ordering but controlling when any updates are applied. There are a few approaches to that today:

Another big hammer is to create lots of machineconfigpools, perhaps even to the point of a pool per node. This is a pretty extreme scenario for the MCO; I think it would probably work, but it would create a lot of extra overhead.
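
For reference, the pool-per-node (or pool-per-batch) hammer is roughly the standard custom-pool pattern, repeated many times; a sketch with illustrative names:

```bash
oc apply -f - <<'EOF'
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-batch-1
spec:
  machineConfigSelector:
    matchExpressions:
    - key: machineconfiguration.openshift.io/role
      operator: In
      values: [worker, worker-batch-1]
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker-batch-1: ""
EOF

# Move a node into the new pool by labeling it:
oc label node/worker-3 node-role.kubernetes.io/worker-batch-1=
```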

Better approaches:

@michaelgugino
Contributor

* [kubernetes/enhancements#1411](https://github.com/kubernetes/enhancements/pull/1411) ?

One possible implementation using the proposed lease might look like:

  1. Prior to upgrading, all nodes that may only be upgraded during a maintenance window have the maintenance lease acquired by either the administrator or a controller that the user provides (which we may provide in the future; bare metal wants to build something very similar). This may be part of an upgrade-specific pre-flight that administrators perform on their cluster, or the lease may be acquired 100% of the time (though that would be something of an anti-pattern); this is up to the cluster administrator.
  2. MCO respects the lease. If someone other than the MCO holds the lease, MCO takes no action on that host.
  3. Administrator or controller releases the lease on one or more nodes at a time they see fit.
  4. Upon next reconcile, MCO acquires the lease for one or more nodes.
  5. MCO performs upgrade.
  6. MCO releases the lease for one or more nodes.

To put it plainly, the lease is an intent to perform or prevent some disruptive action. This exposes an interface on top of which others may build useful components or procedures. It also allows other components to have less knowledge about the MCO, and the MCO to have less knowledge of those other components.
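
As a sketch of what such a per-node lease could look like if the KEP landed (the name, namespace, and holder conventions below are all hypothetical):

```bash
oc apply -f - <<'EOF'
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: maintenance-worker-3        # hypothetical: one lease per node
  namespace: node-maintenance       # hypothetical namespace (would have to exist)
spec:
  holderIdentity: cluster-admin     # whoever currently owns disruption for this node
  leaseDurationSeconds: 3600
EOF
```

With step 2 above, the MCO would simply skip any node whose lease it does not hold.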

@runcom
Member

runcom commented Sep 10, 2020

To put it plainly, the lease is an intent to perform or prevent some disruptive action. This exposes an interface on top of which others may build useful components or procedures. It also allows other components to have less knowledge about the MCO, and the MCO to have less knowledge of those other components.

this sounds just great - I've been looking at the Lease type for a while now and this seems kind of related and super useful

@runcom
Member

runcom commented Sep 10, 2020

also sounds similar to airlock: https://github.com/poseidon/fleetlock#manual-intervention

@cgwalters
Member Author

There's clearly overlap between "upgrade weighting" (this initial proposal) and "upgrade blocking/leases". We could stretch "weighting" into including "blocking" per the above, but that doesn't solve interaction with other components the way leases would (and we have that problem in general with machineAPI).

In all of this remember though it's not just about the single reboot - in the general case I think we should handle "accumulated disruption" where pods are rescheduled onto nodes that are then immediately chosen for update.

Well...OTOH leases would solve at least one "extreme" case where pods are pinned to nodes and we want basically per-node locking and updating at most one controlplane/worker at a time; in that case the admin or controller puts a lease on every other node.

Hmm so I guess I am leaning towards leases.

@crawford
Contributor

I am very much against this ordering mechanism. It's not clear to me how a user will successfully make use of this, short of plotting out on a whiteboard all of the pods and machines and a plan for how pods will be moved and machines updated. That mental image couldn't be farther from what we are trying to do with OpenShift and Kubernetes in general. As Clayton alluded to above, all of this scheduling should be automatic and done without user intervention.

Unless I'm missing something, this is an anti-pattern and we should move in a different direction.

@michaelgugino
Contributor

I am very much against this ordering mechanism. It's not clear to me how a user will successfully make use of this, short of plotting out on a whiteboard all of the pods and machines and a plan for how pods will be moved and machines updated. That mental image couldn't be farther from what we are trying to do with OpenShift and Kubernetes in general. As Clayton alluded to above, all of this scheduling should be automatic and done without user intervention.

Unless I'm missing something, this is an anti-pattern and we should move in a different direction.

This is a story that came up in 4.0 planning. We need a way for users to upgrade X% of their nodes to a given release and let that soak. This is a somewhat different concern from the original issue here.

I'm a fan of informing the scheduler of the kubelet version and having it prioritize the newest ones. It seems like it would be easy to do: do scheduling as normal to find suitable nodes, sort by kubelet version, pick the highest in that list. Of course, if the scheduler only finds the first fit rather than the best fit, then this would be a much larger change (but based on how preemption works, it possibly works in the former case).
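
For context, the kubelet version is already exposed in node status, so the sort the scheduler would need is the same one you can do by hand:

```bash
# List nodes ordered by kubelet version (oldest first).
oc get nodes --sort-by='{.status.nodeInfo.kubeletVersion}' \
  -o custom-columns=NAME:.metadata.name,KUBELET:.status.nodeInfo.kubeletVersion
```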

@cgwalters
Member Author

It's not clear to me how a user will successfully make use of this, short of plotting out on a whiteboard all of the pods and machines and a plan for how pods will be moved and machines updated.

Nothing in this proposal requires that all nodes have a priority. Unlabeled nodes could default to e.g. 0; an admin could then use negative values to order nodes after the default, and positive ones before it - for example, giving a negative priority to a few critical nodes hosting the postgres database.

It also doesn't require planning out how pods move - that's a strongly related but orthogonal thing.

That said, this issue is motivated by at least one request, at the extreme, to do exactly what you're saying - the admin wants to choose which nodes to update. In that case there isn't a concern about interaction with e.g. PDBs, because the pods are pinned to nodes.

@derekwaynecarr
Member

derekwaynecarr commented Sep 16, 2020

I agree that marking the node as no longer schedulable is a nice way of identifying that a node is about to undergo maintenance. I had been looking at a situation where I actually want a pod to schedule to a node; I just don't want the node upgraded. So the ability to cordon a node from an upgrade pool, but not from scheduling, is the semantic I was trying to think through.

@derekwaynecarr
Member

Another observation:

Should we add a rollout strategy concept on a machine config pool? One example rollout strategy would be update by topology, so I could ensure all nodes in zone X are updated before zone Y. This would aid workloads that are zone or location aware.
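
No such field exists in the MCO API today; a purely hypothetical sketch of the shape it might take:

```bash
# Hypothetical API sketch only (printed, not applied): the MCO has no
# rolloutStrategy field today.
cat <<'EOF'
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker
spec:
  rolloutStrategy:                          # hypothetical field
    type: ByTopology
    topologyKey: topology.kubernetes.io/zone
    zoneOrder: ["us-east-1a", "us-east-1b", "us-east-1c"]
EOF
```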

@megan-hall

Hopefully this comment relates to this particular issue. During a cluster upgrade we (UXD) would like to be able to surface in the console the difference between a node that is 'ready and up to date' vs. a node that is 'ready and waiting to update'. Is there something we could do so that users can quickly distinguish between the two? Perhaps an annotation - waiting to update - or something like that?

@wking
Member

wking commented Oct 6, 2020

One example rollout strategy would be update by topology, so I could ensure all nodes in zone X are updated before zone Y. This would aid workloads that are zone or location aware.

This would increase your exposure to zone outages, wouldn't it? What if you had a workload that happened to be incompatible with the incoming MachineConfig? You could have PDBs that protect it from excessive degradation during a randomized rollout, but if it gets squeezed into the old-MachineConfig zone A, zones B and C completely update, and then zone A has an outage, you'd be immediately thrust into an outage for that workload.

@cgwalters
Member Author

cgwalters commented Oct 15, 2020

Initial implementation in #2162 and a separate PR in #2163

cgwalters added a commit to cgwalters/machine-config-operator that referenced this issue Oct 15, 2020
Today the MCO arbitrarily chooses a node to update from the candidates.
We want to allow admins to avoid specific nodes entirely (for as long
as they want) as well as guide upgrade ordering.

This replaces the defunct etcd-specific code with support for a generic
annotation `machineconfiguration.openshift.io/update-order` that allows
an external controller (and/or human) to do both of these.

Setting it to `0` will entirely skip that node for updates.  Otherwise,
higher values are preferred.

Closes: openshift#2059
cgwalters added a commit to cgwalters/machine-config-operator that referenced this issue Oct 15, 2020
Today the MCO arbitrarily chooses a node to update from the candidates.
We want to allow admins to avoid specific nodes entirely.

(Aside: This replaces the defunct etcd-specific ordering code that
 we didn't end up using)

Add an annotation `machineconfiguration.openshift.io/hold` that allows
an external controller (and/or human) to avoid specific nodes.

Related: openshift#2059
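
A sketch of how the annotations proposed in the two commits above might be applied by hand, if those PRs landed as described (neither is a merged or stable API, and the value for `hold` is a guess since the commit message does not specify one):

```bash
# From the first commit: higher update-order is preferred; 0 skips the node entirely.
oc annotate node/worker-2 machineconfiguration.openshift.io/update-order=100 --overwrite
oc annotate node/worker-7 machineconfiguration.openshift.io/update-order=0 --overwrite

# From the second commit: hold a node back from updates (value assumed).
oc annotate node/worker-7 machineconfiguration.openshift.io/hold=true --overwrite
```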
@zvonkok

zvonkok commented Oct 15, 2020

This could help prevent nodes from upgrading to a kernel that is not supported by an out-of-tree driver. Only these special-resource nodes can run the workload, and draining them could kill it, since the "general" CPU nodes cannot handle it.
Customers could then try to build the drivers on the updated nodes without disrupting the special nodes.
The question is how many update cycles such an annotated node would "survive".

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot added the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) on Mar 3, 2021
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot added the lifecycle/rotten label (Denotes an issue or PR that has aged beyond stale and will be auto-closed.) and removed the lifecycle/stale label on Apr 2, 2021
@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci-robot
Contributor

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cgwalters
Member Author

/lifecycle frozen
This keeps coming up

@openshift-ci bot added the lifecycle/frozen label (Indicates that an issue or PR should not be auto-closed due to staleness.) and removed the lifecycle/rotten label on Sep 24, 2021
@cgwalters
Member Author

Today, any admin that wants to fully control node upgrade ordering can:

First, pause the pool controlling those nodes (e.g. worker). The node controller will then not apply desiredConfig annotations to the targeted nodes. Crucially, however, the render controller will still generate new rendered machineconfigs.

So, for example, when you upgrade the cluster to pick up a new kernel security update, the worker pool will update to a new spec.configuration. (Note that if you've paused the worker pool but not the control plane pool, the control plane will still have updated.)

Now, if you want to manually target a node for update, it should work to run `oc edit node/<somenode>` and update the `machineconfiguration.openshift.io/desiredConfig` annotation to the current value of spec.configuration in the worker pool.

The machine-config-daemon (daemonset) running on that node will then react to that annotation and apply it (drain + reboot as needed), the same way it would if the node controller had changed it.
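
A sketch of that manual flow with `oc` (node and rendered-config names are placeholders):

```bash
# 1. Pause the pool so the node controller stops rolling nodes itself.
oc patch machineconfigpool/worker --type merge -p '{"spec":{"paused":true}}'

# 2. After a cluster upgrade, see which rendered config the pool now wants.
oc get machineconfigpool/worker -o jsonpath='{.spec.configuration.name}{"\n"}'

# 3. Hand-pick a node and point it at that rendered config; the MCD on the node
#    drains, applies, and reboots as needed.
oc annotate node/<somenode> --overwrite \
  machineconfiguration.openshift.io/desiredConfig=rendered-worker-<hash>
```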
