
Allow administrators to guide upgrade ordering #2059

Open
cgwalters opened this issue Sep 3, 2020 · 23 comments
Labels
lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@cgwalters
Member

Today the MCO makes no attempt to order which nodes it updates among the candidates. One problem we're thinking about (particularly in bare metal scenarios, where a node may host a lot of pods, possibly pods that are expensive to reschedule like CNV) is that workloads can quite easily be disrupted multiple times during a single OS upgrade.

When we go to drain a node, its pods are rescheduled across the remaining nodes... and then we upgrade one of those nodes, quite possibly moving the same workload pods again, and so on.

One idea here is to add the minimal hooks such that a separate controller could influence this today.

If, for example, we supported a label like machineconfig.openshift.io/upgrade-weight=42 and the node controller picked the highest-weight node, then a separate controller could also mark the next $number nodes in the upgrade ordering as unschedulable, ensuring that the drain from the current node doesn't land on them.

Without excess capacity, or without changing the scheduler to more strongly prefer packing nodes, it seems hard to avoid multiple disruptions, but the label would allow this baseline integration.
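
A minimal sketch of how an admin or external controller might drive this, assuming the proposed `machineconfig.openshift.io/upgrade-weight` label existed (it is only a proposal in this issue, and the node names are illustrative):

```bash
# Hypothetical: prefer this node for the next update by giving it the highest weight.
oc label node/worker-3 machineconfig.openshift.io/upgrade-weight=42 --overwrite

# Mark the nodes that would be next in the ordering as unschedulable, so the drain
# from the node currently being updated does not land on them.
oc adm cordon worker-4
oc adm cordon worker-5
```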

@cgwalters
Member Author

Also strongly related to this is #2035 - if anyone wants to avoid multiple upgrade disruptions by scaling up new nodes, today those new nodes start at the old config, so they will shortly be upgraded and rebooted themselves. I think other Kubernetes distros tend to do upgrades by creating a new node at the new config and deleting the old node, so they're less subject to this (but that model can't easily apply on bare metal). If we had that PR, we could at least have a phase like:

  • upgrade one node
  • some sort of linkup between MCO and machineAPI says "OK I want some upgrade burst capacity" and scales up the worker machineset, which now has new nodes at the new config
  • old nodes could be targeted for deletion instead of being upgraded

In combination with the above we could allow admins to choose between in-place updates and the burst upgrade strategy at least in cloud.
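
A rough illustration of the burst step on a cloud cluster, assuming a worker MachineSet (the name is a placeholder); the MCO/machineAPI "linkup" itself does not exist today:

```bash
# Add burst capacity; new Machines boot at the pool's current (already updated)
# rendered config.
oc scale machineset/<worker-machineset> --replicas=5 -n openshift-machine-api

# Old nodes would then become candidates for deletion rather than in-place
# upgrade, after which the MachineSet can be scaled back down.
```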

@smarterclayton
Contributor

smarterclayton commented Sep 3, 2020

I will spawn this as a separate issue, but OpenShift should bias the scheduler to prefer nodes with capacity that are “up to date” over nodes that might have more capacity but are “not up to date”, because a not-up-to-date node will eventually need to be drained. That implies there should be a label or other marker on nodes that, by convention, indicates a node that is “not up to date and likely to be drained in the near future” (which the MCO would manage), and the scheduler should avoid such nodes without completely overriding other scheduling criteria. This is a soft criterion, because it is better to disrupt a workload twice than to let the workload dip below the preferred availability of the cluster.
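
There is no such marker today, but the soft preference described here could be approximated per workload with a preferred node affinity against a hypothetical "pending drain" label (the label name below is invented for illustration):

```bash
oc apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: affinity-example
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 50
        preference:
          matchExpressions:
          - key: machineconfiguration.openshift.io/pending-drain   # made-up label
            operator: DoesNotExist
  containers:
  - name: app
    image: registry.example.com/app:latest
EOF
```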

@cgwalters
Member Author

cgwalters commented Sep 10, 2020

There's a spectrum here - some organizations might want to use this capability to guide node upgrades, e.g. upgrading nodes hosting less critical services first.

I think, though, that as long as the "upgrade ordering controller" ensures that at least one control plane machine or one worker is upgradable, the MCO can generally work fine. We would also have a requirement that, for major version upgrades, the control plane is upgraded first. (It seems simpler to apply the constraint across the board that both a control plane machine and a worker are always upgradable.)

The precise definition of "upgradable" today comes down to this:

```go
// capacity. It is the responsibility of the caller to choose a subset of the nodes given the capacity.
```

There's ongoing work/design on having the MCO actually manage its Upgradeable flag - we should set it to false if we detect the control plane or worker pool drifting too far, which would help here.
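
If/when the MCO reports that condition, admins could watch it on the operator's ClusterOperator object; a sketch (the condition may not be populated on current releases):

```bash
# Print the Upgradeable condition on the machine-config ClusterOperator, if set.
oc get clusteroperator/machine-config \
  -o jsonpath='{.status.conditions[?(@.type=="Upgradeable")]}'
```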

@cgwalters
Member Author

Another important detail here is not just the ordering but controlling when any updates are applied. There are a few approaches to that today:

Another big hammer is to create lots of machineconfigpools, perhaps even to the point of a pool per node. This is a pretty extreme scenario for the MCO; I think it would probably work, but it would create a lot of extra overhead.
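
For reference, the pool-per-node (or pool-per-batch) hammer is roughly the standard custom-pool pattern, repeated many times; a sketch with illustrative names:

```bash
oc apply -f - <<'EOF'
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-batch-1
spec:
  machineConfigSelector:
    matchExpressions:
    - key: machineconfiguration.openshift.io/role
      operator: In
      values: [worker, worker-batch-1]
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker-batch-1: ""
EOF

# Move a node into the new pool by labeling it:
oc label node/worker-3 node-role.kubernetes.io/worker-batch-1=
```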

Better approaches:

@michaelgugino
Contributor

* [kubernetes/enhancements#1411](https://github.com/kubernetes/enhancements/pull/1411) ?

One possible implementation using the proposed lease might look like:

  1. Prior to upgrading, all nodes that may only be upgraded during a maintenance window have the maintenance lease acquired by either the administrator or a controller that the user provides (which we may provide in the future; bare metal wants to build something very similar). This may be part of an upgrade-specific pre-flight that administrators perform on their cluster, or the lease may be acquired 100% of the time (though that would be something of an anti-pattern); this is up to the cluster administrator.
  2. MCO respects the lease. If someone other than the MCO holds the lease, MCO takes no action on that host.
  3. Administrator or controller releases the lease on one or more nodes at a time they see fit.
  4. Upon next reconcile, MCO acquires the lease for one or more nodes.
  5. MCO performs upgrade.
  6. MCO releases the lease for one or more nodes.

To put it plainly, the lease is an intent to perform or prevent some disruptive action. This exposes an interface on top of which others may build useful components or procedures. It also allows other components to have less knowledge about the MCO, and the MCO to have less knowledge of those other components.
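
As a sketch of what such a per-node lease could look like if the KEP landed (the name, namespace, and holder conventions below are all hypothetical):

```bash
oc apply -f - <<'EOF'
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: maintenance-worker-3        # hypothetical: one lease per node
  namespace: node-maintenance       # hypothetical namespace (would have to exist)
spec:
  holderIdentity: cluster-admin     # whoever currently owns disruption for this node
  leaseDurationSeconds: 3600
EOF
```

With step 2 above, the MCO would simply skip any node whose lease it does not hold.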

@runcom
Member

runcom commented Sep 10, 2020

To put it plainly, the lease is an intent to perform or prevent some disruptive action. This exposes an interface on top of which others may build useful components or procedures. It also allows other components to have less knowledge about the MCO, and the MCO to have less knowledge of those other components.

this sounds just great - I've been looking at the Lease type for a while now and this seems kind of related and super useful

@runcom
Member

runcom commented Sep 10, 2020

also sounds similar to airlock: https://github.com/poseidon/fleetlock#manual-intervention

@cgwalters
Member Author

There's clearly overlap between "upgrade weighting" (this initial proposal) and "upgrade blocking/leases". We could stretch "weighting" into including "blocking" per the above, but that doesn't solve interaction with other components the way leases would (and we have that problem in general with machineAPI).

In all of this remember though it's not just about the single reboot - in the general case I think we should handle "accumulated disruption" where pods are rescheduled onto nodes that are then immediately chosen for update.

Well...OTOH leases would solve at least one "extreme" case where pods are pinned to nodes and we want basically per-node locking and updating at most one controlplane/worker at a time; in that case the admin or controller puts a lease on every other node.

Hmm so I guess I am leaning towards leases.

@crawford
Contributor

I am very much against this ordering mechanism. It's not clear to me how a user will successfully make use of this, short of plotting out on a whiteboard all of the pods and machines and a plan for how pods will be moved and machines updated. That mental image couldn't be farther from what we are trying to do with OpenShift and Kubernetes in general. As Clayton alluded to above, all of this scheduling should be automatic and done without user intervention.

Unless I'm missing something, this is an anti-pattern and we should move in a different direction.

@michaelgugino
Contributor

I am very much against this ordering mechanism. It's not clear to me how a user will successfully make use of this, short of plotting out on a whiteboard all of the pods and machines and a plan for how pods will be moved and machines updated. That mental image couldn't be farther from what we are trying to do with OpenShift and Kubernetes in general. As Clayton alluded to above, all of this scheduling should be automatic and done without user intervention.

Unless I'm missing something, this is an anti-pattern and we should move in a different direction.

This is a story that came up in 4.0 planning. We need a way for users to upgrade X% of their nodes to a given release and let that soak. This is a somewhat different concern from the original issue here.

I'm a fan of informing the scheduler of the kubelet version and having it prioritize the newest ones. It seems like it would be easy to do: do scheduling as normal to find suitable nodes, sort by kubelet version, pick the highest in that list. Of course, if the scheduler only finds the first fit rather than the best fit, then this would be a much larger change (but based on how preemption works, it possibly works in the former case).
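
For context, the kubelet version is already exposed in node status, so the sort the scheduler would need is the same one you can do by hand:

```bash
# List nodes ordered by kubelet version (oldest first).
oc get nodes --sort-by='{.status.nodeInfo.kubeletVersion}' \
  -o custom-columns=NAME:.metadata.name,KUBELET:.status.nodeInfo.kubeletVersion
```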

@cgwalters
Member Author

It's not clear to me how a user will successfully make use of this, short of plotting out on a whiteboard all of the pods and machines and a plan for how pods will be moved and machines updated.

Nothing in this proposal requires that all nodes have a priority. Unlabeled nodes could default to e.g. 0; an admin could then use negative values to order nodes after the default, and positive ones before it - for example, giving a negative priority to a few critical nodes hosting the postgres database.

It also doesn't require planning out how pods move - that's a strongly related but orthogonal thing.

That said, this issue is motivated by at least one request, at the extreme, to do exactly what you're saying - the admin wants to choose which nodes to update. In that case there isn't a concern about interaction with e.g. PDBs, because the pods are pinned to nodes.

@derekwaynecarr
Member

derekwaynecarr commented Sep 16, 2020

I agree that marking the node as no longer schedulable is a nice way of identifying that a node is about to undergo maintenance. I had been looking at a situation where I actually want a pod to schedule to a node; I just don't want the node upgraded. So the ability to cordon a node from an upgrade pool, but not from scheduling, is the semantic I was trying to think through.

@derekwaynecarr
Member

Another observation:

Should we add a rollout strategy concept on a machine config pool? One example rollout strategy would be update by topology, so I could ensure all nodes in zone X are updated before zone Y. This would aid workloads that are zone or location aware.
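
No such field exists in the MCO API today; a purely hypothetical sketch of the shape it might take:

```bash
# Hypothetical API sketch only (printed, not applied): the MCO has no
# rolloutStrategy field today.
cat <<'EOF'
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker
spec:
  rolloutStrategy:                          # hypothetical field
    type: ByTopology
    topologyKey: topology.kubernetes.io/zone
    zoneOrder: ["us-east-1a", "us-east-1b", "us-east-1c"]
EOF
```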

@megan-hall

Hopefully this comment relates to this particular issue. During a cluster upgrade we (UXD) would like to be able to surface in the console the difference between a node that is 'ready and up to date' vs. a node that is 'ready and waiting to update'. Is there something we could do so that users can quickly distinguish between the two? Perhaps an annotation - waiting to update - or something like that?

@wking
Member

wking commented Oct 6, 2020

One example rollout strategy would be update by topology, so I could ensure all nodes in zone X are updated before zone Y. This would aid workloads that are zone or location aware.

This would increase your exposure to zone outages, wouldn't it? What if you had a workload that happened to be incompatible with the incoming MachineConfig? You could have PDBs that protect it from excessive degradation during a randomized rollout, but if it gets squeezed into the old-MachineConfig zone A, zones B and C completely update, and then zone A has an outage, you'd be immediately thrust into an outage for that workload.

@cgwalters
Member Author

cgwalters commented Oct 15, 2020

Initial implementation in #2162 and a separate PR in #2163

cgwalters added a commit to cgwalters/machine-config-operator that referenced this issue Oct 15, 2020
Today the MCO arbitrarily chooses a node to update from the candidates.
We want to allow admins to avoid specific nodes entirely (for as long
as they want) as well as guide upgrade ordering.

This replaces the defunct etcd-specific code with support for a generic
annotation `machineconfiguration.openshift.io/update-order` that allows
an external controller (and/or human) to do both of these.

Setting it to `0` will entirely skip that node for updates.  Otherwise,
higher values are preferred.

Closes: openshift#2059
cgwalters added a commit to cgwalters/machine-config-operator that referenced this issue Oct 15, 2020
Today the MCO arbitrarily chooses a node to update from the candidates.
We want to allow admins to avoid specific nodes entirely.

(Aside: This replaces the defunct etcd-specific ordering code that
 we didn't end up using)

Add an annotation `machineconfiguration.openshift.io/hold` that allows
an external controller (and/or human) to avoid specific nodes.

Related: openshift#2059
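
A sketch of how the annotations proposed in the two commits above might be applied by hand, if those PRs landed as described (neither is a merged or stable API, and the value for `hold` is a guess since the commit message does not specify one):

```bash
# From the first commit: higher update-order is preferred; 0 skips the node entirely.
oc annotate node/worker-2 machineconfiguration.openshift.io/update-order=100 --overwrite
oc annotate node/worker-7 machineconfiguration.openshift.io/update-order=0 --overwrite

# From the second commit: hold a node back from updates (value assumed).
oc annotate node/worker-7 machineconfiguration.openshift.io/hold=true --overwrite
```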
@zvonkok

zvonkok commented Oct 15, 2020

This could help prevent nodes from upgrading to a kernel that is not supported by an out-of-tree driver. Only these special-resource nodes can run the workload, and draining them could kill it, since the "general" CPU nodes cannot handle it.
Customers could then try to build the drivers on the updated nodes without disrupting the special nodes.
The question is how many update cycles such an annotated node would "survive".

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot added the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) on Mar 3, 2021
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot added the lifecycle/rotten label (Denotes an issue or PR that has aged beyond stale and will be auto-closed.) and removed the lifecycle/stale label on Apr 2, 2021
@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci-robot
Contributor

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cgwalters
Member Author

/lifecycle frozen
This keeps coming up

@openshift-ci bot added the lifecycle/frozen label (Indicates that an issue or PR should not be auto-closed due to staleness.) and removed the lifecycle/rotten label on Sep 24, 2021
@cgwalters
Member Author

Today, any admin that wants to fully control node upgrade ordering can:

First, pause the pool controlling those nodes (e.g. worker). The node controller will then not apply desiredConfig annotations to the targeted nodes. Crucially, however, the render controller will still generate new rendered machineconfigs.

So, for example, when you upgrade the cluster to pick up a new kernel security update, the worker pool will update to a new spec.configuration. (Note that if you've paused the worker pool but not the control plane pool, the control plane will still have updated.)

Now, if you want to manually target a node for update, it should work to run `oc edit node/<somenode>` and update the `machineconfiguration.openshift.io/desiredConfig` annotation to the current value of spec.configuration in the worker pool.

The machine-config-daemon (daemonset) running on that node will then react to that annotation and apply it (drain + reboot as needed), the same way it would if the node controller had changed it.
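
A sketch of that manual flow with `oc` (node and rendered-config names are placeholders):

```bash
# 1. Pause the pool so the node controller stops rolling nodes itself.
oc patch machineconfigpool/worker --type merge -p '{"spec":{"paused":true}}'

# 2. After a cluster upgrade, see which rendered config the pool now wants.
oc get machineconfigpool/worker -o jsonpath='{.spec.configuration.name}{"\n"}'

# 3. Hand-pick a node and point it at that rendered config; the MCD on the node
#    drains, applies, and reboots as needed.
oc annotate node/<somenode> --overwrite \
  machineconfiguration.openshift.io/desiredConfig=rendered-worker-<hash>
```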
