SP should not auto-reboot host in response to a host-reported boot failure #1614

Open
cbiffle opened this issue Feb 14, 2024 · 0 comments · May be fixed by #1618


cbiffle commented Feb 14, 2024

(Fallout from #1613)

Currently, if the host reports a boot failure over the IPCC link, we respond by recording the information and resuming normal business. When the host immediately follows that up with a reboot request, we dutifully reboot the host.

Because we haven't taken any additional actions to fix the boot failure (by, for instance, flipping the host flash mux), this will probably always produce a reboot loop.

While this sort of reboot loop is likely not destructive, it's distracting: the machine cycles, the logs/ringbufs get overwritten, power is wasted, etc. I think after a boot failure like this, we should probably not attempt to boot the host until we have reason to believe the failure has been repaired.

The SP itself doesn't have sufficient context to know how to "repair" such a failure. If the failure was hit while attempting a recovery image boot through Wicket, for instance, we specifically do not want to do an automatic slot fallback. If we hit it during a production software upgrade, we might, depending on circumstances, want to do a slot fallback. The right answer in basically all cases appears to be: escalate to the control plane, where context is more easily available.

So, I think we should stop rebooting the host after a boot failure, period, and wait for messages over the network. The boot failure is stored where the control plane can get to it (in the control-plane-agent). If we had a way of proactively sounding an alarm, we could do that, but for now it'd have to be polled.

Concretely, I discussed this briefly with @wesolows and the simplest thing appears to be:

  1. Honor reboot requests from the host normally, except
  2. If we get a host boot failure message, set a flag that causes the next reboot request to be interpreted as "power down and intervene."
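
The two rules above can be sketched as a small state machine. This is only an illustration, not the SP's actual code: `HostMessage`, `Action`, `HostPolicy`, and `boot_failure_pending` are hypothetical names, and clearing the flag once it has been consumed is an assumption based on the wording "the next reboot request."

```rust
/// Hypothetical subset of the messages the host can send over the IPCC link.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum HostMessage {
    BootFailed,
    RequestReboot,
}

/// Hypothetical actions the SP takes in response.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Action {
    /// Record the failure so the control plane can poll for it.
    RecordFailure,
    /// Reboot the host as requested.
    RebootHost,
    /// Power the host down and wait for intervention over the network.
    PowerOffAndWait,
}

/// Tracks whether a boot failure has been reported since the last reboot request.
#[derive(Debug, Default)]
struct HostPolicy {
    boot_failure_pending: bool,
}

impl HostPolicy {
    fn handle(&mut self, msg: HostMessage) -> Action {
        match msg {
            HostMessage::BootFailed => {
                // Rule 2: remember the failure for the next reboot request.
                self.boot_failure_pending = true;
                Action::RecordFailure
            }
            HostMessage::RequestReboot if self.boot_failure_pending => {
                // Assumption: the flag applies to one request, so clear it here.
                self.boot_failure_pending = false;
                Action::PowerOffAndWait
            }
            // Rule 1: honor reboot requests normally.
            HostMessage::RequestReboot => Action::RebootHost,
        }
    }
}
```

In this sketch, a `BootFailed` followed by a `RequestReboot` yields `PowerOffAndWait`, while a `RequestReboot` with no preceding failure still yields `RebootHost`.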

cbiffle added a commit that referenced this issue Feb 14, 2024

This is in response to #1613. If the host reports a boot failure (such
as a phase mismatch, but not limited to that reason) simply rebooting it
blindly is unlikely to fix the problem. We need intervention from a
higher power (the control plane) to fix the issue.

So to avoid a bootloop that wastes energy and overwrites our circular
buffers with spam, this change alters the response to the IPCC Request
Reboot message if received shortly after a Boot Failed message -- it is
interpreted as a power off request.

Fixes #1614.
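
The commit message says the reboot request is reinterpreted only if it arrives "shortly after" a Boot Failed message, without saying how long that is. One way to express such a window, as a rough sketch: `RebootGate`, its methods, and the 30-second `BOOT_FAIL_WINDOW` are all made up for illustration, and `std::time` is used only so the example runs on a desktop; firmware on the SP would use its own clock.

```rust
use std::time::{Duration, Instant};

/// Hypothetical window; the commit message does not give a value.
const BOOT_FAIL_WINDOW: Duration = Duration::from_secs(30);

/// Remembers when the host last reported a boot failure.
#[derive(Debug, Default)]
struct RebootGate {
    last_boot_failure: Option<Instant>,
}

impl RebootGate {
    /// Call when a Boot Failed message arrives.
    fn note_boot_failure(&mut self, now: Instant) {
        self.last_boot_failure = Some(now);
    }

    /// Returns true if a reboot request arriving at `now` should be
    /// treated as a power-off request instead.
    fn reboot_becomes_power_off(&self, now: Instant) -> bool {
        match self.last_boot_failure {
            Some(t) => now.saturating_duration_since(t) <= BOOT_FAIL_WINDOW,
            None => false,
        }
    }
}
```

A time window lets an unrelated reboot request long after the failure go through normally, at the cost of having to choose a threshold.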

cbiffle linked a pull request Feb 14, 2024 that will close this issue