Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

host-sp-comms: do not reboot host on boot failure #1618

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

cbiffle
Copy link
Collaborator

@cbiffle cbiffle commented Feb 14, 2024

This is in response to #1613. If the host reports a boot failure (such as a phase mismatch, but not limited to that reason) simply rebooting it blindly is unlikely to fix the problem. We need intervention from a higher power (the control plane) to fix the issue.

So to avoid a bootloop that wastes energy and overwrites our circular buffers with spam, this change alters the response to the IPCC Request Reboot message if received shortly after a Boot Failed message -- it is interpreted as a power off request.

Fixes #1614.

This is in response to #1613. If the host reports a boot failure (such
as a phase mismatch, but not limited to that reason) simply rebooting it
blindly is unlikely to fix the problem. We need intervention from a
higher power (the control plane) to fix the issue.

So to avoid a bootloop that wastes energy and overwrites our circular
buffers with spam, this change alters the response to the IPCC Request
Reboot message if received shortly after a Boot Failed message -- it is
interpreted as a power off request.

Fixes #1614.
let response = match request {
HostToSp::_Unused => {
Some(SpToHost::DecodeFailure(DecodeFailureReason::Deserialize))
}
HostToSp::RequestReboot => {
action = Some(Action::RebootHost);
if core::mem::take(&mut self.host_boot_just_failed) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of mem::take here is very cute! But, IIUC, since we clear the flag later if action is Some, we don't actually need to clear it when we read the flag's value here, do we? Since both branches set action to Some regardless?

Up to you whether this actually ought to change, but at a first glance it kind of appeared that this was the only case where we clear the flag, so it was a bit less obvious that any branch where action is Some clears the flag (despite the comment). 🤷‍♀️

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I wound up doing both here to make it really really unlikely that we forget to clear the flag. Arguably unnecessary. What do you think would be clearest?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think maybe leaving it as is, but commenting that this also gets cleared because we've set action = Some(...)? Honestly, I think it's not really a big deal --- mostly, I wanted to make sure my understanding was correct.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fact that it threw you suggests that I need to make it more clear, so I'll see about adding some comments.


// Record that this failure happened, which will change the
// behavior of the reboot request we expect to follow shortly.
self.host_boot_just_failed = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we care about the reason (which we're ..'ing out in the match arm), or should all boot failure reasons be treated equally? Maybe a question for @citrus-it

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be my strong preference to have the same behavior regardless of reported reason, but I'm open to changing that if somebody wants to make an argument!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't ever expect to see any boot failures apart from in the lab where they're triggered pretty often in my experience. It just takes somebody to do something like configure net boot and forget to start the boot server on the butler, or start it with a typo in the options. We should probably send a heads-up about this new behaviour.

It is very unlikely that any of the boot failure reasons indicate that there would be a different result if retried, I think they should all be treated equally as requiring intervention.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, I did just hit this exact case the other day personally because I wasn't fast enough in the lab with net boot/partially distracted. Is there something that makes it very clear from the sequencer state or related that we're in this case as a human looking at the system?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The behavior as implemented in this commit won't affect the sequencer state -- the machine will just power itself down on boot failure and wait to be turned back on. We don't currently have a great way of exposing "the last problem with this was _____" data in production builds. As a user, what sort of experience would you hope for?

action = Some(Action::RebootHost);
if core::mem::take(&mut self.host_boot_just_failed) {
// Do _not_ reboot. Reinterpret this as a poweroff request.
action = Some(Action::PowerOffHost);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to look at this more tomorrow, but we might need to address the TODO I left in handle_jefe_notification. There may be a sequence of reboot operations where we still reboot even if we set this action.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to know, let's talk tomorrow.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I have convinced myself this is okay (where "this" is really "the handling of self.reboot_state"):

  1. self.reboot_state starts out None
  2. It can only go from None to Some(_) inside a call to power_off_host(true)
  3. Once it goes to Some(_), it will eventually cause the host to come back on through one of these paths:
  • If power_off_host(true) succeeds in asking the sequencer to go to A2, it's set to Some(WaitingForA2). Once we get a notification from jefe that we've entered A2, it will change to Some(WaitingInA2RebootDelay) and start the Timers::WaitingInA2ToReboot timer. Once that timer fires, we'll enter handle_reboot_waiting_in_a2_timer, set the state to A0, and set self.reboot_state back to `None.
  • If power_off_host(true) fails to ask the sequencer to go to A2 because we're already in A2, it's set to Some(WaitingInA2RebootDelay) and we start the Timers::WaitingInA2ToReboot timer. Once that timer fires, we'll enter handle_reboot_waiting_in_a2_timer, set the state to A0, and set self.reboot_state back to `None.

power_off_host(true) is only called in two places: if we get a jefe notification that we've gone to A0Reset, or if the host sends us HostToSp::RequestReboot. If either of those happened and before we go through the above timer-based process (~5 seconds) to reboot the host we landed here with a host boot fail, the PowerOffHost would be effectively ignored because we'd already be in the process of rebooting the host. But that doesn't seem likely, and if it happens it's maybe fine to honor the "first request" that triggered the reboot?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, if the host sends back-to-back "power off" and "reboot" requests, we'll behave ... rather strangely. I wonder if it might be worth adding code to stop listening to IPCC entirely starting at any reboot/powerdown request and continuing up until we believe the host to have been powered back on.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SP should not auto-reboot host in response to a host-reported boot failure
5 participants