
Implement region replacement for Volumes #5683

Closed

Conversation


@jmpesp jmpesp commented May 1, 2024

When a disk is expunged, any region that was on that disk is assumed to be gone. A single disk expungement can put many Volumes into degraded states, as one of the three mirrors of a region set is now gone. Volumes that are degraded in this way remain degraded until a new region is swapped in, and the Upstairs performs the necessary repair operation (either through a Live Repair or Reconciliation). Nexus can only initiate these repairs - it does not participate in them, instead requesting that a Crucible Upstairs perform the repair.

These repair operations can only be done by an Upstairs running as part of an activated Volume: either Nexus has to send this Volume to a Pantry and repair it there, or Nexus has to talk to a Propolis that has that active Volume. Further complicating things is that the Volumes in question can be activated and deactivated as a result of user action, namely starting and stopping Instances. This will interrupt any ongoing repair. This is ok! Both operations support being interrupted, but as a result it's then Nexus' job to continually monitor these repair operations and initiate further operations if the current one is interrupted.

Nexus starts by creating region replacement requests, either manually or as a result of disk expungement. These region replacement requests go through the following states:

    Requested   <--
                  |
        |         |
        v         |
                  |
    Allocating  --

        |
        v

     Running    <--
                  |
        |         |
        v         |
                  |
     Driving    --

        |
        v

 ReplacementDone  <--
                    |
        |           |
        v           |
                    |
    Completing    --

        |
        v

    Completed
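
For reference, a minimal sketch of how these states can be represented; the commit message below notes they are captured in a `RegionReplacementState` enum, so the variant names mirror the diagram, but the derives here are illustrative rather than the actual DB model:

```rust
/// States a region replacement request moves through, mirroring the
/// diagram above. The real Nexus model derives additional database
/// traits; this trimmed version only lists the variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RegionReplacementState {
    Requested,
    Allocating,
    Running,
    Driving,
    ReplacementDone,
    Completing,
    Completed,
}
```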

A single saga invocation is not enough to continually make sure a Volume is being repaired, so region replacement is structured as a series of background tasks and saga invocations from those background tasks.

Here's a high-level summary:

  • a region replacement background task:

    • looks for disks that have been expunged and inserts region replacement requests into CRDB with state Requested

    • looks for all region replacement requests in state Requested (picking up new requests and requests that failed to transition to Running), and invokes a region replacement start saga.

  • the region replacement start saga:

    • transitions the request to state Allocating, blocking out other invocations of the same saga

    • allocates a new replacement region

    • alters the Volume Construction Request by swapping out the old region for the replacement one

    • transitions the request to state Running

    • any unwind will transition the request back to the Requested state.

  • a region replacement drive background task:

    • looks for requests with state Running, and invokes the region replacement drive saga for those requests

    • looks for requests with state ReplacementDone, and invokes the region replacement finish saga for those requests

  • the region replacement drive saga will:

    • transition a request to state Driving, again blocking out other invocations of the same saga

    • check if Nexus has taken an action to initiate a repair yet. If not, then one is needed. If it has previously initiated a repair operation, the state of the system is examined: is that operation still running? Has something changed? Further action may be required depending on this observation. (A rough sketch of this decision loop follows the summary below.)

    • if an action is required, Nexus will prepare an action that will initiate either Live Repair or Reconciliation based on the current observed state of the system.

    • that action is then executed. If there was an error, then the saga unwinds. If it was successful, it is recorded as a "repair step" in CRDB and will be checked the next time the saga runs.

    • if Nexus observed an Upstairs telling it that a repair was completed or was not necessary, then the request is placed into the ReplacementDone state; otherwise it is placed back into the Running state. If the saga unwinds, it unwinds back to the Running state.

  • finally, the region replacement finish saga will:

    • transition a request into Completing

    • delete the old region by deleting a transient Volume that refers to it (in the case where a sled or disk is actually physically gone, expunging that will trigger #4331, which needs to be fixed!)

    • transition the request to the Completed state

More detailed documentation is provided in each region replacement saga's opening docstring.
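
As a rough, self-contained sketch of the drive saga's check-then-act loop described above (all types and names here are invented for illustration; the real saga works against Nexus DB records and Crucible clients):

```rust
/// What the drive saga can observe about the last recorded repair step
/// (hypothetical enum, for illustration only).
enum Observation {
    /// No repair has been initiated for this request yet.
    NoActionTaken,
    /// A previously initiated repair is still running in an Upstairs.
    RepairInProgress,
    /// The Upstairs reported the repair finished or was not necessary.
    RepairFinishedOrUnnecessary,
    /// Something changed (e.g. the Volume was deactivated or moved), so a
    /// new repair action is needed.
    StateChanged,
}

/// Which state the request ends up in after one drive saga invocation.
enum DriveOutcome {
    /// Stay in (or return to) Running; the drive background task will
    /// invoke the saga again later.
    BackToRunning,
    /// Move to ReplacementDone so the finish saga can be invoked.
    ReplacementDone,
}

fn drive_one_iteration(observation: Observation) -> DriveOutcome {
    match observation {
        Observation::NoActionTaken | Observation::StateChanged => {
            // Prepare and execute an action that initiates either Live
            // Repair or Reconciliation, then record it as a "repair step"
            // in CRDB so the next invocation can check on it. On error,
            // the saga unwinds back to Running instead.
            DriveOutcome::BackToRunning
        }
        // The previously recorded step is still in progress; check again
        // on the next background task activation.
        Observation::RepairInProgress => DriveOutcome::BackToRunning,
        Observation::RepairFinishedOrUnnecessary => DriveOutcome::ReplacementDone,
    }
}
```

In the actual saga, the observation comes from talking to the Pantry or the Propolis holding the active Volume, as described above.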

Testing was done manually on the Canada region using the following test cases:

  • a disk needing repair is attached to an instance for the duration of the repair

  • a disk needing repair is attached to an instance that is migrated mid-repair

  • a disk needing repair is attached to an instance that is stopped mid-repair

  • a disk needing repair is attached to an instance that is stopped mid-repair, then started in the middle of the Pantry's repair

  • a detached disk needs repair

  • a detached disk needs repair, and is then attached to an instance that is then started

  • a sled is expunged, causing region replacement requests for all regions on it

Fixes #3886
Fixes #5191

@jmpesp jmpesp requested review from andrewjstone and leftwo May 1, 2024 20:06
jmpesp added 6 commits May 2, 2024 02:57
fix case where mark_region_replacement_as_done wasn't changing the state
of a request for which there was a drive saga running.

@andrewjstone andrewjstone left a comment


James, this is epic work. I've only given it a cursory look so far, and will need to spend much more time digging in. Given how far behind current main this is, and the slight alleviation of urgency, I was wondering if you could split this up into multiple logical PRs to make it easier to review. I think this should be feasible by splitting along datastore queries and then saga / background task lines. Each of those can be added to the code and tested without being used. The background tasks, for instance, don't need to be enabled immediately, and the sagas don't need to be triggered by the background tasks or omdb. The omdb change can come in last. My gut feeling is that this would also make it easier to test things in isolation, as you may see issues while doing the split and writing individual commit messages.

.transaction_async(|conn| async move {
    use db::schema::region_replacement::dsl;

    match (args.state, args.after) {

Nit: Rather than match on different filters, you could create a query without the filters, and then append them. This should be much less code. Here's an example:

use db::schema::disk::dsl;
let mut query = dsl::disk.into_boxed();
if !fetch_opts.include_deleted {
    query = query.filter(dsl::time_deleted.is_null());
}
let disks = query
    .limit(i64::from(u32::from(fetch_opts.fetch_limit)))
    .select(Disk::as_select())
    .load_async(&*datastore.pool_connection_for_tests().await?)
    .await
    .context("loading disks")?;

//! TODO this is currently a placeholder for a future PR
//! This task's responsibility is to create region replacement requests when
//! physical disks are expunged, and trigger the region replacement start saga
//! for any requests that are in state "Requested". See the documentation there

See the documentation where? The region replacement start saga?

@@ -109,6 +109,11 @@ blueprints.period_secs_execute = 600
sync_service_zone_nat.period_secs = 30
switch_port_settings_manager.period_secs = 30
region_replacement.period_secs = 30
# The driver task should wake up frequently, something like every 10 seconds.

That's unfortunate. It would be nice if this could be redacted out just for this message, but I'm not sure if that's possible.

// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at https://mozilla.org/MPL/2.0/.

//! # first, some Crucible background #

great comment!


jmpesp commented May 17, 2024

Closing this, will split it up!

@jmpesp jmpesp closed this May 17, 2024
jmpesp added a commit to jmpesp/omicron that referenced this pull request May 17, 2024
Splitting up oxidecomputer#5683 first by separating out the DB models, queries, and
schema changes required:

1. region replacement records

This commit adds a Region Replacement record, which is a request to
replace a region in a volume. It transitions through the following
states:

        Requested   <--
                      |
            |         |
            v         |
                      |
        Allocating  --

            |
            v

         Running    <--
                      |
            |         |
            v         |
                      |
         Driving    --

            |
            v

     ReplacementDone  <--
                        |
            |           |
            v           |
                        |
        Completing    --

            |
            v

        Completed

which are captured in the `RegionReplacementState` enum. Transitioning
from Requested to Running is the responsibility of the "start" saga,
iterating between Running and Driving is the responsibility of the
"drive" saga, and transitioning from ReplacementDone to Completed is the
responsibility of the "finish" saga. All of these will come in
subsequent PRs.

The state transitions themselves are performed by these sagas and all
involve a query that:

- checks that the starting state (and other values as required) make
  sense
- updates the state while setting a unique `operating_saga_id` (and any
  other fields as appropriate)

As multiple background tasks will be waking up, checking to see what
sagas need to be triggered, and requesting that these region replacement
sagas run, this is meant to block multiple sagas from running at the
same time in an effort to cut down on interference - most will unwind at
the first step instead of somewhere in the middle.
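
For illustration, an in-memory analogue of the semantics of that conditional transition; the real code is a single query against the region replacement table, and the type and field names below are stand-ins:

```rust
type SagaId = u64; // stand-in for the real saga UUID

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum State {
    Requested,
    Allocating,
    // ... remaining states elided
}

struct RegionReplacementRequest {
    state: State,
    operating_saga_id: Option<SagaId>,
}

/// Atomically move a request from `from` to `to`, claiming it for
/// `saga_id`. Returns false (and changes nothing) if the request is not in
/// the expected starting state or is already claimed by another saga, in
/// which case the caller unwinds at its first step.
fn try_transition(
    request: &mut RegionReplacementRequest,
    from: State,
    to: State,
    saga_id: SagaId,
) -> bool {
    if request.state == from && request.operating_saga_id.is_none() {
        request.state = to;
        request.operating_saga_id = Some(saga_id);
        true
    } else {
        false
    }
}
```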

2. region replacement step records

As region replacement takes place, Nexus will be making calls to
services in order to trigger the necessary Crucible operations meant to
actually perform the replacement. These steps are recorded in the
database so that they can be consulted by subsequent steps, and
additionally act as breadcrumbs if there is an issue.

3. volume repair records

Nexus should take care to only replace one region (or snapshot!) for a
volume at a time. Technically, the Upstairs can support two at a time,
but codifying "only one at a time" is safer, and does not allow the
possibility for a Nexus bug to replace all three regions of a region set
at a time (aka total data loss!). This "one at a time" constraint is
enforced by each repair also creating a VolumeRepair record, a table for
which there is a UNIQUE CONSTRAINT on the volume ID.
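
A toy in-memory analogue of what that UNIQUE CONSTRAINT provides; the real mechanism is simply the VolumeRepair table, and the types below are invented:

```rust
use std::collections::HashMap;

type VolumeId = u64; // stand-ins for the real UUIDs
type RepairId = u64;

/// At most one repair may hold a given volume at a time, just as the
/// UNIQUE CONSTRAINT on volume ID allows at most one VolumeRepair row.
struct VolumeRepairLocks {
    repairs: HashMap<VolumeId, RepairId>,
}

impl VolumeRepairLocks {
    /// Try to claim `volume` for `repair`. If another repair already holds
    /// it, return that repair's id instead (analogous to the INSERT failing
    /// the unique constraint).
    fn try_lock(&mut self, volume: VolumeId, repair: RepairId) -> Result<(), RepairId> {
        match self.repairs.get(&volume) {
            Some(existing) => Err(*existing),
            None => {
                self.repairs.insert(volume, repair);
                Ok(())
            }
        }
    }
}
```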

4. also, the `volume_replace_region` function

The `volume_replace_region` function is also included in this PR. In a
single transaction, this will:

- set the target region's volume id to the replacement's volume id
- set the replacement region's volume id to the target's volume id
- update the target volume's construction request to replace the target
  region's SocketAddrV6 with the replacement region's

This is called from the "start" saga, after allocating the replacement
region, and is meant to transition the Volume's construction request
from "indefinitely degraded, pointing to region that is gone" to
"currently degraded, but can be repaired".