all disk deletes hang while Crucible downstairs is unreachable #4331

Open
augustuswm opened this issue Oct 24, 2023 · 1 comment · May be fixed by #5789

@augustuswm
Contributor

augustuswm commented Oct 24, 2023

When the disk_delete saga runs, one of its responsibilities is to clean up regions that are freed as a result of a volume being deleted (svd_delete_freed_crucible_regions). This uses the following query to find regions that need to be cleaned up:

select * from region
inner join volume on region.volume_id = volume.id
inner join dataset on region.dataset_id = dataset.id
left join region_snapshot on region_snapshot.region_id = region.id or region_snapshot.dataset_id = dataset.id
where (region_snapshot.volume_references = 0 or region_snapshot.volume_references is null)
and not volume.time_deleted is null;

The results look something like this (keys redacted):

                   id                  |         time_created          |         time_modified         |              dataset_id              |              volume_id               | block_size | blocks_per_extent | extent_count |                  id                  |         time_created         |        time_modified         |         time_deleted          | rcgen |                                                                                                                                                                                                                     data                                                                                                                                                                                                                     |resources_to_clean_up                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |                  id                  |         time_created          |         time_modified         | time_deleted | rcgen |               pool_id                |          ip           | port  |   kind   
|  size_used  | dataset_id | region_id | snapshot_id | snapshot_addr | volume_references

  341e684e-c49a-412d-99b1-9a64ce33a6c9 | 2023-10-16 20:44:49.184976+00 | 2023-10-16 20:44:49.184976+00 | 2074a935-c0b3-4c4f-aae5-a29adae3e1ac | 1f86e688-937a-4e58-81f2-a30d697e92c0 |        512 |            131072 |          160 | 1f86e688-937a-4e58-81f2-a30d697e92c0 | 2023-10-16 20:44:50.00492+00 | 2023-10-16 20:44:50.00492+00 | 2023-10-23 17:00:21.299613+00 |     1 | {"type":"volume","block_size":512,"id":"1f86e688-937a-4e58-81f2-a30d697e92c0","sub_volumes":[{"type":"region","block_size":512,"blocks_per_extent":131072,"extent_count":160,"gen":1,"opts":{"id":"1f86e688-937a-4e58-81f2-a30d697e92c0","key":"............................................","lossy":false,"read_only":false,"target":["[fd00:1122:3344:112::8]:19001","[fd00:1122:3344:11b::7]:19000","[fd00:1122:3344:108::7]:19000"]}}]} | {"V1":{"datasets_and_regions":[[{"identity":{"id":"2074a935-c0b3-4c4f-aae5-a29adae3e1ac","time_created":"2023-09-02T13:09:09.879257Z","time_modified":"2023-09-02T13:09:09.879257Z"},"time_deleted":null,"rcgen":1,"pool_id":"ac663368-45fb-447c-811e-561c68e37bdd","ip":"fd00:1122:3344:112::8","port":32345,"kind":"Crucible","size_used":21474836480},{"identity":{"id":"341e684e-c49a-412d-99b1-9a64ce33a6c9","time_created":"2023-10-16T20:44:49.184976Z","time_modified":"2023-10-16T20:44:49.184976Z"},"dataset_id":"2074a935-c0b3-4c4f-aae5-a29adae3e1ac","volume_id":"1f86e688-937a-4e58-81f2-a30d697e92c0","block_size":512,"blocks_per_extent":131072,"extent_count":160}],[{"identity":{"id":"4b7a7052-f8e8-4196-8d6b-315943986ce6","time_created":"2023-09-02T13:09:09.879238Z","time_modified":"2023-09-02T13:09:09.879238Z"},"time_deleted":null,"rcgen":1,"pool_id":"a549421c-2f12-45cc-b691-202f0a9bfa8b","ip":"fd00:1122:3344:108::7","port":32345,"kind":"Crucible","size_used":10737418240},{"identity":{"id":"1c33fde1-34d1-4686-843d-9855c74f2e29","time_created":"2023-10-16T20:44:49.184976Z","time_modified":"2023-10-16T20:44:49.184976Z"},"dataset_id":"4b7a7052-f8e8-4196-8d6b-315943986ce6","volume_id":"1f86e68
8-937a-4e58-81f2-a30d697e92c0","block_size":512,"blocks_per_extent":131072,"extent_count":160}],[{"identity":{"id":"864fdb33-a36d-4bb0-96e6-a3b3b057ba8b","time_created":"2023-09-02T13:09:09.879256Z","time_modified":"2023-09-02T13:09:09.879256Z"},"time_deleted":null,"rcgen":1,"pool_id":"f59674a4-437e-4846-9abf-a71679082c82","ip":"fd00:1122:3344:11b::7","port":32345,"kind":"Crucible","size_used":21474836480},{"identity":{"id":"6a84deca-8b0f-475d-bbe2-1c684bfc0ec1","time_created":"2023-10-16T20:44:49.184976Z","time_modified":"2023-10-16T20:44:49.184976Z"},"dataset_id":"864fdb33-a36d-4bb0-96e6-a3b3b057ba8b","volume_id":"1f86e688-937a-4e58-81f2-a30d697e92c0","block_size":512,"blocks_per_extent":131072,"extent_count":160}]],"datasets_and_snapshots":[]}} | 2074a935-c0b3-4c4f-aae5-a29adae3e1ac | 2023-09-02 13:09:09.879257+00 | 2023-09-02 13:09:09.879257+00 | NULL         |     1 | ac663368-45fb-447c-811e-561c68e37bdd | fd00:1122:3344:112::8 | 32345 | crucible | 21474836480 | NULL       | NULL      | NULL        | NULL          |              NULL
  1c33fde1-34d1-4686-843d-9855c74f2e29 | 2023-10-16 20:44:49.184976+00 | 2023-10-16 20:44:49.184976+00 | 4b7a7052-f8e8-4196-8d6b-315943986ce6 | 1f86e688-937a-4e58-81f2-a30d697e92c0 |        512 |            131072 |          160 | 1f86e688-937a-4e58-81f2-a30d697e92c0 | 2023-10-16 20:44:50.00492+00 | 2023-10-16 20:44:50.00492+00 | 2023-10-23 17:00:21.299613+00 |     1 | {"type":"volume","block_size":512,"id":"1f86e688-937a-4e58-81f2-a30d697e92c0","sub_volumes":[{"type":"region","block_size":512,"blocks_per_extent":131072,"extent_count":160,"gen":1,"opts":{"id":"1f86e688-937a-4e58-81f2-a30d697e92c0","key":"............................................","lossy":false,"read_only":false,"target":["[fd00:1122:3344:112::8]:19001","[fd00:1122:3344:11b::7]:19000","[fd00:1122:3344:108::7]:19000"]}}]} | {"V1":{"datasets_and_regions":[[{"identity":{"id":"2074a935-c0b3-4c4f-aae5-a29adae3e1ac","time_created":"2023-09-02T13:09:09.879257Z","time_modified":"2023-09-02T13:09:09.879257Z"},"time_deleted":null,"rcgen":1,"pool_id":"ac663368-45fb-447c-811e-561c68e37bdd","ip":"fd00:1122:3344:112::8","port":32345,"kind":"Crucible","size_used":21474836480},{"identity":{"id":"341e684e-c49a-412d-99b1-9a64ce33a6c9","time_created":"2023-10-16T20:44:49.184976Z","time_modified":"2023-10-16T20:44:49.184976Z"},"dataset_id":"2074a935-c0b3-4c4f-aae5-a29adae3e1ac","volume_id":"1f86e688-937a-4e58-81f2-a30d697e92c0","block_size":512,"blocks_per_extent":131072,"extent_count":160}],[{"identity":{"id":"4b7a7052-f8e8-4196-8d6b-315943986ce6","time_created":"2023-09-02T13:09:09.879238Z","time_modified":"2023-09-02T13:09:09.879238Z"},"time_deleted":null,"rcgen":1,"pool_id":"a549421c-2f12-45cc-b691-202f0a9bfa8b","ip":"fd00:1122:3344:108::7","port":32345,"kind":"Crucible","size_used":10737418240},{"identity":{"id":"1c33fde1-34d1-4686-843d-9855c74f2e29","time_created":"2023-10-16T20:44:49.184976Z","time_modified":"2023-10-16T20:44:49.184976Z"},"dataset_id":"4b7a7052-f8e8-4196-8d6b-315943986ce6","volume_id":"1f86e68
8-937a-4e58-81f2-a30d697e92c0","block_size":512,"blocks_per_extent":131072,"extent_count":160}],[{"identity":{"id":"864fdb33-a36d-4bb0-96e6-a3b3b057ba8b","time_created":"2023-09-02T13:09:09.879256Z","time_modified":"2023-09-02T13:09:09.879256Z"},"time_deleted":null,"rcgen":1,"pool_id":"f59674a4-437e-4846-9abf-a71679082c82","ip":"fd00:1122:3344:11b::7","port":32345,"kind":"Crucible","size_used":21474836480},{"identity":{"id":"6a84deca-8b0f-475d-bbe2-1c684bfc0ec1","time_created":"2023-10-16T20:44:49.184976Z","time_modified":"2023-10-16T20:44:49.184976Z"},"dataset_id":"864fdb33-a36d-4bb0-96e6-a3b3b057ba8b","volume_id":"1f86e688-937a-4e58-81f2-a30d697e92c0","block_size":512,"blocks_per_extent":131072,"extent_count":160}]],"datasets_and_snapshots":[]}} | 4b7a7052-f8e8-4196-8d6b-315943986ce6 | 2023-09-02 13:09:09.879238+00 | 2023-09-02 13:09:09.879238+00 | NULL         |     1 | a549421c-2f12-45cc-b691-202f0a9bfa8b | fd00:1122:3344:108::7 | 32345 | crucible | 10737418240 | NULL       | NULL      | NULL        | NULL          |              NULL
  6a84deca-8b0f-475d-bbe2-1c684bfc0ec1 | 2023-10-16 20:44:49.184976+00 | 2023-10-16 20:44:49.184976+00 | 864fdb33-a36d-4bb0-96e6-a3b3b057ba8b | 1f86e688-937a-4e58-81f2-a30d697e92c0 |        512 |            131072 |          160 | 1f86e688-937a-4e58-81f2-a30d697e92c0 | 2023-10-16 20:44:50.00492+00 | 2023-10-16 20:44:50.00492+00 | 2023-10-23 17:00:21.299613+00 |     1 | {"type":"volume","block_size":512,"id":"1f86e688-937a-4e58-81f2-a30d697e92c0","sub_volumes":[{"type":"region","block_size":512,"blocks_per_extent":131072,"extent_count":160,"gen":1,"opts":{"id":"1f86e688-937a-4e58-81f2-a30d697e92c0","key":"............................................","lossy":false,"read_only":false,"target":["[fd00:1122:3344:112::8]:19001","[fd00:1122:3344:11b::7]:19000","[fd00:1122:3344:108::7]:19000"]}}]} | {"V1":{"datasets_and_regions":[[{"identity":{"id":"2074a935-c0b3-4c4f-aae5-a29adae3e1ac","time_created":"2023-09-02T13:09:09.879257Z","time_modified":"2023-09-02T13:09:09.879257Z"},"time_deleted":null,"rcgen":1,"pool_id":"ac663368-45fb-447c-811e-561c68e37bdd","ip":"fd00:1122:3344:112::8","port":32345,"kind":"Crucible","size_used":21474836480},{"identity":{"id":"341e684e-c49a-412d-99b1-9a64ce33a6c9","time_created":"2023-10-16T20:44:49.184976Z","time_modified":"2023-10-16T20:44:49.184976Z"},"dataset_id":"2074a935-c0b3-4c4f-aae5-a29adae3e1ac","volume_id":"1f86e688-937a-4e58-81f2-a30d697e92c0","block_size":512,"blocks_per_extent":131072,"extent_count":160}],[{"identity":{"id":"4b7a7052-f8e8-4196-8d6b-315943986ce6","time_created":"2023-09-02T13:09:09.879238Z","time_modified":"2023-09-02T13:09:09.879238Z"},"time_deleted":null,"rcgen":1,"pool_id":"a549421c-2f12-45cc-b691-202f0a9bfa8b","ip":"fd00:1122:3344:108::7","port":32345,"kind":"Crucible","size_used":10737418240},{"identity":{"id":"1c33fde1-34d1-4686-843d-9855c74f2e29","time_created":"2023-10-16T20:44:49.184976Z","time_modified":"2023-10-16T20:44:49.184976Z"},"dataset_id":"4b7a7052-f8e8-4196-8d6b-315943986ce6","volume_id":"1f86e68
8-937a-4e58-81f2-a30d697e92c0","block_size":512,"blocks_per_extent":131072,"extent_count":160}],[{"identity":{"id":"864fdb33-a36d-4bb0-96e6-a3b3b057ba8b","time_created":"2023-09-02T13:09:09.879256Z","time_modified":"2023-09-02T13:09:09.879256Z"},"time_deleted":null,"rcgen":1,"pool_id":"f59674a4-437e-4846-9abf-a71679082c82","ip":"fd00:1122:3344:11b::7","port":32345,"kind":"Crucible","size_used":21474836480},{"identity":{"id":"6a84deca-8b0f-475d-bbe2-1c684bfc0ec1","time_created":"2023-10-16T20:44:49.184976Z","time_modified":"2023-10-16T20:44:49.184976Z"},"dataset_id":"864fdb33-a36d-4bb0-96e6-a3b3b057ba8b","volume_id":"1f86e688-937a-4e58-81f2-a30d697e92c0","block_size":512,"blocks_per_extent":131072,"extent_count":160}]],"datasets_and_snapshots":[]}} | 864fdb33-a36d-4bb0-96e6-a3b3b057ba8b | 2023-09-02 13:09:09.879256+00 | 2023-09-02 13:09:09.879256+00 | NULL         |     1 | f59674a4-437e-4846-9abf-a71679082c82 | fd00:1122:3344:11b::7 | 32345 | crucible | 21474836480 | NULL       | NULL      | NULL        | NULL          |              NULL

One of the related datasets points to a sled that had been physically removed from the rack: fd00:1122:3344:11b::7. The regions returned by the query eventually reach delete_crucible_region, which attempts to delete each region in a retry_until_known_result loop. Because the sled is no longer reachable, the saga stays stuck in a running state indefinitely.

Because the above query looks for regions that have been freed for any reason (rather than only those freed by actions in the containing saga), every future disk_delete saga will also find these regions and attempt to delete them. In effect, once a single disk delete hangs, all future delete operations hang as well.
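
To illustrate why the hang spreads, here is a minimal in-memory sketch of the query's filter. The `Region` struct, its fields, and both function names are hypothetical stand-ins for the joined rows, not Nexus code. The point it tries to show is that the filter takes no disk or saga identifier, so every caller sees the same stale region.

```rust
/// Hypothetical, simplified stand-in for one joined row of the query above.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Region {
    id: u32,
    /// Mirrors `not volume.time_deleted is null`.
    volume_deleted: bool,
    /// Mirrors `region_snapshot.volume_references` (None when the left join finds no row).
    snapshot_references: Option<u32>,
    /// Assumption for illustration: whether the dataset's sled still answers.
    dataset_reachable: bool,
}

/// Same predicate as the WHERE clause. Note that it is not scoped to any
/// particular disk, volume, or saga.
fn freed_regions(regions: &[Region]) -> Vec<&Region> {
    regions
        .iter()
        .filter(|r| r.volume_deleted && matches!(r.snapshot_references, None | Some(0)))
        .collect()
}

/// A saga that retries forever hangs if any freed region is unreachable.
fn saga_would_hang(regions: &[Region]) -> bool {
    freed_regions(regions).iter().any(|r| !r.dataset_reachable)
}
```

Under this model, a single freed region on a removed sled makes `saga_would_hang` return true for every later delete, regardless of which disk that delete was for.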

In the instance where we saw this, it was ultimately the result of data being left over in CockroachDB after trying to clean up from removing sled 10.

@askfongjojo askfongjojo added this to the 4 milestone Oct 25, 2023
@morlandi7 morlandi7 modified the milestones: 4, 5 Nov 14, 2023
@morlandi7 morlandi7 modified the milestones: 5, 6 Nov 30, 2023
@morlandi7 morlandi7 modified the milestones: 6, 7 Jan 29, 2024
@morlandi7 morlandi7 modified the milestones: 7, 8 Mar 12, 2024
@davepacheco davepacheco changed the title from "Disk delete hangs can cause future disk deletes to hang" to "all disk deletes hang while Crucible downstairs is unreachable" Apr 17, 2024
@davepacheco
Collaborator

I updated the synopsis here to reflect that it's not the first hang that causes subsequent hangs -- rather, all disk deletes appear to hang as long as any Crucible downstairs is unreachable (from my read of the description).

@askfongjojo askfongjojo pinned this issue Apr 29, 2024
jmpesp added a commit to jmpesp/omicron that referenced this issue May 1, 2024
When a disk is expunged, any region that was on that disk is assumed to
be gone. A single disk expungement can put many Volumes into degraded
states, as one of the three mirrors of a region set is now gone. Volumes
that are degraded in this way remain degraded until a new region is
swapped in, and the Upstairs performs the necessary repair operation
(either through a Live Repair or Reconciliation). Nexus can only
initiate these repairs - it does not participate in them, instead
requesting that a Crucible Upstairs perform the repair.

These repair operations can only be done by an Upstairs running as part
of an activated Volume: either Nexus has to send this Volume to a Pantry
and repair it there, or Nexus has to talk to a propolis that has that
active Volume. Further complicating things is that the Volumes in
question can be activated and deactivated as a result of user action,
namely starting and stopping Instances. This will interrupt any on-going
repair. This is ok! Both operations support being interrupted, but as a
result it's then Nexus' job to continually monitor these repair
operations and initiate further operations if the current one is
interrupted.

Nexus starts by creating region replacement requests, either manually or
as a result of disk expungement. These region replacement requests go
through the following states:

        Requested   <--
                      |
            |         |
            v         |
                      |
        Allocating  --

            |
            v

         Running    <--
                      |
            |         |
            v         |
                      |
         Driving    --

            |
            v

     ReplacementDone  <--
                        |
            |           |
            v           |
                        |
        Completing    --

            |
            v

        Completed
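
The diagram above can be read as a small state machine. The following is only a sketch of that reading: the enum and the `can_transition_to` method are illustrative names, not the types used in Nexus, and the backward edges are taken from the arrows in the diagram.

```rust
/// States from the diagram above (names copied from it).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ReplacementState {
    Requested,
    Allocating,
    Running,
    Driving,
    ReplacementDone,
    Completing,
    Completed,
}

impl ReplacementState {
    /// Returns true if the diagram draws an arrow from `self` to `next`.
    /// Forward arrows are normal progress; backward arrows are saga unwinds
    /// (or, for Driving -> Running, a drive pass that did not finish the repair).
    fn can_transition_to(self, next: ReplacementState) -> bool {
        use ReplacementState::*;
        matches!(
            (self, next),
            (Requested, Allocating)
                | (Allocating, Running)
                | (Allocating, Requested)
                | (Running, Driving)
                | (Driving, ReplacementDone)
                | (Driving, Running)
                | (ReplacementDone, Completing)
                | (Completing, Completed)
                | (Completing, ReplacementDone)
        )
    }
}
```

The intermediate states (`Allocating`, `Driving`, `Completing`) act as locks: a request in one of them is not picked up by another invocation of the same saga.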

A single saga invocation is not enough to continually make sure a Volume
is being repaired, so region replacement is structured as a series of
background tasks and saga invocations from those background tasks.

Here's a high level summary:

- a `region replacement` background task:

  - looks for disks that have been expunged and inserts region
    replacement requests into CRDB with state `Requested`

  - looks for all region replacement requests in state `Requested`
    (picking up new requests and requests that failed to transition to
    `Running`), and invokes a `region replacement start` saga.

- the `region replacement start` saga:

  - transitions the request to state `Allocating`, blocking out other
    invocations of the same saga

  - allocates a new replacement region

  - alters the Volume Construction Request by swapping out the old
    region for the replacement one

  - transitions the request to state `Running`

  - any unwind will transition the request back to the `Requested`
    state.

- a `region replacement drive` background task:

  - looks for requests with state `Running`, and invokes the `region
    replacement drive` saga for those requests

  - looks for requests with state `ReplacementDone`, and invokes the
    `region replacement finish` saga for those requests

- the `region replacement drive` saga will:

  - transition a request to state `Driving`, again blocking out other
    invocations of the same saga

  - check if Nexus has taken an action to initiate a repair yet. if not,
    then one is needed. if it _has_ previously initiated a repair
    operation, the state of the system is examined: is that operation
    still running? has something changed? further action may be required
    depending on this observation.

  - if an action is required, Nexus will prepare an action that will
    initiate either Live Repair or Reconciliation based on the current
    observed state of the system.

  - that action is then executed. if there was an error, then the saga
    unwinds. if it was successful, it is recorded as a "repair step" in
    CRDB and will be checked the next time the saga runs.

  - if Nexus observed an Upstairs telling it that a repair was completed
    or not necessary, then the request is placed into the
    `ReplacementDone` state, otherwise it is placed back into the
    `Running` state. if the saga unwinds, it unwinds back to the
    `Running` state.

- finally, the `region replacement finish` saga will:

  - transition a request into `Completing`

  - delete the old region by deleting a transient Volume that refers to
    it (in the case where a sled or disk is actually physically gone,
    expunging that will trigger oxidecomputer#4331, which needs to be fixed!)

  - transition the request to the `Completed` state
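
As a rough sketch of the decision the `region replacement drive` saga makes on each pass, the two functions below model "is an action needed?" and "which state does the request end in?". All type and function names here are hypothetical, and treating an interrupted prior step as always requiring a new action is an assumption; the summary above only says further action "may be required".

```rust
/// What Nexus finds when it checks the last recorded repair step (hypothetical model).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PriorStep {
    NeverInitiated,
    StillRunning,
    Interrupted,
}

/// What an Upstairs told Nexus about the repair (hypothetical model).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum UpstairsReport {
    RepairCompleted,
    RepairNotNeeded,
    NoConclusion,
}

/// The two states a drive pass can leave the request in.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RequestState {
    Running,
    ReplacementDone,
}

/// No prior action means one is needed. An interrupted one is assumed
/// (for this sketch) to need re-initiating; a running one is left alone.
fn action_required(prior: PriorStep) -> bool {
    match prior {
        PriorStep::NeverInitiated | PriorStep::Interrupted => true,
        PriorStep::StillRunning => false,
    }
}

/// Completed or unnecessary repairs move the request forward; anything else
/// puts it back into `Running` so the background task drives it again.
fn state_after_drive(report: UpstairsReport) -> RequestState {
    match report {
        UpstairsReport::RepairCompleted | UpstairsReport::RepairNotNeeded => {
            RequestState::ReplacementDone
        }
        UpstairsReport::NoConclusion => RequestState::Running,
    }
}
```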

More detailed documentation is provided in the docstring at the
beginning of each region replacement saga.

Testing was done manually using the Canada region using the following
test cases:

- a disk needing repair is attached to an instance for the duration of
  the repair

- a disk needing repair is attached to an instance that is migrated
  mid-repair

- a disk needing repair is attached to an instance that is stopped
  mid-repair

- a disk needing repair is attached to an instance that is stopped
  mid-repair, then started in the middle of the pantry's repair

- a detached disk needs repair

- a detached disk needs repair, and is then attached to an instance that
  is then started

- a sled is expunged, causing region replacement requests for all
  regions on it

Fixes oxidecomputer#3886
Fixes oxidecomputer#5191
@morlandi7 morlandi7 modified the milestones: 8, 9 May 13, 2024
jmpesp added a commit to jmpesp/omicron that referenced this issue May 17, 2024
If there's a call to an external service, saga execution cannot move
forward until the result of that call is known, in the sense that Nexus
received a result. If there are transient problems, Nexus must retry
until a known result is returned.

This is problematic when the destination service is gone - Nexus will
retry indefinitely, halting the saga execution. Worse, in the case of
sagas calling the volume delete subsaga, subsequent calls will also
halt.

With the introduction of a physical disk policy, Nexus can know when to
stop retrying a call - the destination service is gone, so the known
result is an error.

This commit adds a `ProgenitorOperationRetry` object that takes an
operation to retry plus a "gone" check, and checks each retry iteration
if the destination is gone. If it is, then bail out, otherwise assume
that any errors seen are transient.
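
The idea can be sketched in a few lines. This is not the `ProgenitorOperationRetry` implementation: it is a synchronous, standard-library-only approximation, and the function name, the `Gone` type, and the fixed backoff are assumptions made for illustration.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Returned when the "gone" check says the destination no longer exists.
#[derive(Debug, PartialEq, Eq)]
struct Gone;

/// Retry `operation` until it succeeds, checking `is_gone` before every
/// attempt. Every error from `operation` is treated as transient, as the
/// commit message describes; only the gone check ends the loop early.
fn retry_until_known_or_gone<T, E>(
    mut operation: impl FnMut() -> Result<T, E>,
    mut is_gone: impl FnMut() -> bool,
    backoff: Duration,
) -> Result<T, Gone> {
    loop {
        if is_gone() {
            // The known result is an error: stop retrying.
            return Err(Gone);
        }
        match operation() {
            Ok(value) => return Ok(value),
            // Assumed transient: wait, then go around again.
            Err(_) => sleep(backoff),
        }
    }
}
```

Compared with an unconditional retry loop, the only change is the `is_gone` check at the top of each iteration, which gives the saga a way to move forward (or unwind) when the destination has been expunged.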

Further work is required to deprecate the `retry_until_known_result`
function, as retrying indefinitely is a bad pattern.

Fixes oxidecomputer#4331
Fixes oxidecomputer#5022
@jmpesp jmpesp linked a pull request May 17, 2024 that will close this issue