New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding a previously-expunged sled leaves it expunged #5625
Milestone
Comments
Would implementing #5554 fix this? |
Probably, with at least two caveats:
|
This was referenced Apr 26, 2024
jgallagher
added a commit
that referenced
this issue
May 1, 2024
I ran into this as a part of #5625, where adding a previously-expunged sled appeared to succeed, but didn't actually add anything new. Today if we try to add a sled that is already _running_, we get a 500, because Nexus fails when it tries to tell the sled-agent to start. But with this PR, we fail earlier: adding a sled that already has a subnet allocation fails before we even try to talk to the sled-agent, because it means someone has already added this sled: ``` root@oxz_switch:~# omdb -w nexus sleds add g2 i86pc added sled g2 (i86pc): 90413e40-8139-43b4-9081-365dab6e5579 root@oxz_switch:~# omdb -w nexus sleds add g2 i86pc Error: adding sled Caused by: Error Response: status: 400 Bad Request; headers: {"content-type": "application/json", "x-request-id": "9eb95a9f-3fe0-4f75-8846-13490b95500e", "content-length": "188", "date": "Tue, 30 Apr 2024 20:54:49 GMT"}; value: Error { error_code: Some("ObjectAlreadyExists"), message: "already exists: sled \"g2 / i86pc (90413e40-8139-43b4-9081-365dab6e5579)\"", request_id: "9eb95a9f-3fe0-4f75-8846-13490b95500e" } ``` This does change the external API slightly (204 -> 201 created, and we now return the ID), but I think (?) that's probably fine since we have no real consumers of that yet.
jgallagher
added a commit
that referenced
this issue
May 13, 2024
This adds the `sled_state` map described in RFD 457 to `Blueprint`. The motivating driver here is decommissioning expunged sleds, which is on the path to addressing #5625, but this PR only adds the field - the planner carries forward states from its inputs, but otherwise makes no changes, and the executor doesn't read the new field at all.
jgallagher
added a commit
that referenced
this issue
May 13, 2024
This builds on (and is currently pointed at) #5663, and is on the path to fixing #5625, although there is still more work to do there. (Should be small, and I'll start on it while this is in review to make sure it really fixes it.) I don't plan on landing this before R8 is out the door, but wanted to go ahead and open it up to review. Much of this is fallout from our discussions about what it means to decommission a sled. Major changes are: * `SledFilter::All` has been renamed to `SledFilter::Commissioned`, and no longer returns sleds that are in `SledState::Decommissioned`. * The blueprint planner will now update the desired sled state to `Decommissioned` for sleds which satisfy our conditions. (See `do_plan_decommission()`.) * The blueprint planner will carry forward the Omicron zones of decommissioned sleds to child blueprints. Pruning these is #5552. * The blueprint planner will _not_ carry forward a desired sled state of `Decommissioned` once the inputs report that the sled has already been decommissioned. * The blueprint executor will decommission sleds that the planner said to. * Decommissioning a sled implicitly decommissions all its disks. (This matches what happens with sled expungement, and should not interfere with region replacement, which keys off of policy, not state.)
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
On an a4x2 testbed, I went through this sequence:
omdb db reconfigurator-save
,reconfigurator-cli
, andomdb nexus blueprints import
to craft a blueprint that gives g2 a Nexus zone/pool/int/*
)a4x2 hyperstop g2
to shut off the sledomdb db nexus sleds expunge ...
to expunge the sleda4x2 hyperstart g2
to bring the sled back onlineAt this point the sled shows up as uninitialized:
And adding it appears to succeed:
However, after doing so, it still shows up in
list-uninitialized
:and we can even apparently add it again (again successfully):
but it remains in the
list-uninitialized
output.We expect this to work, and the add should create a new
sled_id
for this same baseboard we've seen before. But looking in CRDB, we only have one row for g2 (the one created when we added it the first time), and it is stillexpunged
:Tracing through the add sled flow, I think the problem is in
DataStore::allocate_sled_underlay_subnet_octets()
. This is the method that creates a new sled ID (line 309); however,hw_baseboard_id
is tied to the baseboard which hasn't changed. If we find a matching allocation on lines 315-318, we'll stop and return that prior allocation. In CRDB, we can see that we do have a prior allocation for this baseboard (as expected, since we've added this sled before!):I think (?) if
allocate_sled_underlay_subnet_octets()
finds a prior allocation with a matching baseboard, it should check the sled table for that allocation's sled_id, and if it's expunged, ignore that allocation and continue looping (resulting in it either finding a non-expunged match, or exiting the loop and returningnew_allocation
).The text was updated successfully, but these errors were encountered: