manage cockroachdb cluster version with blueprints #5603
Conversation
Labeling for Release 9 since this is relatively high-risk and not worth slipping the schedule for, I think. This will slip CockroachDB 22.2 to R10.
This is looking great. I have some nits here and some larger things but this definitely feels like the right direction!
Do you have thoughts on whether it makes sense to keep the Blueprint implementation as-is (we add a column specifically for the value we want to set the …)? The latter makes more sense to me, although it might be overkill. But I know when we are able to move beyond v23.2, which provides a different mechanism for performing upgrades, we'll instead target the …
Hmm. When you first mentioned it, I thought the separate table made sense. But if we do that and have a separate table of …
That makes sense. And, yeah, I don't really expect setting cluster options to be something we regularly do outside of performing upgrades. I think I will change the column definition to more clearly reflect that it is a 1:1 mapping to that particular setting, and that …
Hmm. Something else we need to think about: if a rack is initialized on or after a "tick" release and before its corresponding "tock" release, the cluster is created with the major version after the one we specify by policy. The policy will never be effective in that state. We also can't downgrade that particular rack to an older version of CockroachDB in a future Oxide software release. Idly wondering how ridiculous it would be to have RSS spin up a CockroachDB cluster with an older zone, set …

(edit) I suppose that's a problem for the release when we actually want to upgrade the CockroachDB software, though. This particular change has to land in a release before then :)
I figured out a way to, in a single statement, verify some basic cluster state against a caller-provided token and prevent modifying settings if the state doesn't match. It uses a combination of two versions: the current cluster version and the major version of the CockroachDB node we're currently talking to. (Since that will give us some pain when we start doing rolling upgrades of CockroachDB nodes, I should move that part of execution from the beginning to the end.) Other things I still need to do: …
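To make the shape of this concrete, the statements involved might look roughly like the sketch below. The fingerprinting scheme and version value here are illustrative assumptions, not the PR's actual implementation, which combines the check and the write:

```sql
-- Illustrative sketch only: the fingerprint scheme here is invented.
SHOW CLUSTER SETTING version;                      -- current cluster version
SELECT crdb_internal.node_executable_version();    -- major version of this node
-- If the caller-supplied token matches a fingerprint derived from the two
-- values above, it is safe to apply the new setting:
SET CLUSTER SETTING cluster.preserve_downgrade_option = '22.1';
```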
I've written up a file in … Still need to write tests of the blueprint planning and execution bits.
Marking ready for review; the only thing that's left is to update the display/diff output of blueprints with this data, but @andrewjstone tells me he's in the middle of refactoring all that, so I think it's maybe fine to do it in a follow-up PR.
```rust
///
/// This fingerprint should return a STRING value. It is safe to modify how this
/// fingerprint is calculated between Nexus releases; the stale fingerprint in
/// the previous blueprint will be rejected.
```
This is an excellent comment
```rust
    policy
} else {
    // Ensure `cluster.preserve_downgrade_option` is reset
    ""
```
I think the answer to this is "no" but wanted to ask just in case - should we check here for "policy is older than version"? If I understand correctly, that should be impossible (that means we think we want to run a CRDB major version that is behind the current cluster version), so I'm guessing it's not worth checking for, since such a situation would blow up long before we got here.
Based on the conversations in #oxide-update, it sounds like we want to use the policy mechanism to hold back cluster versions as we upgrade CockroachDB, but to use the newer cluster version for any newly-initialized racks. In that case, "policy is older than version" is something that will indeed happen. As part of converting the value of `cockroachdb_settings_preserve_downgrade` to an enum, I was able to make this distinction a lot more clear and handle this case.
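As a rough sketch of what such an enum conversion could look like: the type and parsing rules below are illustrative assumptions (the real omicron type wraps a `CockroachDbClusterVersion`, stubbed here as a `String`), not the PR's exact code.

```rust
// Hypothetical sketch: parse the raw cluster-setting string into an enum
// that distinguishes "unset" from "pinned to a version".
use std::str::FromStr;

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum CockroachDbPreserveDowngrade {
    /// The setting is unset, allowing the cluster version to finalize.
    AllowUpgrade,
    /// The setting pins the cluster to a major version (stand-in String
    /// for the real CockroachDbClusterVersion type).
    Set(String),
}

impl FromStr for CockroachDbPreserveDowngrade {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            // CockroachDB reports an unset option as the empty string.
            "" => Ok(CockroachDbPreserveDowngrade::AllowUpgrade),
            // Accept "MAJOR.MINOR" version strings like "22.1".
            v if v.split('.').count() == 2
                && v.split('.').all(|p| !p.is_empty()
                    && p.chars().all(|c| c.is_ascii_digit())) =>
            {
                Ok(CockroachDbPreserveDowngrade::Set(v.to_string()))
            }
            other => Err(format!("invalid preserve_downgrade_option: {other:?}")),
        }
    }
}

fn main() {
    assert_eq!(
        "".parse::<CockroachDbPreserveDowngrade>(),
        Ok(CockroachDbPreserveDowngrade::AllowUpgrade)
    );
    assert_eq!(
        "22.1".parse::<CockroachDbPreserveDowngrade>(),
        Ok(CockroachDbPreserveDowngrade::Set("22.1".into()))
    );
    assert!("bogus".parse::<CockroachDbPreserveDowngrade>().is_err());
}
```

The key point is that "unset" and "pinned" become impossible to confuse at call sites, which is what lets the planner handle "policy is older than version" explicitly.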
Good news: this all works!
Question I have now that I've tested deploying this: since we currently do not enable execution of blueprints, should RSS do anything special when it creates the initial blueprint, and/or should RSS separately try to set the …
Nice! 🎉
I agree it doesn't matter for systems like madrid, but if we were setting up a new production rack today, would we want …
That indeed seems like the right place to do it. I'll add that.
Force-pushed from 7e1486d to aa252b2.
Okay, I think this has approached a state where it's ready for final review.
```rust
/// Whether to set `cluster.preserve_downgrade_option` and what to set it to
/// (`None` if this value was retrieved from the database and was invalid)
pub cockroachdb_setting_preserve_downgrade:
    Option<CockroachDbPreserveDowngrade>,
```
Small nit, feel free to ignore - do you think this would be clearer as a `Result<CockroachDbPreserveDowngrade, _>` where the error case indicates "something is wrong in the DB" and probably holds the actual string we couldn't deserialize?
Hmm, so a `Result<CockroachDbPreserveDowngrade, String>`, and then handle that in the blueprint diff spots? idk if that would be clearer, but it might be nicer regardless.
Yeah, or maybe a newtype around a string (for clarity of what that string is). I don't feel strongly about this either way.
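A minimal sketch of the newtype idea being floated, assuming illustrative names throughout (none of these types are the actual omicron definitions):

```rust
// Hypothetical newtype wrapping the raw string we failed to deserialize,
// so the Err side of the Result is self-describing.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RawClusterSettingString(pub String);

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum CockroachDbPreserveDowngrade {
    AllowUpgrade,
    Set(String), // stand-in for CockroachDbClusterVersion
}

pub struct BlueprintSettings {
    // Err carries the invalid DB string, per the review suggestion above.
    pub cockroachdb_setting_preserve_downgrade:
        Result<CockroachDbPreserveDowngrade, RawClusterSettingString>,
}

fn main() {
    let bad = BlueprintSettings {
        cockroachdb_setting_preserve_downgrade:
            Err(RawClusterSettingString("bogus".into())),
    };
    // Blueprint diff code can now show exactly what the DB contained.
    if let Err(raw) = &bad.cockroachdb_setting_preserve_downgrade {
        assert_eq!(raw.0, "bogus");
    }
}
```

Compared to `Option`, either shape preserves the invalid input for display; the newtype just documents what the error string is.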
```rust
    AllowUpgrade,
    /// Ensure the setting is set to a given cluster version.
    Set(CockroachDbClusterVersion),
}
```
Thanks for changing this; I think it's a nice improvement.
Sigh, …
Force-pushed from 4a73ca2 to 0871b51.
To upgrade CockroachDB, we'll need to manage the `cluster.preserve_downgrade_option` cluster setting to give ourselves the opportunity to roll back an upgrade. I was initially planning to manage this with database migrations, but `SET CLUSTER SETTING` cannot be run as part of a multi-statement transaction.

In the limit, the Reconfigurator will need to do this anyway as it performs rolling upgrades of CockroachDB nodes, so we may as well teach it to manage cluster settings today.
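For context, this is CockroachDB's documented rolling-upgrade procedure, which the Reconfigurator is being taught to drive; the version numbers below are illustrative:

```sql
-- Hold the cluster at the old version so the binary upgrade can be rolled back:
SET CLUSTER SETTING cluster.preserve_downgrade_option = '22.1';
-- ...replace each node's binary with the 22.2 release and validate...
-- Once satisfied, clear the setting so the cluster version auto-finalizes:
RESET CLUSTER SETTING cluster.preserve_downgrade_option;
```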