New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
mds: Skip sanity check during upgrades to 16.2.7 #9418
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -92,6 +92,14 @@ func (c *cluster) reconcileCephDaemons(rookImage string, cephVersion cephver.Cep | |
} | ||
c.ClusterInfo.SetName(c.namespacedName.Name) | ||
|
||
// Execute actions before the monitors are up and running, if needed during upgrades. | ||
// These actions would be skipped in a new cluster. | ||
logger.Debug("monitors are about to reconcile, executing pre actions") | ||
err = c.preMonStartupActions(cephVersion) | ||
if err != nil { | ||
return errors.Wrap(err, "failed to execute actions before reconciling the ceph monitors") | ||
} | ||
|
||
// Start the mon pods | ||
controller.UpdateCondition(c.ClusterInfo.Context, c.context, c.namespacedName, cephv1.ConditionProgressing, v1.ConditionTrue, cephv1.ClusterProgressingReason, "Configuring Ceph Mons") | ||
clusterInfo, err := c.mons.Start(c.ClusterInfo, rookImage, cephVersion, *c.Spec) | ||
|
@@ -443,6 +451,37 @@ func (c *cluster) replaceDefaultCrushMap(newRoot string) (err error) { | |
return nil | ||
} | ||
|
||
// preMonStartupActions is a collection of actions to run before the monitors are reconciled. | ||
func (c *cluster) preMonStartupActions(cephVersion cephver.CephVersion) error { | ||
// Disable the mds sanity checks for the mons due to a ceph upgrade issue | ||
// for the mds to Pacific if 16.2.7 or greater. We keep it more general for any | ||
// Pacific upgrade greater than 16.2.7 in case they skip updrading directly to 16.2.7. | ||
if c.isUpgrade && cephVersion.IsPacific() && cephVersion.IsAtLeast(cephver.CephVersion{Major: 16, Minor: 2, Extra: 7}) { | ||
if err := c.skipMDSSanityChecks(true); err != nil { | ||
// If there is an error, just print it and continue. Likely there is not a | ||
// negative consequence of continuing since several complex conditions must exist to hit | ||
// the upgrade issue where the sanity checks need to be disabled. | ||
logger.Warningf("failed to disable the mon_mds_skip_sanity. %v", err) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Shouldn't this be an error? If this fails and the mons upgrade they will fail too right? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I still haven't been able to repro. Since it only affects some clusters and there is a workaround if they do hit it, this doesn't seem critical to fail the reconcile. |
||
} | ||
} | ||
return nil | ||
} | ||
|
||
func (c *cluster) skipMDSSanityChecks(skip bool) error { | ||
// In a running cluster disable the mds skip sanity setting during upgrades. | ||
monStore := config.GetMonStore(c.context, c.ClusterInfo) | ||
if skip { | ||
if err := monStore.Set("mon", "mon_mds_skip_sanity", "1"); err != nil { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Does this change the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. cc @leseb There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @batrick The Rook actually does not use the ceph.conf overrides by default, they are just reserved for custom overrides not touched by the operator. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @travisn this problem is actually different. (The original sanity check config was to protect the older mons from changes the new mons make. Not vice-versa.) I think I'll need to write a fix for the mons. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. |
||
return err | ||
} | ||
} else { | ||
if err := monStore.Delete("mon", "mon_mds_skip_sanity"); err != nil { | ||
return err | ||
} | ||
} | ||
return nil | ||
} | ||
|
||
// postMonStartupActions is a collection of actions to run once the monitors are up and running | ||
// It gets executed right after the main mon Start() method | ||
// Basically, it is executed between the monitors and the manager sequence | ||
|
@@ -464,6 +503,15 @@ func (c *cluster) postMonStartupActions() error { | |
return errors.Wrap(err, "failed to enable Ceph messenger version 2") | ||
} | ||
|
||
// Always ensure the skip mds sanity checks setting is cleared, for all Pacific deployments | ||
if c.ClusterInfo.CephVersion.IsPacific() { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Shouldn't this be only for "is 16.2.8"? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We want to clear it at the end of the upgrade. After the mons are restarted on 16.2.7, there is no need for the setting anymore. But since they might have skipped 16.2.7 and directly to another patch release, it's made more general and doesn't hurt to always do it for pacific. |
||
if err := c.skipMDSSanityChecks(false); err != nil { | ||
// If there is an error, just print it and continue. We can just try again | ||
// at the next reconcile. | ||
logger.Warningf("failed to re-enable the mon_mds_skip_sanity. %v", err) | ||
} | ||
} | ||
|
||
crushRoot := client.GetCrushRootFromSpec(c.Spec) | ||
if crushRoot != "default" { | ||
// Remove the root=default and replicated_rule which are created by | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be "not 16.2.8", so that this is only true for the upgrade from 16.2.7 to 16.2.8? Or is this valid for all the following upgrades?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The upgrade issue can happen if we are upgrading ceph to 16.2.7. For future pacific upgrades we don't really need the setting applied, but to keep the logic simple it will also apply the setting for other pacific upgrades in the future as well. I don't want to restrict the check because if they don't upgrade exactly to 16.2.7 we might miss the setting. For example, if they upgrade from 16.2.4 directly to 16.2.9 in the future, we still would need to apply the setting or risk the upgrade issue.