Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mds: Skip sanity check during upgrades to 16.2.7 #9418

Merged
merged 1 commit into from Dec 16, 2021

Conversation

travisn
Copy link
Member

@travisn travisn commented Dec 14, 2021

Description of your changes:
The mons need to be configured to skip the mds sanity check during upgrade to 16.2.7 or the mons may crash during that upgrade.

Which issue is resolved by this Pull Request:
Resolves #9373

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Skip Tests for Docs: Add the flag for skipping the build if this is only a documentation change. See here for the flag.
  • Skip Unrelated Tests: Add a flag to run tests for a specific storage provider. See test options.
  • Reviewed the developer guide on Submitting a Pull Request
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.
  • Pending release notes updated with breaking and/or notable changes, if necessary.
  • Upgrade from previous release is tested and upgrade user guide is updated, if necessary.
  • Code generation (make codegen) has been run to update object specifications, if necessary.

@@ -92,6 +92,13 @@ func (c *cluster) reconcileCephDaemons(rookImage string, cephVersion cephver.Cep
}
c.ClusterInfo.SetName(c.namespacedName.Name)

// Execute actions before the monitors are up and running
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unclear, how can we run commands against the mon store if the mons are not up and running? :)
So this should happen before the mons are updated to 16.2.8, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These actions are only intended for upgrade or at least an existing cluster, i'll clarify the comment.

// If there is an error, just print it and continue. Likely there is not a
// negative consequence of continuing since several complex conditions must exist to hit
// the upgrade issue where the sanity checks need to be disabled.
logger.Warningf("failed to disable the mon_mds_skip_sanity. %v", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be an error? If this fails and the mons upgrade they will fail too right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still haven't been able to repro. Since it only affects some clusters and there is a workaround if they do hit it, this doesn't seem critical to fail the reconcile.

func (c *cluster) preMonStartupActions(cephVersion cephver.CephVersion) error {
// Disable the mds sanity checks for the mons due to a ceph upgrade issue
// for the mds to Pacific if 16.2.7 or greater.
if c.isUpgrade && cephVersion.IsPacific() && cephVersion.IsAtLeast(cephver.CephVersion{Major: 16, Minor: 2, Extra: 7}) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be "not 16.2.8", so that this is only true for the upgrade from 16.2.7 to 16.2.8? Or is this valid for all the following upgrades?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The upgrade issue can happen if we are upgrading ceph to 16.2.7. For future pacific upgrades we don't really need the setting applied, but to keep the logic simple it will also apply the setting for other pacific upgrades in the future as well. I don't want to restrict the check because if they don't upgrade exactly to 16.2.7 we might miss the setting. For example, if they upgrade from 16.2.4 directly to 16.2.9 in the future, we still would need to apply the setting or risk the upgrade issue.

@@ -464,6 +501,15 @@ func (c *cluster) postMonStartupActions() error {
return errors.Wrap(err, "failed to enable Ceph messenger version 2")
}

// Always ensure the skip mds sanity checks setting is cleared, for all Pacific deployments
if c.ClusterInfo.CephVersion.IsPacific() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be only for "is 16.2.8"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want to clear it at the end of the upgrade. After the mons are restarted on 16.2.7, there is no need for the setting anymore. But since they might have skipped 16.2.7 and directly to another patch release, it's made more general and doesn't hurt to always do it for pacific.

The mons need to be configured to skip the mds sanity check
during upgrade to 16.2.7 or the mons may crash during that
upgrade.

Signed-off-by: Travis Nielsen <tnielsen@redhat.com>
Copy link
Member

@BlaineEXE BlaineEXE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks okay to me. I also have some concerns like here https://github.com/rook/rook/pull/9418/files#r769383135. I'm not sure I have a strong opinion about what the best behavior is, but having an attempt to fix seems better than not.

Have any users checked this?

@travisn
Copy link
Member Author

travisn commented Dec 16, 2021

Looks okay to me. I also have some concerns like here https://github.com/rook/rook/pull/9418/files#r769383135. I'm not sure I have a strong opinion about what the best behavior is, but having an attempt to fix seems better than not.

Attempting which fix? You mean we should fail the reconcile? Since there is another workaround I'm not seeing that as necessary to avoid the risk of failing a reconcile when the cluster wouldn't hit the upgrade issue anyway most of the time.

Have any users checked this?

Are you asking if they have tried the workaround? Yes, applying the workaround described in #9373 worked for them.

Copy link
Member

@leseb leseb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's move forward and perhaps revisit this condition later https://github.com/rook/rook/pull/9418/files#r769383135

@travisn travisn merged commit b7a145e into rook:master Dec 16, 2021
@travisn travisn deleted the mds-skip-sanity branch December 16, 2021 16:53
travisn added a commit that referenced this pull request Dec 16, 2021
mds: Skip sanity check during upgrades to 16.2.7 (backport #9418)
travisn added a commit that referenced this pull request Dec 16, 2021
mds: Skip sanity check during upgrades to 16.2.7 (backport #9418)
// In a running cluster disable the mds skip sanity setting during upgrades.
monStore := config.GetMonStore(c.context, c.ClusterInfo)
if skip {
if err := monStore.Set("mon", "mon_mds_skip_sanity", "1"); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this change the ceph.conf of the container running the monitor or is it through ceph config set mon ....?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @leseb

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@batrick The monStore.Set() translates into the ceph config set mon command.

Rook actually does not use the ceph.conf overrides by default, they are just reserved for custom overrides not touched by the operator.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@travisn this problem is actually different. (The original sanity check config was to protect the older mons from changes the new mons make. Not vice-versa.) I think I'll need to write a fix for the mons.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Upgrade to Ceph v16.2.7 may need to set mon_mds_skip_sanity
4 participants