
mds downgrade from higher version to lower version #8576

Closed
fengjiankui121 opened this issue Aug 23, 2021 · 13 comments

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior:

Expected behavior:
The mds downgrade should not be allowed.

How to reproduce it (minimal and precise):

Step 1: Install Ceph at the higher version.
Step 2: Update the cluster to the lower Ceph version.
Step 3: Wait for the OSD deployment to complete, then restart the operator.
Step 4: The mds is downgraded from the higher version to the lower version.

File(s) to submit:

  • Cluster CR (custom resource), typically called cluster.yaml, if necessary
  • Operator's logs, if necessary
  • Crashing pod(s) logs, if necessary

To get logs, use kubectl -n <namespace> logs <pod name>
When pasting logs, always surround them with backticks or use the insert code button from the GitHub UI.
Read the GitHub documentation if you need help.

Environment:

  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Cloud provider or hardware configuration:
  • Rook version (use rook version inside of a Rook Pod):
  • Storage backend version (e.g. for ceph do ceph -v):
  • Kubernetes version (use kubectl version):
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift):
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox):
@travisn (Member) commented Aug 23, 2021

@fengjiankui121 If you don't restart the operator, do you see the same behavior? The operator intends to prevent Ceph downgrades, but there may be an issue when the operator is restarted.

@fengjiankui121 (Contributor, Author) commented:

@travisn The same behavior occurs if other operations cause the operator to reconcile again, for example a CephCluster configuration change.

@fengjiankui121 (Contributor, Author) commented:

@travisn This problem is mainly caused by inconsistent versions among the Ceph components, which triggers the mds update. The relevant code is as follows:

rook/pkg/operator/ceph/cluster/version.go

    if numberOfCephVersions > 1 {
        // let's return immediately
        logger.Warningf("it looks like we have more than one ceph version running. triggering upgrade. %+v:", runningVersions.Overall)
        return true, nil
    }

@fengjiankui121 (Contributor, Author) commented:

@travisn The OSDs, monitors, etc. are allowed to move to the lower version, but the mds is not, which leaves the Ceph components at inconsistent versions, and that inconsistency is what then triggers the mds update.
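
For illustration, here is a minimal, self-contained sketch of how the two checks interact (the cephVersion type and the isInferior/shouldReconcile functions are simplified stand-ins, not the actual Rook code): once the daemons report more than one running version, the early return fires before the downgrade guard is ever evaluated, so the reconcile proceeds and the mds is moved to the lower image version.

    package main

    import "fmt"

    // cephVersion is a simplified stand-in for Rook's cephver.CephVersion.
    type cephVersion struct{ Major, Minor, Extra int }

    // isInferior mirrors the idea behind cephver.IsInferior: true if a is older than b.
    func isInferior(a, b cephVersion) bool {
        if a.Major != b.Major {
            return a.Major < b.Major
        }
        if a.Minor != b.Minor {
            return a.Minor < b.Minor
        }
        return a.Extra < b.Extra
    }

    // shouldReconcile sketches the decision in version.go: runningVersions is the
    // set of versions the daemons currently report, imageSpec is the version of
    // the image in the CephCluster CR, clusterRunning is the version the cluster
    // is considered to be running.
    func shouldReconcile(runningVersions map[cephVersion]int, imageSpec, clusterRunning cephVersion) (bool, error) {
        if len(runningVersions) > 1 {
            // Mixed versions (e.g. the mds still on 16.2.5 while everything else
            // is on 16.2.4): return immediately and reconcile; the downgrade
            // guard below is never reached.
            return true, nil
        }
        if isInferior(imageSpec, clusterRunning) {
            // Mirrors the quoted code: the error is what blocks the downgrade.
            return true, fmt.Errorf("image spec version %v is lower than the running cluster version %v, downgrading is not supported", imageSpec, clusterRunning)
        }
        return true, nil
    }

    func main() {
        v4, v5 := cephVersion{16, 2, 4}, cephVersion{16, 2, 5}
        // State after the operator restart: mon/mgr/osd already on 16.2.4, the mds still on 16.2.5.
        running := map[cephVersion]int{v4: 5, v5: 2}
        ok, err := shouldReconcile(running, v4, v5)
        fmt.Println(ok, err) // true <nil> -> the reconcile runs and the mds is downgraded to 16.2.4
    }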

@travisn (Member) commented Sep 3, 2021

@fengjiankui121 To be clear, are you seeing that the mds is not updated when the Ceph version is downgraded, but that if you restart the operator, the mds is updated to the downgraded version? And the bug is that you expect the mds to downgrade without the operator restart?

@fengjiankui121 (Contributor, Author) commented:

@travisn yes, it is

@travisn (Member) commented Sep 14, 2021

Ok, sounds like we just need to relax the check and allow the mds downgrade. Downgrading isn't really supported, but the reality is that sometimes it is better to risk the downgrade than to stay in a broken state after an upgrade.
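
A purely illustrative sketch of what relaxing the check could look like (relaxedCheck and the allowDowngrade flag are hypothetical names, not actual Rook settings, and this is not necessarily how the issue was eventually fixed): warn and proceed when the image version is inferior, instead of returning an error, so that every daemon, including the mds, converges on the lower version.

    package main

    import "log"

    // relaxedCheck decides whether a reconcile should run. Versions are reduced
    // to comparable integers (e.g. 160205 for 16.2.5) just for this sketch.
    func relaxedCheck(imageSpecVersion, clusterRunningVersion int, allowDowngrade bool) (bool, error) {
        if imageSpecVersion < clusterRunningVersion {
            if !allowDowngrade {
                log.Printf("image version %d is lower than running version %d; refusing to downgrade", imageSpecVersion, clusterRunningVersion)
                return false, nil
            }
            // Instead of returning an error, warn loudly and let the reconcile
            // continue so every daemon, including the mds, moves to the lower version.
            log.Printf("downgrading from %d to %d; downgrades are not supported, proceed at your own risk", clusterRunningVersion, imageSpecVersion)
        }
        return true, nil
    }

    func main() {
        reconcile, _ := relaxedCheck(160204, 160205, true)
        log.Println("reconcile:", reconcile) // reconcile: true
    }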

sp98 self-assigned this Sep 15, 2021
@sp98 (Contributor) commented Sep 15, 2021

@fengjiankui121 I need help reproducing this behavior. The steps I followed are below:

  • Deploy rook-ceph (master branch) with ceph 16.2.5.
  • Create cluster/examples/kubernetes/ceph/filesystem-test.yaml
  • Downgrade ceph version to image: quay.io/ceph/ceph:v16.2.4 in the cephCluster yaml
  • Wait for downgrade to complete.

Before downgrade:

 versions:
      mds:
        ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable): 2
      mgr:
        ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable): 1
      mon:
        ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable): 3
      osd:
        ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable): 1
      overall:
        ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable): 7

After downgrade:

    versions:
      mds:
        ceph version 16.2.4 (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable): 2
      mgr:
        ceph version 16.2.4 (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable): 1
      mon:
        ceph version 16.2.4 (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable): 3
      osd:
        ceph version 16.2.4 (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable): 1
      overall:
        ceph version 16.2.4 (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable): 7

The mds got downgraded to 16.2.4 just by downgrading the Ceph version from 16.2.5 to 16.2.4 in the CephCluster yaml.

Let me know if I'm missing something.

@fengjiankui121 (Contributor, Author) commented:

@sp98 From the Rook code you can see that the mds is not allowed to move to the lower version unless the Ceph components are already running inconsistent versions. The relevant code is as follows:

    if numberOfCephVersions > 1 {
        // let's return immediately
        logger.Warningf("it looks like we have more than one ceph version running. triggering upgrade. %+v:", runningVersions.Overall)
        return true, nil
    }

    ......

    if cephver.IsInferior(imageSpecVersion, clusterRunningVersion) {
        return true, errors.Errorf("image spec version %s is lower than the running cluster version %s, downgrading is not supported", imageSpecVersion.String(), clusterRunningVersion.String())
    }

@sp98 (Contributor) commented Sep 16, 2021

Able to reproduce this issue.

  versions:
      mds:
        ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable): 2
      mgr:
        ceph version 16.2.4 (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable): 1
      mon:
        ceph version 16.2.4 (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable): 3
      osd:
        ceph version 16.2.4 (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable): 1
      overall:
        ceph version 16.2.4 (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable): 5
        ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable): 2

We discussed this issue in the huddle. The general consensus is to allow users to downgrade for scenarios where an upgrade could leave the cluster in a broken state. See the comment above.

Rook (and also Ceph) does not support downgrades.

Waiting for a major refactoring PR. I'll test this issue again after that PR is merged.

@github-actions (bot) commented:

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

@travisn (Member) commented Nov 15, 2021

This was actually fixed by #9098 in v1.7.7

travisn closed this as completed Nov 15, 2021