
Ceph v16.2.x Upgrade Warning #9185

Closed
galexrt opened this issue Nov 16, 2021 · 7 comments

@galexrt
Member

galexrt commented Nov 16, 2021

Ceph has published a warning about upgrading from an "older version" to (any?) Pacific release; see https://docs.ceph.com/en/latest/releases/pacific/#v16-2-6-pacific.

[Screenshot of the upgrade warning from the Ceph v16.2.6 release notes]

Questions coming to my mind:

  • Rook Ceph Operator Questions
    • Does it even affect Rook Ceph users?
    • Should the operator set bluestore_fsck_quick_fix_on_mount to false in running Ceph clusters?
  • General questions for Ceph
    • When is it safe to upgrade to Ceph v16.2.x?
    • Is it safe to stay on Ceph v16.2.x until a fix is released in v16.2.7?

Anything else that should be answered regarding this "warning" from Ceph?

galexrt added the bug label Nov 16, 2021
@obnoxxx
Contributor

obnoxxx commented Nov 16, 2021

Briefly checking the ceph code and git log, this seems to be the situation:

  • 15.0 and older don't have the bluestore_fsck_quick_fix_on_mount option.
  • 15.1 introduced it with the default set to true.
  • 15.2, 16.0, and 16.1 all have it set to true.
  • 16.2 changed the default to false.

So since Rook does not set bluestore_fsck_quick_fix_on_mount, it does not currently seem to be safe to upgrade to Ceph 16.2 from a 15.1 or older release.
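
For anyone who wants to confirm the default that their own release ships, the cluster can report it directly. This is just a quick sketch, run from the Rook toolbox pod:

# Shows the option's description, type, and built-in default for the Ceph
# release the cluster is running:
$ ceph config help bluestore_fsck_quick_fix_on_mount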

@BlaineEXE
Member

BlaineEXE commented Nov 16, 2021

OMAP is used by Ceph under the hood for CephObjectStores and CephFilesystems. If users only use CephBlockPools (RBD), I don't see why they would need to worry about the issue.
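
As a quick way to check which category a cluster falls into, users can list whether any of those resource types exist. This is only a sketch and assumes Rook's default rook-ceph namespace:

# List any CephObjectStore and CephFilesystem resources managed by Rook;
# if both commands return nothing, only block storage is defined through Rook:
$ kubectl -n rook-ceph get cephobjectstores.ceph.rook.io
$ kubectl -n rook-ceph get cephfilesystems.ceph.rook.io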

I also don't see why users shouldn't feel safe to update if they are using the default value or if they set the value to false on their own. Unless there is an unknown way to trigger the bug, of course.

Obviously, if users are cautious, they should wait to upgrade, period. If they are eager, I think they can proceed after verifying that the bluestore_fsck_quick_fix_on_mount option is false (more on that below).


A user can find out whether bluestore_fsck_quick_fix_on_mount has been set for their cluster by running ceph config dump. In the example below, I have set it to false globally and also set it to true for the osd section to show how it can be set in multiple places.

$ ceph config dump
WHO        MASK  LEVEL     OPTION                               VALUE       RO
global           dev       bluestore_fsck_quick_fix_on_mount    false            # <-- FALSE
global           basic     log_to_file                          false
global           advanced  mon_allow_pool_delete                true
global           advanced  mon_allow_pool_size_one              true
global           advanced  mon_cluster_log_file
global           advanced  osd_scrub_auto_repair                true
global           advanced  rbd_default_features                 3
  mgr            advanced  mgr/balancer/active                  true
  mgr            advanced  mgr/balancer/mode                    upmap
  mgr            advanced  mgr/pg_autoscaler/autoscale_profile  scale-down
    mgr.a        advanced  mgr/dashboard/server_port            7000        *
    mgr.a        advanced  mgr/dashboard/ssl                    false       *
  osd            dev       bluestore_fsck_quick_fix_on_mount    true             # <-- TRUE

To set both of these to false, I would run the commands below.

ceph config set global bluestore_fsck_quick_fix_on_mount false
ceph config set osd bluestore_fsck_quick_fix_on_mount false

If bluestore_fsck_quick_fix_on_mount is not present in ceph config dump, then it will take on the default value false during the upgrade, and no user action is needed.
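
To double-check that before upgrading, a couple of quick commands from the toolbox pod are enough. This is only a sketch; osd.0 below is just an example daemon id:

# No output from the grep means the option is not set anywhere in the mon
# config database, so the Pacific default (false) will apply:
$ ceph config dump | grep bluestore_fsck_quick_fix_on_mount
# Or query the value the config database reports for one OSD:
$ ceph config get osd.0 bluestore_fsck_quick_fix_on_mount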

BlaineEXE added a commit to BlaineEXE/rook that referenced this issue Nov 16, 2021
Ceph has recently reported that it may be unsafe to upgrade clusters
from Nautilus/Octopus to Pacific v16.2.0 through v16.2.6. We are
tracking this in Rook issue rook#9185.
Add a warning to the upgrade doc about this.

Signed-off-by: Blaine Gardner <blaine.gardner@redhat.com>
@BlaineEXE
Member

BlaineEXE commented Nov 16, 2021

Users can also use ceph config-key dump and ceph config-key rm to list Ceph's config keys and remove ones related to bluestore_fsck_quick_fix_on_mount rather than setting the values to false.

$ ceph config-key dump
{
    "config-history/1/": "<<< binary blob of length 12 >>>",
    "config-history/10/": "<<< binary blob of length 12 >>>",
    "config-history/10/+mgr.a/mgr/dashboard/server_port": "7000",
    "config-history/11/": "<<< binary blob of length 12 >>>",
    "config-history/11/+global/mon_allow_pool_size_one": "true",
    "config-history/12/": "<<< binary blob of length 12 >>>",
    "config-history/12/+global/bluestore_fsck_quick_fix_on_mount": "false",
    "config-history/13/": "<<< binary blob of length 12 >>>",
    "config-history/13/+osd/bluestore_fsck_quick_fix_on_mount": "true",
    "config-history/2/": "<<< binary blob of length 12 >>>",
    "config-history/2/+global/mon_allow_pool_delete": "true",
    "config-history/3/": "<<< binary blob of length 12 >>>",
    "config-history/3/+global/mon_cluster_log_file": "",
    "config-history/4/": "<<< binary blob of length 12 >>>",
    "config-history/4/+global/osd_scrub_auto_repair": "true",
    "config-history/5/": "<<< binary blob of length 12 >>>",
    "config-history/5/+global/log_to_file": "false",
    "config-history/6/": "<<< binary blob of length 12 >>>",
    "config-history/6/+global/rbd_default_features": "3",
    "config-history/7/": "<<< binary blob of length 12 >>>",
    "config-history/7/+mgr/mgr/balancer/mode": "upmap",
    "config-history/8/": "<<< binary blob of length 12 >>>",
    "config-history/8/+mgr/mgr/balancer/active": "true",
    "config-history/9/": "<<< binary blob of length 12 >>>",
    "config-history/9/+mgr.a/mgr/dashboard/ssl": "false",
    "config/global/bluestore_fsck_quick_fix_on_mount": "false",       # <-- FALSE
    "config/global/log_to_file": "false",
    "config/global/mon_allow_pool_delete": "true",
    "config/global/mon_allow_pool_size_one": "true",
    "config/global/mon_cluster_log_file": "",
    "config/global/osd_scrub_auto_repair": "true",
    "config/global/rbd_default_features": "3",
    "config/mgr.a/mgr/dashboard/server_port": "7000",
    "config/mgr.a/mgr/dashboard/ssl": "false",
    "config/mgr/mgr/balancer/active": "true",
    "config/mgr/mgr/balancer/mode": "upmap",
    "config/mgr/mgr/pg_autoscaler/autoscale_profile": "scale-down",
    "config/osd/bluestore_fsck_quick_fix_on_mount": "true",           # <-- TRUE
    "mgr/dashboard/accessdb_v2": "{\"users\": {\"admin\": {\"username\": \"admin\", \"password\": \"$2b$12$eOFEk0ZsWlum9mt7hRiM9u31IsoKbNikUgf9oB4rjsHjZjRaTrVI2\", \"roles\": [\"administrator\"], \"name\": null, \"email\": null, \"lastUpdate\": 1637087695, \"enabled\": true, \"pwdExpirationDate\": null, \"pwdUpdateRequired\": false}}, \"roles\": {}, \"version\": 2}",
    "mgr/devicehealth/last_scrape": "20211116-183450",
    "mgr/progress/completed": "{\"events\": [{\"id\": \"a49c8915-5b2a-4289-acd5-22c4a0fa6092\", \"message\": \"Global Recovery Event\", \"refs\": [[\"global\", \"\"]], \"started_at\": 1637096112.1040528, \"finished_at\": 1637096127.1076558, \"add_to_ceph_s:\": true}, {\"id\": \"86b2dc2e-98d2-43eb-8951-21ba80557746\", \"message\": \"PG autoscaler increasing pool 1 PGs from 1 to 128\", \"refs\": [[\"pool\", 1]], \"started_at\": 1637096109.9868796, \"finished_at\": 1637096170.996654, \"add_to_ceph_s:\": false}, {\"id\": \"2533b085-83a4-42cc-a9e4-4c750a2eb941\", \"message\": \"PG autoscaler increasing pool 2 PGs from 32 to 128\", \"refs\": [[\"pool\", 2]], \"started_at\": 1637096110.9930346, \"finished_at\": 1637096170.9967546, \"add_to_ceph_s:\": false}], \"version\": 2, \"compat_version\": 2}",
    "mgr/telemetry/report_id": "453c5f51-5731-4d07-b746-d7d354d9ea1a",
    "mgr/telemetry/salt": "f9968219-b512-4ba7-a75f-081a973cf730"
}
[rook@rook-ceph-tools-555c879675-4nxsc /]$ ceph config-key rm config/global/bluestore_fsck_quick_fix_on_mount
key deleted
[rook@rook-ceph-tools-555c879675-4nxsc /]$ ceph config-key rm config/osd/bluestore_fsck_quick_fix_on_mount
key deleted

@BlaineEXE
Member

Also, users should ensure that no references to bluestore_fsck_quick_fix_on_mount are present in the rook-config-override ConfigMap, and remove them if they exist.
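
For example, something like the following works; it's a sketch that assumes Rook's default rook-ceph namespace:

# Check whether the override ConfigMap mentions the option:
$ kubectl -n rook-ceph get configmap rook-config-override -o yaml | grep bluestore_fsck_quick_fix_on_mount
# If it does, edit the ConfigMap and delete that line:
$ kubectl -n rook-ceph edit configmap rook-config-override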

@leseb
Member

leseb commented Nov 22, 2021

Why not refuse the upgrade if Rook sees this option enabled? Rook can detect the current version and the newly applied one. Since users risk data loss, why not block the upgrade in Rook?

Is this closed in #9187 or is more work planned?

@travisn
Member

travisn commented Nov 22, 2021

Since it's so rare that this option would remain enabled during a Pacific upgrade, my assumption is that documentation would be sufficient. Since 16.2.7 is almost out as well, and people tend to update to the latest dot release anyway, it didn't seem worth a code fix like that to prevent the upgrade. But I'm open to that check as well if someone wants to implement it.

@BlaineEXE
Member

I think we can consider this closed by #9187. I'll close this now, and we can reopen if people think it's useful to keep open or if there are more changes needed.
