
osd: check if osd is safe-to-destroy before removal #9230

Merged
merged 1 commit into rook:master on Nov 25, 2021

Conversation

@leseb (Member) commented Nov 23, 2021

Description of your changes:

If multiple removal jobs are fired in parallel, there is a risk of
losing data since we would forcefully remove the OSD. So now we check
whether the OSD is safe-to-destroy first and then proceed. The code
waits forever and retries every minute unless the --force-osd-removal
flag is passed.

Signed-off-by: Sébastien Han seb@redhat.com

Which issue is resolved by this Pull Request:
Resolves #

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Skip Tests for Docs: Add the flag for skipping the build if this is only a documentation change. See here for the flag.
  • Skip Unrelated Tests: Add a flag to run tests for a specific storage provider. See test options.
  • Reviewed the developer guide on Submitting a Pull Request
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.
  • Pending release notes updated with breaking and/or notable changes, if necessary.
  • Upgrade from previous release is tested and upgrade user guide is updated, if necessary.
  • Code generation (make codegen) has been run to update object specifications, if necessary.
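
For orientation, the flow described above could look roughly like the following Go sketch. It is illustrative only: the real logic lives in pkg/daemon/ceph/osd/remove.go and goes through Rook's ceph client, and the helper names here (waitUntilSafeToDestroy, osdSafeToDestroy) are hypothetical; the one-minute retry interval and the force bypass come straight from the description.

package main

import (
	"log"
	"time"
)

// osdSafeToDestroy is a placeholder for running `ceph osd safe-to-destroy <id>`
// (a sketch of that call appears after the commit message further below).
func osdSafeToDestroy(osdID int) (bool, error) {
	return false, nil
}

// waitUntilSafeToDestroy blocks until osd.<id> reports safe-to-destroy, retrying
// every minute. When forceRemoval is set, it proceeds after the first check.
func waitUntilSafeToDestroy(osdID int, forceRemoval bool) {
	for {
		safe, err := osdSafeToDestroy(osdID)
		if err != nil {
			log.Printf("failed to check osd.%d: %v", osdID, err)
		}
		if safe {
			return
		}
		if forceRemoval {
			log.Printf("osd.%d is not safe to destroy, but force removal is enabled so proceeding", osdID)
			return
		}
		log.Printf("osd.%d is not safe to destroy yet, retrying in 1m", osdID)
		time.Sleep(time.Minute)
	}
}

func main() {
	// With the force flag, removal proceeds even though the stub reports "not safe".
	waitUntilSafeToDestroy(1, true)
}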

Review threads: pkg/daemon/ceph/osd/remove.go, .github/workflows/canary-integration-test.yml, cmd/rook/ceph/osd.go
@leseb force-pushed the osd-rm-check branch 2 times, most recently from 76f516b to 0051fc0 on November 24, 2021 15:00
@leseb requested a review from travisn on November 24, 2021 15:03
@travisn (Member) left a comment

Just a few nits, if you feel like changing them

Review threads: pkg/daemon/ceph/osd/remove.go
@leseb force-pushed the osd-rm-check branch 7 times, most recently from 62fc07a to 0b3ec33 on November 25, 2021 09:00
@leseb (Member, Author) commented Nov 25, 2021

Example run:

2021-11-25 09:13:03.096304 I | rookcmd: starting Rook v1.7.0-alpha.0.666.g7b16871b9 with arguments '/usr/local/bin/rook ceph osd remove --preserve-pvc true --force-osd-removal true --osd-ids 1'
2021-11-25 09:13:03.096585 I | rookcmd: flag values: --force-osd-removal=true, --help=false, --log-level=DEBUG, --operator-image=, --osd-ids=1, --preserve-pvc=true, --service-account=
2021-11-25 09:13:03.096596 I | op-mon: parsing mon endpoints: a=10.102.53.134:6789
2021-11-25 09:13:03.120734 I | cephclient: writing config file /var/lib/rook/rook-ceph/rook-ceph.config
2021-11-25 09:13:03.120847 I | cephclient: generated admin config in /var/lib/rook/rook-ceph
2021-11-25 09:13:03.120911 D | cephclient: config file @ /etc/ceph/ceph.conf: [global]
fsid                           = 2c3e8b79-2e20-42b2-b403-6d3f897b6052
mon initial members            = a
mon host                       = [v2:10.102.53.134:3300,v1:10.102.53.134:6789]
osd_pool_default_size          = 1
mon_warn_on_pool_no_redundancy = false
bdev_flock_retry               = 20
bluefs_buffered_io             = false

[client.admin]
keyring = /var/lib/rook/rook-ceph/client.admin.keyring

2021-11-25 09:13:03.120936 D | exec: Running command: ceph osd dump --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --name=client.admin --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json
2021-11-25 09:13:03.533555 I | cephosd: validating status of osd.1
2021-11-25 09:13:03.533582 I | cephosd: osd.1 is marked 'DOWN'
2021-11-25 09:13:03.533598 D | exec: Running command: ceph osd safe-to-destroy 1 --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --name=client.admin --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json
2021-11-25 09:13:03.867580 I | cephosd: osd.1 is NOT be ok to destroy but force removal is enabled so proceeding with removal
2021-11-25 09:13:03.867618 D | exec: Running command: ceph osd find 1 --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --name=client.admin --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json
2021-11-25 09:13:04.259405 I | cephosd: marking osd.1 out
2021-11-25 09:13:04.259436 D | exec: Running command: ceph osd out osd.1 --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --name=client.admin --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json
2021-11-25 09:13:05.448672 E | cephosd: failed to fetch the deployment "rook-ceph-osd-1". deployments.apps "rook-ceph-osd-1" not found
2021-11-25 09:13:05.448868 I | cephosd: destroying osd.1
2021-11-25 09:13:05.448994 D | exec: Running command: ceph osd destroy osd.1 --yes-i-really-mean-it --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --name=client.admin --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json
2021-11-25 09:13:05.791449 I | cephosd: removing osd.1 from ceph
2021-11-25 09:13:05.791488 D | exec: Running command: ceph osd crush rm fv-az120-714 --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --name=client.admin --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json
2021-11-25 09:13:06.290261 E | cephosd: failed to remove CRUSH host "fv-az120-714". exit status 39
2021-11-25 09:13:06.290293 D | exec: Running command: ceph crash ls --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --name=client.admin --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json
2021-11-25 09:13:06.691913 I | cephosd: no ceph crash to silence
2021-11-25 09:13:06.691937 I | cephosd: completed removal of OSD 1
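
The log above walks through the whole sequence: validate the OSD status, run the safe-to-destroy check, mark the OSD out, try to delete its deployment, destroy it, remove the CRUSH host, and look for crash reports to silence. As a rough illustration only (not Rook's actual code), the ceph side of that sequence could be sketched as below; runCeph and removeOSD are hypothetical helpers and the Kubernetes deployment cleanup step is omitted.

package main

import (
	"fmt"
	"os/exec"
)

// runCeph is a hypothetical helper that shells out to the ceph CLI; the real code
// goes through Rook's cephclient and adds the --conf/--keyring flags shown in the log.
func runCeph(args ...string) {
	if out, err := exec.Command("ceph", args...).CombinedOutput(); err != nil {
		fmt.Printf("ceph %v failed: %v (%s)\n", args, err, string(out))
	}
}

// removeOSD mirrors the command sequence visible in the log above.
func removeOSD(osdID int, crushHost string) {
	osdName := fmt.Sprintf("osd.%d", osdID)

	// Mark the OSD out so data stops being mapped to it.
	runCeph("osd", "out", osdName)

	// Destroy the OSD (ceph keeps the id marked as destroyed so it can be reused).
	runCeph("osd", "destroy", osdName, "--yes-i-really-mean-it")

	// Try to remove the CRUSH host entry; in the log this fails with exit status 39,
	// presumably because the host bucket is not yet empty.
	runCeph("osd", "crush", "rm", crushHost)

	// List crash reports so any stale entries can be silenced.
	runCeph("crash", "ls")
}

func main() {
	removeOSD(1, "fv-az120-714")
}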

If multiple removal jobs are fired in parallel, there is a risk of
losing data since we would forcefully remove the OSD. The same is true
for a single OSD: if it is not safe to destroy, there is a risk of
data loss.

So now, we check if the OSD is safe-to-destroy first and then proceed.
The code waits forever and retries every minute unless the
--force-osd-removal flag is passed.

Signed-off-by: Sébastien Han <seb@redhat.com>
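
The commit message says the code waits forever, retrying every minute, unless --force-osd-removal is passed. The check it relies on maps onto the `ceph osd safe-to-destroy 1 ... --format json` command visible in the log above; a hedged sketch of such a check follows. The function name is a placeholder, this is not Rook's cephclient code, and treating every non-zero exit status as "not safe" is a simplification.

package main

import (
	"fmt"
	"os/exec"
	"strconv"
)

// osdSafeToDestroy runs `ceph osd safe-to-destroy <id>`; exit status 0 is taken
// to mean the OSD is safe to destroy.
func osdSafeToDestroy(osdID int) (bool, error) {
	out, err := exec.Command("ceph", "osd", "safe-to-destroy",
		strconv.Itoa(osdID), "--format", "json").CombinedOutput()
	if err != nil {
		if _, ok := err.(*exec.ExitError); ok {
			// ceph reports "not safe" (or another refusal) through a non-zero exit status.
			return false, nil
		}
		// The command could not be run at all, e.g. the ceph binary is missing.
		return false, fmt.Errorf("failed to run ceph osd safe-to-destroy: %w (output: %q)", err, string(out))
	}
	return true, nil
}

func main() {
	safe, err := osdSafeToDestroy(1)
	fmt.Println("safe:", safe, "err:", err)
}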
@leseb changed the title from "osd: check if osd is ok-to-stop before removal" to "osd: check if osd is safe-to-destroy before removal" on Nov 25, 2021
@leseb merged commit 7a223ad into rook:master on Nov 25, 2021
@leseb deleted the osd-rm-check branch on November 25, 2021 09:55
leseb added a commit that referenced this pull request Nov 29, 2021
osd: check if osd is safe-to-destroy before removal (backport #9230)