The rgw-multisite-testing canary test is failing frequently #11744

Closed
travisn opened this issue Feb 22, 2023 · 4 comments

travisn commented Feb 22, 2023

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior:
As pointed out in #11741, the rgw-multisite-testing test is failing about half the time in master.

The test fails in the "run RGW multisite test" step:

+ verify_operator_log_message 'there are no changes to commit for RGW configuration period for CephObjectStore "rook-ceph-secondary/zone-b"' rook-ceph
+ local 'message=there are no changes to commit for RGW configuration period for CephObjectStore "rook-ceph-secondary/zone-b"'
+ local namespace=rook-ceph
+ kubectl --namespace rook-ceph logs deployment/rook-ceph-operator
+ grep 'there are no changes to commit for RGW configuration period for CephObjectStore "rook-ceph-secondary/zone-b"'
+ sleep 5
+ [[ 125 -lt 120 ]]
+ echo 'timed out'
timed out
+ return 1
Error: Process completed with exit code 1.
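
The trace above is consistent with a helper that polls the operator log until a message appears or a time budget runs out. A minimal sketch of such a helper follows, reconstructed from the trace only. The actual function in the Rook test scripts may differ, and the TIMEOUT and INTERVAL variables are assumptions added here so the budget can be changed without editing the function.

```shell
#!/usr/bin/env bash
# Sketch reconstructed from the xtrace output; not the actual Rook helper.
# TIMEOUT and INTERVAL are assumed knobs (defaults mirror the trace: 120s, 5s).
verify_operator_log_message() {
  local message="$1"
  local namespace="${2:-rook-ceph}"
  local timeout="${TIMEOUT:-120}"
  local interval="${INTERVAL:-5}"
  local elapsed=0

  while true; do
    # grep -F matches the message literally (it contains quotes and slashes).
    if kubectl --namespace "$namespace" logs deployment/rook-ceph-operator \
        | grep -F -- "$message"; then
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
    # Keep polling while there is budget left; otherwise report the timeout.
    if [[ "$elapsed" -lt "$timeout" ]]; then
      continue
    fi
    echo 'timed out'
    return 1
  done
}
```

With a helper like this, a "timed out" result only says that the message never showed up in the operator log within the budget. It does not say why, so the operator log itself has to be inspected for the underlying error.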

Expected behavior:
Passing tests

@travisn travisn added the bug label Feb 22, 2023
@travisn travisn added this to To do in v1.11 via automation Feb 22, 2023

travisn commented Feb 22, 2023

@BlaineEXE Could you take a look? This looks related to #8911

BlaineEXE commented

Two test failures have messages like this:

2023-02-24 00:03:22.756927 E | ceph-object-zone-controller: failed to reconcile CephObjectZone "rook-ceph-secondary/zone-b". invalid CephObjectZone CR "zone-b": invalid metadata pool spec: failed to get crush map: failed to get crush map. . Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)',): exit status 1

Rook seems unable to create the secondary CephObjectZone CR, but it is able to create the secondary CephObjectStore CR.

This is the failing command:

2023-02-24 00:05:31.374525 D | exec: Running command: ceph osd crush dump --connect-timeout=15 --cluster=rook-ceph-secondary --conf=/var/lib/rook/rook-ceph-secondary/rook-ceph-secondary.config --name=client.admin --keyring=/var/lib/rook/rook-ceph-secondary/client.admin.keyring --format json

I can't find the exact place where ceph osd crush dump is being called to validate the pools for zone-a. It appears that zone-a might succeed when the pools aren't validated, and I'm not sure why that would be. I have added more debugging in #11754 to see if I can narrow down the issue further.
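
The "error calling conf_read_file" part of the message suggests that the file passed via --conf might not have been readable when the command ran. One way to check that is sketched below. check_cluster_files is a hypothetical helper, not part of Rook, and it assumes the directory layout seen in the logged command (/var/lib/rook/<cluster>/<cluster>.config and client.admin.keyring).

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of Rook: reports whether the config and
# keyring files referenced by the failing ceph command are readable.
# The base directory defaults to the path seen in the logged command.
check_cluster_files() {
  local cluster="$1"
  local base="${2:-/var/lib/rook}"
  local dir="$base/$cluster"
  local missing=0
  local f

  for f in "$dir/$cluster.config" "$dir/client.admin.keyring"; do
    if [[ -r "$f" ]]; then
      echo "ok: $f"
    else
      echo "missing: $f"
      missing=1
    fi
  done
  return "$missing"
}
```

Run inside the operator pod, check_cluster_files rook-ceph-secondary would show whether both files exist at that moment. A simpler spot check from outside the pod is kubectl --namespace rook-ceph exec deployment/rook-ceph-operator -- ls -l /var/lib/rook/rook-ceph-secondary/.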

github-actions bot commented

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

@travisn travisn removed this from To do in v1.11 May 5, 2023

github-actions bot commented May 6, 2023

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.

@github-actions github-actions bot closed this as not planned May 6, 2023