squid: rgw/notification: start/stop endpoint managers in notification manager #57470

Merged: 7 commits merged into ceph:squid on May 30, 2024

@yuvalif yuvalif commented May 15, 2024

backport tracker: https://tracker.ceph.com/issues/65996


backport of #56979
parent tracker: https://tracker.ceph.com/issues/65337

this backport was staged using ceph-backport.sh version 16.0.0.6848
find the latest version at https://github.com/ceph/ceph/blob/main/src/script/ceph-backport.sh

@yuvalif yuvalif requested a review from a team as a code owner May 15, 2024 04:31
@yuvalif yuvalif added this to the squid milestone May 15, 2024
@yuvalif yuvalif added the rgw label May 15, 2024
@github-actions github-actions bot added the tests label May 15, 2024
Signed-off-by: Yuval Lifshitz <ylifshit@ibm.com>
(cherry picked from commit cb9a09b)
Fixes: https://tracker.ceph.com/issues/65337

Signed-off-by: Yuval Lifshitz <ylifshit@ibm.com>
(cherry picked from commit 15536cf)

Conflicts:
	src/rgw/rgw_kafka.cc
* tests were passing only because they were not performing their asserts
* tests are now separated with their own attribute
* their topics are now marked "persistent" to work around the issue in:
  https://tracker.ceph.com/issues/65645

Signed-off-by: Yuval Lifshitz <ylifshit@ibm.com>
(cherry picked from commit 70e5af8)
for example. job: 7697397
in test: yuvalif-2024-05-08_09:55:02-rgw:notifications-wip-yuval-65337-distro-default-smithi

also reduce the size of the error log by sending fewer objects to the
test_ps_s3_persistent_topic_stats test

Signed-off-by: Yuval Lifshitz <ylifshit@ibm.com>
(cherry picked from commit 184b9be)
in tests that require retries

Signed-off-by: Yuval Lifshitz <ylifshit@ibm.com>
(cherry picked from commit 1f509da)
fail the test if not, to indicate that this is a test issue
and not a product bug

Signed-off-by: Yuval Lifshitz <ylifshit@ibm.com>
(cherry picked from commit 9d56bbe)

yuvalif commented May 15, 2024

jenkins test api


@cbodley cbodley left a comment

thanks!


cbodley commented May 17, 2024

the rerun showed two test failures and got stuck: https://qa-proxy.ceph.com/teuthology/cbodley-2024-05-16_14:27:17-rgw-wip-65996-squid-distro-default-smithi/7709243/teuthology.log

FAIL: test persistent topic stats
FAIL: test that when object is deleted due to lifecycle policy, notification is sent on master


yuvalif commented May 20, 2024

> the rerun showed two test failures and got stuck: https://qa-proxy.ceph.com/teuthology/cbodley-2024-05-16_14:27:17-rgw-wip-65996-squid-distro-default-smithi/7709243/teuthology.log
>
> FAIL: test persistent topic stats
> FAIL: test that when object is deleted due to lifecycle policy, notification is sent on master

to prevent the test from getting stuck, we need the fix in #57256 backported. i will piggyback it onto this PR.

as for the actual issue: it is probably a problem with our http client implementation, or some environmental issue on the test machine.
from the rgw log:

1287606-2024-05-16T19:59:46.597+0000 7f90c8495640 20 ERROR: msg->data.result=7 req_data->id=116426 http_status=0
1287607-2024-05-16T19:59:46.597+0000 7f90c8495640 20 ERROR: curl error: Couldn't connect to server req_data->error_buf=Failed to connect to localhost port 10930 after 0 ms: Connection refused
1287608-2024-05-16T19:59:46.597+0000 7f90c8495640 20 ERROR: msg->data.result=7 req_data->id=116427 http_status=0
1287609-2024-05-16T19:59:46.597+0000 7f90c8495640 20 ERROR: curl error: Couldn't connect to server req_data->error_buf=Failed to connect to localhost port 10930 after 0 ms: Connection refused
1287610-2024-05-16T19:59:46.597+0000 7f90c7c94640  5 rgw notify: WARNING: push entry: 0/27265 to endpoint: http://localhost:10930 failed. error: -2200 (will retry)
1287611:2024-05-16T19:59:46.597+0000 7f90c7c94640 20 rgw notify: INFO: new end marker for removal: 0/24576 from: :wtbyxm-11_topic
1287612:2024-05-16T19:59:46.597+0000 7f90c7c94640 20 rgw notify: INFO: processing of entry: 0/27265 (6/20) from: :wtbyxm-11_topic failed

however, the test has checks to verify that the http server is up and running, and even verifies that a dummy POST request is sent to, and received by, the server.
from the test log:

2024-05-16T20:21:02.281 INFO:teuthology.orchestra.run.smithi080.stderr:bucket_notification.test_bn: INFO: http server created on ('localhost', 10930)
2024-05-16T20:21:02.281 INFO:teuthology.orchestra.run.smithi080.stderr:bucket_notification.test_bn: INFO: http server started on ('localhost', 10930)
2024-05-16T20:21:02.281 INFO:teuthology.orchestra.run.smithi080.stderr:urllib3.connectionpool: DEBUG: Starting new HTTP connection (1): localhost:10930
2024-05-16T20:21:02.281 INFO:teuthology.orchestra.run.smithi080.stderr:bucket_notification.test_bn: INFO: HTTP Server received iempty event
2024-05-16T20:21:02.282 INFO:teuthology.orchestra.run.smithi080.stderr:urllib3.connectionpool: DEBUG: http://localhost:10930 "POST / HTTP/1.1" 200 None
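
For readers unfamiliar with this kind of check, here is a minimal sketch of "start an http server, then verify it with a dummy POST" using only the Python standard library. It is not the code from the bucket notification test: `RecordingHandler`, `start_server` and `verify_server_is_up` are hypothetical names, and binding to port 0 (letting the OS choose a free port) is an assumption made to keep the example self-contained.

```python
import http.client
import http.server
import threading


class RecordingHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical handler that records the body of every POST it receives."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        # Record the body before answering, so a client that got a 200
        # can rely on the event already being stored.
        self.server.received.append(self.rfile.read(length))
        self.send_response(200)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, fmt, *args):
        pass  # keep the example quiet


def start_server(host="localhost", port=0):
    """Start the server in a background thread (port 0: the OS picks a free port)."""
    server = http.server.ThreadingHTTPServer((host, port), RecordingHandler)
    server.received = []
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def verify_server_is_up(server, timeout=5):
    """Send an empty POST and confirm the server answered 200 and recorded it."""
    host, port = server.server_address[:2]
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("POST", "/", body=b"")
        response = conn.getresponse()
        response.read()
        status = response.status
    finally:
        conn.close()
    return status == 200 and len(server.received) >= 1
```

A test built this way can fail early with a clear "test issue" message when `verify_server_is_up` returns `False`, instead of later reporting missing notifications.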

this is probably the same issue as: https://tracker.ceph.com/issues/66033


cbodley commented May 20, 2024

thanks @yuvalif. the timestamps of those log entries don't match up, as the curl error happened over a minute before the http server started. i guess the latter was from the next test case?
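
The gap can be checked directly from the two excerpts. The sketch below takes the timestamp of the first curl error in the rgw log and the timestamp of the "http server started" line in the teuthology log, and assumes both are in UTC (the teuthology line carries no offset, so that part is an assumption). The variable names are only for illustration.

```python
from datetime import datetime, timezone

# Timestamp of the first "curl error" line in the rgw log (offset included).
curl_error = datetime.strptime(
    "2024-05-16T19:59:46.597+0000", "%Y-%m-%dT%H:%M:%S.%f%z"
)

# Timestamp of the "http server started" line in the teuthology log.
# No offset is printed there; UTC is assumed.
server_started = datetime.strptime(
    "2024-05-16T20:21:02.281", "%Y-%m-%dT%H:%M:%S.%f"
).replace(tzinfo=timezone.utc)

gap_seconds = (server_started - curl_error).total_seconds()
print(f"server start is logged {gap_seconds:.3f}s after the curl error")
```

Under that assumption the two lines are roughly 21 minutes apart, so the excerpts cannot be describing the same moment as logged, whatever the explanation turns out to be.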


@cbodley cbodley left a comment

waiting on inclusion of #57256


yuvalif commented May 20, 2024

> thanks @yuvalif. the timestamps of those log entries don't match up, as the curl error happened over a minute before the http server started. i guess the latter was from the next test case?

That does not make sense. The http server port matches the one in the failed test. As well as the topic name.
I use a different queue name and server port in each test.

this is a regression from: 673adcb

Signed-off-by: Yuval Lifshitz <ylifshit@ibm.com>
(cherry picked from commit 3d473df)

yuvalif commented May 20, 2024

> waiting on inclusion of #57256

done


cbodley commented May 20, 2024

> this is probably the same issue as: https://tracker.ceph.com/issues/66033

i don't think that tracker issue is due to a race with the http server's startup, because the test fails when looking for the deletion events after it's already verified that it saw the creation events


yuvalif commented May 21, 2024

> > this is probably the same issue as: https://tracker.ceph.com/issues/66033
>
> i don't think that tracker issue is due to a race with the http server's startup, because the test fails when looking for the deletion events after it's already verified that it saw the creation events

right, there is no test issue here. this commit: bbba46d
validates that, and fails the test explicitly on a test issue if there is one.
the problem is possibly in the way that our http client code recovers from failures (these failures happen after a deliberate failure in sending to the http server).
once #57550 is merged, it will be easier to separate notification bugs from http-specific bugs


@cbodley cbodley left a comment

thanks, approved but labeled DNM until we cut the squid rc

@cbodley cbodley merged commit 7a17a87 into ceph:squid May 30, 2024
10 checks passed