RGW bucket health check fails with SSL hostname error #8663

Closed
logan2211 opened this issue Sep 8, 2021 · 9 comments · Fixed by #8712

@logan2211

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior:

Bucket health check fails with:

rook-ceph-operator-8f7d85c4c-v9gqq rook-ceph-operator 2021-09-08 19:21:40.049609 E | ceph-object-controller: failed to delete object in bucket. RequestError: send request failed
rook-ceph-operator-8f7d85c4c-v9gqq rook-ceph-operator caused by: Delete "https://rook-ceph-rgw-objectstore.rook-ceph.svc:443/rook-ceph-bucket-checker-3207cc24-9382-4ec2-9c7f-6a8540773a39/rookHealthCheckTestObject": x509: certificate is valid for objects.dfw1.cloud.domain.net, www.objects.dfw1.cloud.domain.net, not rook-ceph-rgw-objectstore.rook-ceph.svc

This seems to be a follow-on issue from the other bugs related to this feature that were discussed in #7288. Now that Rook has been fixed to use the correct port for SSL-enabled RGW in #7331, this looks like the next issue that SSL-enabled RGW deployments will hit with this check.
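The error in the log above is Go's standard hostname check rejecting a certificate whose names do not include the host in the request URL. A minimal sketch of that check in isolation, assuming a throwaway self-signed certificate in place of the real RGW certificate (`newSelfSignedCert` is a hypothetical helper, not Rook code):

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// newSelfSignedCert returns a throwaway self-signed certificate whose
// Subject Alternative Names are exactly dnsNames. Hypothetical helper for
// this illustration only.
func newSelfSignedCert(dnsNames []string) (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "rgw-demo"},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(time.Hour),
		DNSNames:     dnsNames,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

func main() {
	// Same names as the certificate in the report: external names only.
	cert, err := newSelfSignedCert([]string{
		"objects.dfw1.cloud.domain.net",
		"www.objects.dfw1.cloud.domain.net",
	})
	if err != nil {
		panic(err)
	}
	for _, host := range []string{
		"objects.dfw1.cloud.domain.net",           // external name, listed in the cert
		"rook-ceph-rgw-objectstore.rook-ceph.svc", // in-cluster name, not listed
	} {
		// VerifyHostname returns nil on a match and an error otherwise.
		fmt.Printf("%s: %v\n", host, cert.VerifyHostname(host))
	}
}
```

The second lookup should fail with an error of the same shape as the one in the operator log, since the in-cluster service name is not among the certificate's names.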

Expected behavior:
Bucket test success. It seems like there should be one or two more configuration switches for this feature:

  1. Allow enabling insecure SSL for the bucket test, which would finish the test despite the cert name mismatch. If using the k8s .svc endpoint, maybe this should be enabled by default.
  2. Allow configuring the hostname to be used for the bucket test

How to reproduce it (minimal and precise):

Create a Rook cluster + SSL enabled object store, make sure the cert does not name the k8s .svc endpoint in its hostnames, and watch the bucket test fail.

Environment:

  • OS (e.g. from /etc/os-release): Ubuntu Bionic
  • Rook version (use rook version inside of a Rook Pod): v1.5.12
  • Storage backend version (e.g. for ceph do ceph -v): ceph version 15.2.11 (e3523634d9c2227df9af89a4eac33d16738c49cb) octopus (stable)
  • Kubernetes version (use kubectl version): v1.17
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): kubeadm
@thotz
Contributor

thotz commented Sep 9, 2021

@logan2211: you are suggesting using an insecure path for the bucket health check. Please note that the bucket health checker is itself an S3 client for the Ceph object store (RGW) that lives in the Rook codebase, so it is expected to fail in this case. IMO the right approach is to fix the test failure.
@leseb : any thoughts?

@leseb
Member

leseb commented Sep 9, 2021

Agreed with @thotz, the healthcheck is really the internal representation of "is the endpoint alive and can I interact with it?". I guess I have a question @logan2211, is this gateway only accessible externally? That's why you "don't care" about the internal check? Thanks!

@logan2211
Author

Sorry for any confusion. The RGW is internal to the Rook cluster and all resources are managed by Rook, but the RGW APIs are consumed by external clients via a LoadBalancer service. I'm interested in Rook's internal representation of whether the RGW instances that it creates and manages are alive and functioning well, and I don't think whether the clients consuming the storage are internal or external has any bearing on the utility of this health check.

Mainly the issue with the health check is the configuration is not yet flexible enough to accommodate situations where the RGW SSL cert does not include the internal cluster DNS endpoint (requiring that the health check support a custom URL, insecure flag, or both), or where the RGW might be using a self-signed cert, in which case it would be useful to have an insecure flag.

In either of the situations I laid out, the health check reports failures despite a well-functioning object store, and currently the only way to fix it is to disable the check.

@leseb
Member

leseb commented Sep 14, 2021

@logan2211 thanks for the clarification, I think it makes sense. I concur that having to accommodate the signed certs with the internal cluster DNS endpoint can be an issue. Is it not feasible on your end?

A simple fix would be to use insecure TLS by not validating the certificate when performing the internal check. IMO it's not a security concern, connections are still encrypted and the check is local to the cluster. At least, this connectivity check's goal remains untouched.

@leseb leseb added the beginner label Sep 14, 2021
@leseb leseb added this to To do in v1.7 via automation Sep 14, 2021
leseb added a commit to leseb/rook that referenced this issue Sep 14, 2021
We have seen cases where the signed certificate used for the RGW does not
contain the internal DNS endpoint, resulting in the health check to fail
since the certificate is not valid for this domain.
People consuming the gateways by external clients and for specific
domains do not necessarily have the internal DNS configured in the
certificate.
So let's be a bit more flexible and simply ensure a connectivity check
and bypass the certificate validation.

Closes: rook#8663
Signed-off-by: Sébastien Han <seb@redhat.com>
@leseb
Member

leseb commented Sep 14, 2021

I've opened #8712 but would like other maintainers to chime in. @BlaineEXE when you have time PTAL. Thanks!

@logan2211
Author

> @logan2211 thanks for the clarification, I think it makes sense. I concur that having to accommodate the signed certs with the internal cluster DNS endpoint can be an issue. Is it not feasible on your end?

Maybe -- I'm not sure if RGW supports SNI so I could provide both certs.

> A simple fix would be to use insecure TLS by not validating the certificate when performing the internal check. IMO it's not a security concern, connections are still encrypted and the check is local to the cluster. At least, this connectivity check's goal remains untouched.

Yes I think the insecure flag would not be a bad idea. An additional option to set the endpoint URI would be an even more useful flag for me since I could have the added benefit of relying on the health check to fail if the cert expires or some other cert-related issue occurs.

@leseb
Member

leseb commented Sep 15, 2021

> > @logan2211 thanks for the clarification, I think it makes sense. I concur that having to accommodate the signed certs with the internal cluster DNS endpoint can be an issue. Is it not feasible on your end?
>
> Maybe -- I'm not sure if RGW supports SNI so I could provide both certs.
>
> > A simple fix would be to use insecure TLS by not validating the certificate when performing the internal check. IMO it's not a security concern, connections are still encrypted and the check is local to the cluster. At least, this connectivity check's goal remains untouched.
>
> Yes I think the insecure flag would not be a bad idea. An additional option to set the endpoint URI would be an even more useful flag for me since I could have the added benefit of relying on the health check to fail if the cert expires or some other cert-related issue occurs.

I don't think this is something we are looking at supporting at the moment.

@ashangit
Member

Hi,
We are seeing the same issue on our end, also with some certificates that do not include the internal cluster DNS endpoint.
The issue seems to be broader, as it also blocks object store user reconciliation:

2021-09-21 20:41:18.907481 D | ceph-object-store-user-controller: object store user "par-cephs3-backup-s01-preprod/s3-monitor-user" status updated to "ReconcileFailed"
2021-09-21 20:41:18.907532 E | ceph-object-store-user-controller: failed to reconcile failed to create/update object store user "s3-monitor-user": failed to get details from ceph object user "s3-monitor-user": Get "https://rook-ceph-rgw-backup.par-cephs3-backup-s01-preprod.svc:19000/admin/user?display-name=User%20used%20to%20get%20ceph%20admin%20radosgw%20metrics&format=json&max-buckets=1000&uid=s3-monitor-user": x509: certificate is valid for *.domain, domain, not rook-ceph-rgw-backup.par-cephs3-backup-s01-preprod.svc

To mitigate the issue, we will apply this "workaround patch" on our internal Rook operator fork (criteo-forks#6), as we are not able to generate an appropriate certificate (internal limitation).

@BlaineEXE
Member

@ashangit we are moving forward with #8712 to fix this issue. It will take us a few days to square it away. It will be backported to Rook v1.7 and should be part of the next patch release. The criteo fork shouldn't be necessary after that. Thanks for your and everyone's patience.

leseb added a commit to leseb/rook that referenced this issue Sep 27, 2021
We have seen cases where the signed certificate used for the RGW does not
contain the internal DNS endpoint, resulting in the health check to fail
since the certificate is not valid for this domain.
People consuming the gateways by external clients and for specific
domains do not necessarily have the internal DNS configured in the
certificate.
So let's be a bit more flexible and simply ensure a connectivity check
and bypass the certificate validation.

Also, this is fixing the tls code in newS3Agent and adds unit tests.

Closes: rook#8663
Signed-off-by: Sébastien Han <seb@redhat.com>
v1.7 automation moved this from To do to Done Sep 28, 2021
mergify bot pushed a commit that referenced this issue Sep 28, 2021
We have seen cases where the signed certificate used for the RGW does not
contain the internal DNS endpoint, resulting in the health check to fail
since the certificate is not valid for this domain.
People consuming the gateways by external clients and for specific
domains do not necessarily have the internal DNS configured in the
certificate.
So let's be a bit more flexible and simply ensure a connectivity check
and bypass the certificate validation.

Also, this is fixing the tls code in newS3Agent and adds unit tests.

Closes: #8663
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit cda5dad)

# Conflicts:
#	tests/integration/ceph_object_test.go
leseb added a commit that referenced this issue Sep 28, 2021
We have seen cases where the signed certificate used for the RGW does not
contain the internal DNS endpoint, resulting in the health check to fail
since the certificate is not valid for this domain.
People consuming the gateways by external clients and for specific
domains do not necessarily have the internal DNS configured in the
certificate.
So let's be a bit more flexible and simply ensure a connectivity check
and bypass the certificate validation.

Also, this is fixing the tls code in newS3Agent and adds unit tests.

Closes: #8663
Manual cherry-pick from 0ff9fd3
Signed-off-by: Sébastien Han <seb@redhat.com>
leseb added a commit that referenced this issue Sep 28, 2021
We have seen cases where the signed certificate used for the RGW does not
contain the internal DNS endpoint, resulting in the health check to fail
since the certificate is not valid for this domain.
People consuming the gateways by external clients and for specific
domains do not necessarily have the internal DNS configured in the
certificate.
So let's be a bit more flexible and simply ensure a connectivity check
and bypass the certificate validation.

Also, this is fixing the tls code in newS3Agent and adds unit tests.

Closes: #8663
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit cda5dad)
shaas pushed a commit to SUSE/rook that referenced this issue Sep 29, 2021
We have seen cases where the signed certificate used for the RGW does not
contain the internal DNS endpoint, resulting in the health check to fail
since the certificate is not valid for this domain.
People consuming the gateways by external clients and for specific
domains do not necessarily have the internal DNS configured in the
certificate.
So let's be a bit more flexible and simply ensure a connectivity check
and bypass the certificate validation.

Also, this is fixing the tls code in newS3Agent and adds unit tests.

Closes: rook#8663
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit cda5dad)

5 participants