Tenant pods fail after certificate renewal: certificate signed by unknown authority #2058

Open
thedulus opened this issue Apr 1, 2024 · 3 comments
@thedulus

thedulus commented Apr 1, 2024

Our MinIO tenant pods fail every time the operator renews the internal certificate:

Error: Post "https://miniotenant-ss-0-1.miniotenant-hl.cld-1225.svc.cluster.local:9000/minio/peer/v37/backgroundhealstatus": tls: failed to verify certificate: x509: certificate signed by unknown authority (*rest.NetworkError)

It appears that MinIO doesn't pick up changes made to the CA files when the certificates are renewed.

In our specific deployment we've combined the requestAutoCert: true setting with an externalCertSecret which is an external cert-manager issued LetsEncrypt certificate that we use to achieve E2E encryption via a passthrough Ingress object. I'm not sure if this contributes to the issue.

When curling the endpoint manually, you can see that it's already serving the renewed certificate (valid from Mar 28 16:51:34 2024 GMT):

$ curl -vik https://miniotenant-ss-0-1.miniotenant-hl.cld-1225.svc.cluster.local:9000/
*   Trying 10.129.4.20...
* TCP_NODELAY set
* Connected to miniotenant-ss-0-1.miniotenant-hl.cld-1225.svc.cluster.local (10.129.4.20) port 9000 (#0)
<...>
* Server certificate:
*  subject: O=system:nodes; CN=system:node:*.miniotenant-hl.cld-1225.svc.cluster.local
*  start date: Mar 28 16:51:34 2024 GMT
*  expire date: Apr 15 23:29:11 2024 GMT
*  issuer: CN=kube-csr-signer_@1710631750
* <...>

which matches the renewed CA within the container:

$ cat /tmp/certs/CAs/hostname-1.crt 
-----BEGIN CERTIFICATE-----
MIIDfDCCAmSgAwIBAgIRAP7oVlZ1NuDmhsaLw4NVwAMwDQYJKoZIhvcNAQELBQAw
JjEkMCIGA1UEAwwba3ViZS1jc3Itc2lnbmVyX0AxNzEwNjMxNzUwMB4XDTI0MDMy
ODE2NTEzNFoXDTI0MDQxNTIzMjkxMVowWTEVMBMGA1UEChMMc3lzdGVtOm5vZGVz
MUAwPgYDVQQDDDdzeXN0ZW06bm9kZToqLm1pbmlvdGVuYW50LWhsLmNsZC0xMjI1
LnN2Yy5jbHVzdGVyLmxvY2FsMFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAEy70p
Zo4d0j7rBJGM0gwKt9oSqaG3M/38a+9RfwUMxb/N1dzGAEpyCyHfvQeQRP4C8wbZ
kXASmau3qH2GW25tLqOCATswggE3MA4GA1UdDwEB/wQEAwIFoDATBgNVHSUEDDAK
BggrBgEFBQcDATAMBgNVHRMBAf8EAjAAMB8GA1UdIwQYMBaAFH807HJskwpoeu60
6qVPC/yTJva3MIHgBgNVHREEgdgwgdWCQm1pbmlvdGVuYW50LXNzLTAtezAuLi4x
fS5taW5pb3RlbmFudC1obC5jbGQtMTIyNS5zdmMuY2x1c3Rlci5sb2NhbIIgbWlu
aW8uY2xkLTEyMjUuc3ZjLmNsdXN0ZXIubG9jYWyCDm1pbmlvLmNsZC0xMjI1ghJt
aW5pby5jbGQtMTIyNS5zdmOCKyoubWluaW90ZW5hbnQtaGwuY2xkLTEyMjUuc3Zj
LmNsdXN0ZXIubG9jYWyCHCouY2xkLTEyMjUuc3ZjLmNsdXN0ZXIubG9jYWwwDQYJ
KoZIhvcNAQELBQADggEBAKb8FZZ8qewUtzmGGVlOMnZnJN064Nq2RWoNqNHz2mHz
JyabvVGD/ogLKbN7rKNkWfnZSzTsZv9OFjSmpEQkTq1duuKPDWxxdu/g3AVD6uiJ
Dy3WtTAKUTKugGCzt0Vv9WfEawvtoYGvJVFRg8MPbEvct9CugGdiXrDjUOJDh3DK
ABI5NawwsgPfqy8XsdMaDnLevh9mvDyQmWOSzw6Z0MftRxucXnc0YDsHPWXEG3TK
70a2yQJWttpKIQPpS5oEj4lxirum1BqRjIuS4pO9XXlo21RhQDGES26siSVBYt/f
WPAZJeRO5uuJu2h6qGL5BA1UH/op5Z8V/ELXA9rj6l4=
-----END CERTIFICATE-----

---
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            fe:e8:56:56:75:36:e0:e6:86:c6:8b:c3:83:55:c0:03
        Signature Algorithm: sha256WithRSAEncryption
        Issuer: CN=kube-csr-signer_@1710631750
        Validity
            Not Before: Mar 28 16:51:34 2024 GMT
            Not After : Apr 15 23:29:11 2024 GMT
        Subject: O=system:nodes, CN=system:node:*.miniotenant-hl.cld-1225.svc.cluster.local
---        
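
The manual comparison above (certificate served on port 9000 versus the files under /tmp/certs/CAs/) could be scripted roughly as follows. This is only a sketch using the Python standard library; the helper names pem_fingerprint and files_matching_served_cert are made up for illustration, and it compares only the first certificate in each file.

```python
import hashlib
import ssl
from pathlib import Path

PEM_BEGIN = "-----BEGIN CERTIFICATE-----"
PEM_END = "-----END CERTIFICATE-----"


def pem_fingerprint(pem_text: str) -> str:
    """Return the SHA-256 fingerprint of the first certificate in a PEM string."""
    start = pem_text.index(PEM_BEGIN)  # raises ValueError if no certificate
    end = pem_text.index(PEM_END, start) + len(PEM_END)
    der = ssl.PEM_cert_to_DER_cert(pem_text[start:end])
    return hashlib.sha256(der).hexdigest()


def files_matching_served_cert(served_pem: str, ca_dir: str) -> list[str]:
    """List files in ca_dir whose first certificate equals the served one."""
    wanted = pem_fingerprint(served_pem)
    matches = []
    for path in sorted(Path(ca_dir).iterdir()):
        if not path.is_file():
            continue
        try:
            if pem_fingerprint(path.read_text()) == wanted:
                matches.append(path.name)
        except ValueError:
            # Not a PEM certificate (or not decodable); skip the file.
            continue
    return matches


# Hypothetical usage from inside a tenant pod (host and port as in the curl call):
#   served = ssl.get_server_certificate(
#       ("miniotenant-ss-0-1.miniotenant-hl.cld-1225.svc.cluster.local", 9000))
#   print(files_matching_served_cert(served, "/tmp/certs/CAs"))
```

ssl.get_server_certificate does not verify the peer by default, so it behaves like curl -k here and only fetches what is being served.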

I suspect this is the CA that MinIO should be using. However, MinIO still seems to verify requests to its cluster members against an outdated CA held in memory, as the error message tls: failed to verify certificate: x509: certificate signed by unknown authority suggests. It looks like the renewed certificate file itself was re-read from disk, but the CA file wasn't.

Of course, the /tmp/certs/CAs/ directory also contains the Let's Encrypt R3 CA certificate (from the externalCertSecret), but that will only become an issue when that specific certificate is renewed in a couple of weeks, so we'll ignore it for now.

Expected Behavior

MinIO should pick up certificate renewals automatically, and every tenant pod must also reload changes made to the CA files.
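
To make the expected behavior concrete, here is a minimal sketch of content-based change detection on a CA directory. It is not MinIO's implementation; CADirWatcher, snapshot_ca_dir, and the rebuild_trust_store callback are hypothetical names used only for illustration.

```python
import hashlib
from pathlib import Path


def snapshot_ca_dir(ca_dir: str) -> dict[str, str]:
    """Map each regular file in ca_dir to the SHA-256 of its contents."""
    snapshot = {}
    for path in sorted(Path(ca_dir).iterdir()):
        if path.is_file():
            snapshot[path.name] = hashlib.sha256(path.read_bytes()).hexdigest()
    return snapshot


class CADirWatcher:
    """Remembers the last loaded snapshot and reports what changed since."""

    def __init__(self, ca_dir: str):
        self.ca_dir = ca_dir
        self.loaded = snapshot_ca_dir(ca_dir)

    def changed_files(self) -> list[str]:
        """Names added, removed, or modified since the last reload."""
        current = snapshot_ca_dir(self.ca_dir)
        names = set(current) | set(self.loaded)
        return sorted(n for n in names if current.get(n) != self.loaded.get(n))

    def reload_if_changed(self, rebuild_trust_store) -> bool:
        """Call rebuild_trust_store(ca_dir) if anything changed; return True if it ran."""
        if not self.changed_files():
            return False
        rebuild_trust_store(self.ca_dir)
        self.loaded = snapshot_ca_dir(self.ca_dir)
        return True
```

A process that only takes the snapshot once at startup would keep trusting the old CA set, which is the symptom described above; calling reload_if_changed periodically (or from a file-system notification) closes that gap.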

Current Behavior

This issue can be resolved temporarily (until next certificate renewal) by restarting the minio service with the mc CLI:

mc admin service restart <tenant>

This proves that this isn't an issue with the certificates themselves, but with minio not processing the renewed certs correctly, as this command doesn't alter the certificate files at all. They are probably just re-read from disk when the service is restarted.
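
Until the root cause is fixed, the workaround could be triggered from the logs. The sketch below only detects the error and builds the restart command from this report; it does not run anything. The function names are hypothetical, and the target argument stands in for the <tenant> placeholder above.

```python
import re
from urllib.parse import urlsplit

X509_MARKER = "x509: certificate signed by unknown authority"
URL_IN_QUOTES = re.compile(r'"(https://[^"]+)"')


def peers_with_trust_errors(log_lines):
    """Return the sorted peer hostnames named in log lines carrying the x509 error."""
    peers = set()
    for line in log_lines:
        if X509_MARKER not in line:
            continue
        match = URL_IN_QUOTES.search(line)
        if match:
            host = urlsplit(match.group(1)).hostname
            if host:
                peers.add(host)
    return sorted(peers)


def restart_command(target):
    """Build (but do not execute) the workaround command quoted in this issue."""
    return ["mc", "admin", "service", "restart", target]
```

Returning the command as a list keeps the decision to execute it (for example via subprocess.run) with the caller, which seems safer for something that restarts a storage service.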

Possible Solution

n.a.

Steps to Reproduce (for bugs)

  1. Spin up a minio tenant using this yaml:
apiVersion: minio.min.io/v2
kind: Tenant
metadata:
  labels:
    app.kubernetes.io/component: miniotenant
    app.kubernetes.io/instance: miniotenant
    app.kubernetes.io/name: miniotenant
  name: miniotenant
  namespace: cld-1225
scheduler:
  name: ''
spec:
  requestAutoCert: true
  exposeServices:
    console: false
    minio: false
  serviceAccountName: miniotenant-sa
  users:
    - name: miniotenant-user-1
  imagePullSecret: {}
  imagePullPolicy: IfNotPresent
  configuration:
    name: miniotenant-env-configuration
  pools:
    - affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: v1.min.io/tenant
                    operator: In
                    values:
                      - miniotenant
              topologyKey: kubernetes.io/hostname
      name: ss-0
      resources:
        limits:
          cpu: 50m
          memory: 400Mi
        requests:
          cpu: 2m
          memory: 200Mi
      servers: 2
      volumeClaimTemplate:
        apiVersion: v1
        kind: persistentvolumeclaims
        metadata: {}
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 1Gi
          storageClassName: csi-rbd-sc
        status: {}
      volumesPerServer: 2
  podManagementPolicy: Parallel
  image: 'minio/minio:RELEASE.2024-02-17T01-15-57Z'
  features:
    domains:
      console: 'https://minio-console-cld-1225.apps.<openshift-cluster>.com'
      minio:
        - 'https://minio-cld-1225.apps.<openshift-cluster>.com'
    enableSFTP: false
  mountPath: /export
  externalCertSecret:
    - name: miniotenant-certificate-secret-tls
      type: kubernetes.io/tls
status:
  usage:
    capacity: 2055798784
    rawCapacity: 4294967296
    rawUsage: 70713344
    usage: 70713344
  availableReplicas: 2
  healthMessage: Service Unavailable
  healthStatus: red
  provisionedUsers: true
  pools:
    - legacySecurityContext: false
      ssName: miniotenant-ss-0
      state: PoolInitialized
  currentState: Initialized
  drivesOffline: 2
  revision: 0
  certificates:
    autoCertEnabled: true
    customCertificates:
      minio:
        - certName: miniotenant-certificate-secret-tls
          domains:
            - minio-cld-1225.apps.<openshift-cluster>.com
            - minio-cld-1225.apps.<openshift-cluster>.com
            - minio-console-cld-1225.apps.<openshift-cluster>.com
          expiresIn: '72 days, 22 hours, 55 minutes, 32 seconds'
          expiry: '2024-06-13T13:58:12Z'
          serialNo: '426110582089806621805261802254366324527450'
        - certName: miniotenant-certificate-secret-tls
          domains:
            - R3
          expiresIn: '532 days, 0 hours, 57 minutes, 20 seconds'
          expiry: '2025-09-15T16:00:00Z'
          serialNo: '192961496339968674994309121183282847578'
  drivesOnline: 2
  syncVersion: v5.0.0
  writeQuorum: 3
  2. Wait until the certificate and CA are renewed by the Operator.
  3. Check the logs for requests to other cluster members; they should fail with the error message quoted above. The Tenant object's /status/healthMessage also reports Service Unavailable.
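
The status check in the last step can be sketched as below, assuming the Tenant object has been exported as JSON (for example with kubectl get ... -o json; the exact resource name is an assumption). Only fields that appear in the status block above are read, and the function names are hypothetical.

```python
import json


def tenant_looks_broken(tenant: dict) -> bool:
    """True if the status matches the failure seen in this report."""
    status = tenant.get("status", {})
    return status.get("healthStatus") == "red" or status.get("drivesOffline", 0) > 0


def summarize_tenant(tenant_json: str) -> str:
    """One-line summary of the health-related status fields."""
    status = json.loads(tenant_json).get("status", {})
    return (
        f"healthStatus={status.get('healthStatus')} "
        f"healthMessage={status.get('healthMessage')} "
        f"drivesOnline={status.get('drivesOnline')} "
        f"drivesOffline={status.get('drivesOffline')}"
    )
```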

Context

Regression

Your Environment

  • Version used (minio-operator): minio-operator v5.0.13
  • Environment name and version (e.g. kubernetes v1.17.2): Red Hat OpenShift v4.12
  • Server type and version: minio/minio:RELEASE.2024-02-17T01-15-57Z
  • Operating System and version (uname -a): n.a.
  • Link to your deployment file: see "Steps to Reproduce" above
@cniackz
Contributor

cniackz commented Apr 8, 2024

AR @cniackz please test this in openshift and see if you can reproduce; then talk to our expert @pjuarezd and get some advice/help.

Also maybe this can be of help:

#1971
#1973

because https://github.com/minio/operator/blob/master/docs/cert-manager.md#create-operator-ca-tls-secret is, by design, a manual step that has to be repeated on every renewal...

@cniackz
Contributor

cniackz commented Apr 8, 2024

Hey guys, I have an idea. Why are you using requestAutoCert: true here? Based on my testing, when using cert-manager, you should disable it:

Check this out:

spec:
  ## Disable default TLS certificates.
  requestAutoCert: false

Could you please try disabling it and using our example, or something similar? The process will still involve manual steps, but at least you will no longer rely on Operator-issued certificates, only on cert-manager. Once we have a working solution for the rotation, this shouldn't cause any further problems.

Also, if you get a chance, please try the ideas from the following PRs and let us know if they work for you in OpenShift:

@thedulus
Author

Hi @cniackz,

The reason we combine the externalCertSecret (issued by a cert-manager instance) with requestAutoCert (issued by the MinIO Operator) is that our cert-manager ACME (Let's Encrypt) issuer cannot issue certificates for cluster-internal domain names (.svc.cluster.local):

"The certificate request has failed to complete and will be retried: Failed to wait for order resource "minio-console-cld-1225.apps.<openshift-cluster>.com-kwmf5-4021417717" to become ready: order is in "errored" state: Failed to create Order: 400 urn:ietf:params:acme:error:rejectedIdentifier: Error creating new order :: Cannot issue for "*.cld-1225.svc.cluster.local": Domain name does not end with a valid public suffix (TLD) (and 2 more problems. Refer to sub-problems for more information.); subproblems:\n\turn:ietf:params:acme:error:malformed: [dns: *.cld-1225.svc.cluster.local] Error creating new order :: Domain name does not end with a valid public suffix (TLD)\n\turn:ietf:params:acme:error:malformed: [dns: *.minio.cld-1225.svc.cluster.local] Error creating new order :: Domain name does not end with a valid public suffix (TLD)\n\turn:ietf:params:acme:error:malformed: [dns: *.miniotenant-hl.cld-1225.svc.cluster.local] Error creating new order :: Domain name does not end with a valid public suffix (TLD)

So in order to get valid certificates for inter-pod communication (meaning between multiple MinIO cluster member pods) we need requestAutoCert: true.
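
The split described here can be illustrated with a small heuristic that sorts requested DNS names into those a public ACME CA could plausibly issue for and those that are cluster-internal. This is a sketch only: the suffix list assumes the default cluster.local cluster domain, it is not a real public-suffix check, and the function names are hypothetical.

```python
# Assumed internal suffixes; "cluster.local" is the Kubernetes default and may differ.
INTERNAL_SUFFIXES = (".svc", ".svc.cluster.local", ".cluster.local")


def is_cluster_internal(name: str) -> bool:
    """Heuristic: True if the DNS name ends with a cluster-internal suffix."""
    bare = name.lower().rstrip(".")
    return bare.endswith(INTERNAL_SUFFIXES)


def split_names(names):
    """Return (public, internal) lists, preserving the input order."""
    public, internal = [], []
    for name in names:
        (internal if is_cluster_internal(name) else public).append(name)
    return public, internal
```

With this split, the public names would go to the cert-manager certificate and the internal ones stay with requestAutoCert, matching the setup in this issue. Short names without a suffix (such as minio.cld-1225) are not caught by this heuristic.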
