
Race condition when two identical certificate requests are made from different clusters #6229

Open
pedrohdz opened this issue Jul 21, 2023 · 12 comments
Labels
area/acme/dns01: Indicates a PR modifies ACME DNS01 provider code
help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
kind/bug: Categorizes issue or PR as related to a bug.
priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.

Comments

@pedrohdz

Describe the bug:

It seems we have a race condition when two Kubernetes clusters with near-identical configuration both use DNS01 to request the same certificate.

We have two Kubernetes clusters, one primary and the other secondary for disaster recovery. Both are in different Azure regions and configured almost identically.

Identical Certificate resources are created on both clusters within a minute of each other. The first cluster to get a response from LetsEncrypt deletes the _acme-challenge TXT record, and the second cluster's Certificate resource is left with a READY status of False indefinitely.

---
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-live
spec:
  acme:
    preferredChain: ""
    privateKeySecretRef:
      name: letsencrypt-key-secret-live
    server: https://acme-v02.api.letsencrypt.org/directory
    solvers:
    - dns01:
        azureDNS:
          environment: AzurePublicCloud
          hostedZoneName: env.REDACTED.com
          resourceGroupName: REDACTED_RESOURCE_GROUP_NAME
          subscriptionID: REDACTED_SUBSCRIPTION_ID
        cnameStrategy: Follow
      selector:
        dnsZones:
        - REDACTED.com

---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: FOO-prod-certificate
  namespace: FOO-prod
spec:
  dnsNames:
  - FOO.REDACTED.com
  duration: 2160h0m0s
  issuerRef:
    kind: ClusterIssuer
    name: letsencrypt-live
  privateKey:
    algorithm: RSA
    encoding: PKCS1
    size: 2048
  renewBefore: 720h0m0s
  secretName: FOO-prod-certificate
  usages:
  - server auth
  - client auth
  - key encipherment
  - digital signature
...

Expected behaviour:

The second cluster should detect that the _acme-challenge DNS record was deleted, and then re-attempt the request with LetsEncrypt.
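For illustration only, a check of that kind could look something like the minimal Go sketch below. This is an assumption about the desired behaviour, not cert-manager code; the FQDN and expected value are placeholders:

package main

import (
	"fmt"
	"net"
)

// recordStillPresent reports whether the challenge FQDN still publishes the
// expected TXT value. If another cluster deleted or replaced the record, the
// challenge should be re-presented rather than left waiting indefinitely.
func recordStillPresent(fqdn, expected string) (bool, error) {
	values, err := net.LookupTXT(fqdn)
	if err != nil {
		return false, err
	}
	for _, v := range values {
		if v == expected {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	ok, err := recordStillPresent("_acme-challenge.FOO.REDACTED.com", "EXPECTED_CHALLENGE_VALUE")
	fmt.Println(ok, err)
}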

Steps to reproduce the bug:

See above.

Anything else we need to know?:

Environment details:

  • Kubernetes version: 1.25.6
  • Cloud-provider/provisioner: Azure
  • cert-manager version: quay.io/jetstack/cert-manager-acmesolver:v1.11.1
  • Install method: Helm Chart Version 1.11.1

/kind bug

@jetstack-bot jetstack-bot added the kind/bug Categorizes issue or PR as related to a bug. label Jul 21, 2023
@irbekrm
Collaborator

irbekrm commented Jul 21, 2023

We've had this issue with a couple of different DNS providers, and it has been solved for Route53 and CloudDNS so far.

There is probably a way to achieve the same for Azure DNS. None of the maintainers are very familiar with Azure, and we also don't have Azure infrastructure to test this, so it would be good if an external contributor who uses Azure and can test there could pick this up.

@irbekrm irbekrm added help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. area/acme/dns01 Indicates a PR modifies ACME DNS01 provider code priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Jul 21, 2023
@michalg91

The same problem exists for Cloudflare DNS.

@eplightning
Contributor

For Azure DNS this can be solved using the optimistic concurrency that its APIs support.

I have a working PoC that I never attempted to upstream, as I'm not sure how to properly and consistently test something like this:
eplightning@2f1fbb1
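
For reference, a rough sketch of how the ETag-based optimistic concurrency could be used via the azure-sdk-for-go armdns client is below. This is an assumption about how it might be wired up, not the PoC linked above; the subscription, resource group, zone, and challenge values are placeholders:

package main

import (
	"context"
	"log"

	"github.com/Azure/azure-sdk-for-go/sdk/azcore/to"
	"github.com/Azure/azure-sdk-for-go/sdk/azidentity"
	"github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/dns/armdns"
)

func main() {
	ctx := context.Background()
	cred, err := azidentity.NewDefaultAzureCredential(nil)
	if err != nil {
		log.Fatal(err)
	}
	client, err := armdns.NewRecordSetsClient("SUBSCRIPTION_ID", cred, nil)
	if err != nil {
		log.Fatal(err)
	}

	// Read the existing TXT record set (if any) to capture its ETag.
	// A real implementation would distinguish a 404 from other errors here.
	existing, err := client.Get(ctx, "RESOURCE_GROUP", "env.example.com",
		"_acme-challenge.foo", armdns.RecordTypeTXT, nil)

	opts := &armdns.RecordSetsClientCreateOrUpdateOptions{}
	values := []*armdns.TxtRecord{{Value: []*string{to.Ptr("CHALLENGE_TOKEN")}}}
	if err != nil {
		// Record set does not exist yet: only create it if nobody else does so
		// concurrently (If-None-Match: *).
		opts.IfNoneMatch = to.Ptr("*")
	} else {
		// Record set exists: merge our value in and require that it has not
		// changed since we read it (If-Match: <etag>).
		values = append(existing.Properties.TxtRecords, values...)
		opts.IfMatch = existing.Etag
	}

	_, err = client.CreateOrUpdate(ctx, "RESOURCE_GROUP", "env.example.com",
		"_acme-challenge.foo", armdns.RecordTypeTXT,
		armdns.RecordSet{Properties: &armdns.RecordSetProperties{
			TTL:        to.Ptr[int64](60),
			TxtRecords: values,
		}}, opts)
	if err != nil {
		// A 412 Precondition Failed here means another writer (for example the
		// other cluster) modified the record set; the caller would re-read and retry.
		log.Fatal(err)
	}
}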

@chr15murray

chr15murray commented Aug 31, 2023

We are hitting the same issue with the Cloudflare provider.

Would a simpler approach be to allow the TXT record name to be changed to something other than _acme-challenge.*?

That way clusterA could use _acme-challenge-clusterA and clusterB could use _acme-challenge-clusterB TXT records.

This would then fix it for all providers. I believe this is hardcoded here but could potentially be exposed as a config item (env var, startup flag, or ClusterIssuer spec).

This would also be easier to add test cases for.

@michalg91

michalg91 commented Aug 31, 2023

@chr15murray there is a change merged into the main branch (#5884 & #6191); I read that it will be released with the 1.13-alpha1 image. We had the same issue with our clusters and managed to build the main branch - it is working.

@chr15murray

> @chr15murray there is a change merged into the main branch (#5884 & #6191); I read that it will be released with the 1.13-alpha1 image. We had the same issue with our clusters and managed to build the main branch - it is working.

Thanks @michalg91, will look to get this deployed.

@jetstack-bot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle stale

@jetstack-bot jetstack-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 29, 2023
@jetstack-bot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle rotten
/remove-lifecycle stale

@jetstack-bot jetstack-bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 29, 2023
@eplightning
Contributor

/remove-lifecycle rotten

@jetstack-bot jetstack-bot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jan 7, 2024
@eplightning
Contributor

eplightning commented Jan 7, 2024

Continues to be an issue for Azure DNS. There's a PR open that should fix the issue: #6351

@jetstack-bot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
/lifecycle stale

@jetstack-bot jetstack-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 6, 2024
@eplightning
Contributor

/remove-lifecycle stale

@jetstack-bot jetstack-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 6, 2024