
Race condition when two identical certificate requests are made from different clusters #6229

Open
pedrohdz opened this issue Jul 21, 2023 · 12 comments
Labels
area/acme/dns01: Indicates a PR modifies ACME DNS01 provider code
help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
kind/bug: Categorizes issue or PR as related to a bug.
priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.

Comments

@pedrohdz

Describe the bug:

It seems we have a race condition when two Kubernetes clusters with near-identical configuration both use DNS01 to request the same certificate.

We have two Kubernetes clusters, one primary and the other secondary for disaster recovery. Both are in different Azure regions and configured almost identically.

Identical Certificate resources are created on both clusters within a minute of each other. The first cluster to get a response from LetsEncrypt deletes the _acme-challenge TXT record, and the second cluster's Certificate resource is left with a READY status of False indefinitely.

---
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-live
spec:
  acme:
    preferredChain: ""
    privateKeySecretRef:
      name: letsencrypt-key-secret-live
    server: https://acme-v02.api.letsencrypt.org/directory
    solvers:
    - dns01:
        azureDNS:
          environment: AzurePublicCloud
          hostedZoneName: env.REDACTED.com
          resourceGroupName: REDACTED_RESOURCE_GROUP_NAME
          subscriptionID: REDACTED_SUBSCRIPTION_ID
        cnameStrategy: Follow
      selector:
        dnsZones:
        - REDACTED.com

---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: FOO-prod-certificate
  namespace: FOO-prod
spec:
  dnsNames:
  - FOO.REDACTED.com
  duration: 2160h0m0s
  issuerRef:
    kind: ClusterIssuer
    name: letsencrypt-live
  privateKey:
    algorithm: RSA
    encoding: PKCS1
    size: 2048
  renewBefore: 720h0m0s
  secretName: FOO-prod-certificate
  usages:
  - server auth
  - client auth
  - key encipherment
  - digital signature
...

Expected behaviour:

The second cluster should detect that the _acme-challenge DNS record was deleted, and then re-attempt the request with LetsEncrypt.
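For illustration only, a check of that kind could look something like the minimal Go sketch below. This is an assumption about the desired behaviour, not cert-manager code; the FQDN and expected value are placeholders:

package main

import (
	"fmt"
	"net"
)

// recordStillPresent reports whether the challenge FQDN still publishes the
// expected TXT value. If another cluster deleted or replaced the record, the
// challenge should be re-presented rather than left waiting indefinitely.
func recordStillPresent(fqdn, expected string) (bool, error) {
	values, err := net.LookupTXT(fqdn)
	if err != nil {
		return false, err
	}
	for _, v := range values {
		if v == expected {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	ok, err := recordStillPresent("_acme-challenge.FOO.REDACTED.com", "EXPECTED_CHALLENGE_VALUE")
	fmt.Println(ok, err)
}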

Steps to reproduce the bug:

See above.

Anything else we need to know?:

Environment details:

  • Kubernetes version: 1.25.6
  • Cloud-provider/provisioner: Azure
  • cert-manager version: quay.io/jetstack/cert-manager-acmesolver:v1.11.1
  • Install method: Helm Chart Version 1.11.1

/kind bug

@jetstack-bot jetstack-bot added the kind/bug Categorizes issue or PR as related to a bug. label Jul 21, 2023
@irbekrm
Collaborator

irbekrm commented Jul 21, 2023

We've had this issue with a couple of different DNS providers, and it has been solved for Route53 and CloudDNS so far.

There is probably a way to achieve the same for Azure DNS. None of the maintainers are very familiar with Azure, and we also don't have Azure infrastructure to test this, so it would be good if an external contributor who uses Azure and can test there could pick this up.

@irbekrm irbekrm added help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. area/acme/dns01 Indicates a PR modifies ACME DNS01 provider code priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Jul 21, 2023
@michalg91

The same problem exists for Cloudflare DNS.

@eplightning
Contributor

For Azure DNS this can be solved using the optimistic concurrency that its APIs support.

I have a working PoC that I never attempted to upstream, as I'm not sure how to properly and consistently test something like this:
eplightning@2f1fbb1
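
For reference, a rough sketch of how the ETag-based optimistic concurrency could be used via the azure-sdk-for-go armdns client is below. This is an assumption about how it might be wired up, not the PoC linked above; the subscription, resource group, zone, and challenge values are placeholders:

package main

import (
	"context"
	"log"

	"github.com/Azure/azure-sdk-for-go/sdk/azcore/to"
	"github.com/Azure/azure-sdk-for-go/sdk/azidentity"
	"github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/dns/armdns"
)

func main() {
	ctx := context.Background()
	cred, err := azidentity.NewDefaultAzureCredential(nil)
	if err != nil {
		log.Fatal(err)
	}
	client, err := armdns.NewRecordSetsClient("SUBSCRIPTION_ID", cred, nil)
	if err != nil {
		log.Fatal(err)
	}

	// Read the existing TXT record set (if any) to capture its ETag.
	// A real implementation would distinguish a 404 from other errors here.
	existing, err := client.Get(ctx, "RESOURCE_GROUP", "env.example.com",
		"_acme-challenge.foo", armdns.RecordTypeTXT, nil)

	opts := &armdns.RecordSetsClientCreateOrUpdateOptions{}
	values := []*armdns.TxtRecord{{Value: []*string{to.Ptr("CHALLENGE_TOKEN")}}}
	if err != nil {
		// Record set does not exist yet: only create it if nobody else does so
		// concurrently (If-None-Match: *).
		opts.IfNoneMatch = to.Ptr("*")
	} else {
		// Record set exists: merge our value in and require that it has not
		// changed since we read it (If-Match: <etag>).
		values = append(existing.Properties.TxtRecords, values...)
		opts.IfMatch = existing.Etag
	}

	_, err = client.CreateOrUpdate(ctx, "RESOURCE_GROUP", "env.example.com",
		"_acme-challenge.foo", armdns.RecordTypeTXT,
		armdns.RecordSet{Properties: &armdns.RecordSetProperties{
			TTL:        to.Ptr[int64](60),
			TxtRecords: values,
		}}, opts)
	if err != nil {
		// A 412 Precondition Failed here means another writer (for example the
		// other cluster) modified the record set; the caller would re-read and retry.
		log.Fatal(err)
	}
}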

@chr15murray

chr15murray commented Aug 31, 2023

We are hitting the same issue with the Cloudflare provider.

Would a simpler approach be to allow the TXT record name to be changed to something other than _acme-challenge.*?

That way clusterA could use _acme-challenge-clusterA and clusterB could use _acme-challenge-clusterB TXT records.

This would then fix it for all providers. I believe this is hardcoded here but could potentially be exposed as a config item (env var, startup flag, or ClusterIssuer spec).

This would also be easier to add test cases for.

@michalg91

michalg91 commented Aug 31, 2023

@chr15murray there is a change merged into the main branch (#5884 & #6191); I read that it will be released with the 1.13-alpha1 image. We had the same issue with our clusters and managed to build the main branch - it is working.

@chr15murray

> @chr15murray there is a change merged into the main branch (#5884 & #6191); I read that it will be released with the 1.13-alpha1 image. We had the same issue with our clusters and managed to build the main branch - it is working.

Thanks @michalg91, will look to get this deployed.

@jetstack-bot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle stale

@jetstack-bot jetstack-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 29, 2023
@jetstack-bot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle rotten
/remove-lifecycle stale

@jetstack-bot jetstack-bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 29, 2023
@eplightning
Contributor

/remove-lifecycle rotten

@jetstack-bot jetstack-bot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jan 7, 2024
@eplightning
Contributor

eplightning commented Jan 7, 2024

Continues to be an issue for Azure DNS. There's a PR open that should fix the issue: #6351

@jetstack-bot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
/lifecycle stale

@jetstack-bot jetstack-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 6, 2024
@eplightning
Contributor

/remove-lifecycle stale

@jetstack-bot jetstack-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 6, 2024