New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
issuer: acme: clouddns: Fix race condition #6088
Conversation
Hi @cypres. Thanks for your PR. I'm waiting for a cert-manager member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/ok-to-test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for working on this @cypres .
I agree that this is an issue that we should fix.
I see that the change in this PR makes it so that adding a TXT record, deletes and re-applies any other found records with the same FQDN via the changes API and the cleanup does the same.
I don't know how the changes API is implemented internally, but I am concerned that this could in practice cause a 'downtime' for the records that are re-added (in between them being deleted and added again).
Do you think we could instead use ResourceRecordSets:patch
to patch RRSets that contain records that we don't want to delete?
Looking at the docs we already request permissions for dns.resourceRecordSets.*
so this should not be a breaking change https://cert-manager.io/docs/configuration/acme/dns01/google/#set-up-a-service-account
Thank you for the feedback, I replied below.
I understand your concern, but that is actually how the changes API is intended to be used. The changes are applied atomically, from the Google documentation;
Sadly, google has no way to patch the rrdata for a TXT record to add or remove a single value. # Create a dummy record with value "foo"
$ gcloud dns record-sets create foo.myzone.tld --rrdatas="foo" --ttl=60 --type=TXT --zone=test
NAME TYPE TTL DATA
foo.myzone.tld. TXT 60 "foo"
# Patch a partial update
$ gcloud dns record-sets update foo.myzone.tld --rrdatas="bar" --ttl=60 --type=TXT --zone=test
NAME TYPE TTL DATA
foo.myzone.tld. TXT 60 "bar"
# Inspect the record
$ gcloud dns record-sets describe foo.myzone.tld --type=TXT --zone=test
NAME TYPE TTL DATA
foo.myzone.tld. TXT 60 "bar" |
Thanks for the detailed explanation. Could we please add a comment along the lines of this explanation to the relevant bits of code to prevent confusion in the future? |
If running multiple certmanagers they can race against each other Signed-off-by: Hans Arnholm <hans@arnholm.dk>
Thank you. I expanded the comments a bit, let me know if you want something more long-form. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for working on this @cypres
I tested it with two clusters with a CloudDNS issuer and a cert with the same DNS name for both clusters.
I could reproduce the error (in my case, both Challenge
s ended up in an invalid state as one of them overwrote the value of another after the self-check had already validated it, the first one then failed ACME validation and deleted the text record which caused the second one to fail.
I saw both issuances succeed with this fix, observed TXT records in the acme challenge record, both issuances succeeded and the record got cleaned up afterwards as expected.
I think that we will not backport this- it is arguably almost like a new feature (we didn't really promise multicluster concurrency before), but it would be good to cut a v1.13 alpha release soon-ish so interested folks could try it out.
(/cc @aww-aww as I imagine that you folks might have seen this issue too and might be interested in the fix)
/lgtm
|
||
// Look for existing records. | ||
list, err := c.client.ResourceRecordSets.List(c.project, zone).Name(fqdn).Type("TXT").Do() | ||
if err != nil { | ||
return err | ||
} | ||
if len(list.Rrsets) > 0 { | ||
// Attempt to delete the existing records when adding our new one. | ||
// Merge the existing RR Data into the new one, requires a delete and an add operation, or it will fail. | ||
// The operations are applied atomically to the zone, so there is no point in time where the entire TXT record is deleted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding that 👍🏼
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: cypres, irbekrm, tobiasbrodersen The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
This PR contains the following updates: | Package | Update | Change | |---|---|---| | [cert-manager](https://github.com/cert-manager/cert-manager) | minor | `v1.12.4` -> `v1.13.0` | --- ### Release Notes <details> <summary>cert-manager/cert-manager (cert-manager)</summary> ### [`v1.13.0`](https://github.com/cert-manager/cert-manager/releases/tag/v1.13.0) [Compare Source](cert-manager/cert-manager@v1.12.4...v1.13.0) cert-manager is the easiest way to automatically manage certificates in Kubernetes and OpenShift clusters. This is the 1.13 release of cert-manager! #### Community Welcome to these new cert-manager members (more info - cert-manager/cert-manager#6260): [@​jsoref](https://github.com/jsoref) [@​FlorianLiebhart](https://github.com/FlorianLiebhart) [@​hawksight](https://github.com/hawksight) [@​erikgb](https://github.com/erikgb) Thanks again to all open-source contributors with commits in this release, including: [@​AcidLeroy](https://github.com/AcidLeroy) [@​FlorianLiebhart](https://github.com/FlorianLiebhart) [@​lucacome](https://github.com/lucacome) [@​cypres](https://github.com/cypres) [@​erikgb](https://github.com/erikgb) [@​ubergesundheit](https://github.com/ubergesundheit) [@​jkroepke](https://github.com/jkroepke) [@​jsoref](https://github.com/jsoref) [@​gdvalle](https://github.com/gdvalle) [@​rouke-broersma](https://github.com/rouke-broersma) [@​schrodit](https://github.com/schrodit) [@​zhangzhiqiangcs](https://github.com/zhangzhiqiangcs) [@​arukiidou](https://github.com/arukiidou) [@​hawksight](https://github.com/hawksight) [@​Richardds](https://github.com/Richardds) [@​kahirokunn](https://github.com/kahirokunn) Thanks also to the following cert-manager maintainers for their contributions during this release: [@​SgtCoDFish](https://github.com/SgtCoDFish) [@​maelvls](https://github.com/maelvls) [@​irbekrm](https://github.com/irbekrm) [@​inteon](https://github.com/inteon) Equally thanks to everyone who provided feedback, helped users and raised issues on Github and Slack and joined our meetings! Special thanks to [@​AcidLeroy](https://github.com/AcidLeroy) for adding "load options from a versioned config file" support for the cert-manager controller! This has been on our wishlist for a very long time. (see cert-manager/cert-manager#5337) Also, thanks a lot to [@​FlorianLiebhart](https://github.com/FlorianLiebhart) for adding support for DNS over HTTPS for the ACME DNS self-check. This is very useful in case all traffic must be HTTP(S) trafic, eg. when using a HTTPS_PROXY. (see cert-manager/cert-manager#5003) Thanks also to the [CNCF](https://www.cncf.io/), which provides resources and support, and to the AWS open source team for being good community members and for their maintenance of the [PrivateCA Issuer](https://github.com/cert-manager/aws-privateca-issuer). In addition, massive thanks to [Venafi](https://www.venafi.com/) for contributing developer time and resources towards the continued maintenance of cert-manager projects. #### Changes since v1.12.0 ##### Feature - Add support for logging options to webhook config file. ([#​6243](cert-manager/cert-manager#6243), [@​inteon](https://github.com/inteon)) - Add view permissions to the well-known (Openshift) user-facing `cluster-reader` aggregated cluster role ([#​6241](cert-manager/cert-manager#6241), [@​erikgb](https://github.com/erikgb)) - Certificate Shim: distinguish dns names and ip address in certificate ([#​6267](cert-manager/cert-manager#6267), [@​zhangzhiqiangcs](https://github.com/zhangzhiqiangcs)) - Cmctl can now be imported by third parties. ([#​6049](cert-manager/cert-manager#6049), [@​SgtCoDFish](https://github.com/SgtCoDFish)) - Make `enableServiceLinks` configurable for all Deployments and `startupapicheck` Job in Helm chart. ([#​6292](cert-manager/cert-manager#6292), [@​ubergesundheit](https://github.com/ubergesundheit)) - Promoted the StableCertificateRequestName and SecretsFilteredCaching feature gates to Beta (enabled by default). ([#​6298](cert-manager/cert-manager#6298), [@​inteon](https://github.com/inteon)) - The cert-manager controller options are now configurable using a configuration file. ([#​5337](cert-manager/cert-manager#5337), [@​AcidLeroy](https://github.com/AcidLeroy)) - The pki CertificateTemplate functions now perform validation of the CSR blob, making sure we sign a Certificate that matches the IsCA and (Extended)KeyUsages that are defined in the CertificateRequest resource. ([#​6199](cert-manager/cert-manager#6199), [@​inteon](https://github.com/inteon)) - \[helm] Add prometheus.servicemonitor.endpointAdditionalProperties to define additional properties on a ServiceMonitor endpoint, e.g. relabelings ([#​6110](cert-manager/cert-manager#6110), [@​jkroepke](https://github.com/jkroepke)) ##### Design - DNS over HTTPS (DoH) is now possible for doing the self-checks during the ACME verification. The DNS check method to be used is controlled through the command line flag: `--dns01-recursive-nameservers-only=true` in combination with `--dns01-recursive-nameservers=https://<DoH-endpoint>` (e.g. `https://8.8.8.8/dns-query`). It keeps using DNS lookup as a default method. ([#​5003](cert-manager/cert-manager#5003), [@​FlorianLiebhart](https://github.com/FlorianLiebhart)) ##### Bug or Regression - Allow overriding default pdb .minAvailable with .maxUnavailable without setting .minAvailable to null ([#​6087](cert-manager/cert-manager#6087), [@​rouke-broersma](https://github.com/rouke-broersma)) - BUGFIX: `cmctl check api --wait 0` exited without output and exit code 1; we now make sure we perform the API check at least once and return with the correct error code ([#​6109](cert-manager/cert-manager#6109), [@​inteon](https://github.com/inteon)) - BUGFIX: the issuer and certificate-name annotations on a Secret were incorrectly updated when other fields are changed. ([#​6147](cert-manager/cert-manager#6147), [@​inteon](https://github.com/inteon)) - BUGFIX\[cainjector]: 1-character bug was causing invalid log messages and a memory leak ([#​6232](cert-manager/cert-manager#6232), [@​inteon](https://github.com/inteon)) - Fix CloudDNS issuers stuck in propagation check, when multiple instances are issuing for the same FQDN ([#​6088](cert-manager/cert-manager#6088), [@​cypres](https://github.com/cypres)) - Fix indentation of Webhook NetworkPolicy matchLabels in helm chart. ([#​6220](cert-manager/cert-manager#6220), [@​ubergesundheit](https://github.com/ubergesundheit)) - Fixed Cloudflare DNS01 challenge provider race condition when validating multiple domains ([#​6191](cert-manager/cert-manager#6191), [@​Richardds](https://github.com/Richardds)) - Fixes a bug where webhook was pulling in controller's feature gates.⚠️ ⚠️ BREAKING⚠️ ⚠️ : If you deploy cert-manager using helm and have `.featureGates` value set, the features defined there will no longer be passed to cert-manager webhook, only to cert-manager controller. Use `webhook.featureGates` field instead to define features to be enabled on webhook. **Potentially breaking**: If you were, for some reason, passing cert-manager controller's features to webhook's `--feature-gates` flag, this will now break (unless the webhook actually has a feature by that name). ([#​6093](cert-manager/cert-manager#6093), [@​irbekrm](https://github.com/irbekrm)) - Fixes an issue where cert-manager would incorrectly reject two IP addresses as being unequal when they should have compared equal. This would be most noticeable when using an IPv6 address which doesn't match how Go's `net.IP.String()` function would have printed that address. ([#​6293](cert-manager/cert-manager#6293), [@​SgtCoDFish](https://github.com/SgtCoDFish)) - We disabled the `enableServiceLinks` option for our ACME http solver pods, because the option caused the pod to be in a crash loop in a cluster with lot of services. ([#​6143](cert-manager/cert-manager#6143), [@​schrodit](https://github.com/schrodit)) -⚠️ possibly breaking: Webhook validation of CertificateRequest resources is stricter now: all KeyUsages and ExtendedKeyUsages must be defined directly in the CertificateRequest resource, the encoded CSR can never contain more usages that defined there. ([#​6182](cert-manager/cert-manager#6182), [@​inteon](https://github.com/inteon)) ##### Other (Cleanup or Flake) - A subset of the klogs flags have been deprecated and will be removed in the future. ([#​5879](cert-manager/cert-manager#5879), [@​maelvls](https://github.com/maelvls)) - All service links in helm chart deployments have been disabled. ([#​6144](cert-manager/cert-manager#6144), [@​schrodit](https://github.com/schrodit)) - Cert-manager will now re-issue a certificate if the public key in the latest CertificateRequest resource linked to a Certificate resource does not match the public key of the key encoded in the Secret linked to that Certificate resource ([#​6168](cert-manager/cert-manager#6168), [@​inteon](https://github.com/inteon)) - Chore: When hostNetwork is enabled, dnsPolicy is now set to ClusterFirstWithHostNet. ([#​6156](cert-manager/cert-manager#6156), [@​kahirokunn](https://github.com/kahirokunn)) - Cleanup the controller configfile structure by introducing sub-structs. ([#​6242](cert-manager/cert-manager#6242), [@​inteon](https://github.com/inteon)) - Don't run API Priority and Fairness controller in webhook's extension apiserver ([#​6085](cert-manager/cert-manager#6085), [@​irbekrm](https://github.com/irbekrm)) - Helm: Add apache 2.0 license annotation ([#​6225](cert-manager/cert-manager#6225), [@​arukiidou](https://github.com/arukiidou)) - Make apis/acme/v1/ACMEIssuer.PreferredChain optional in JSON serialization. ([#​6034](cert-manager/cert-manager#6034), [@​gdvalle](https://github.com/gdvalle)) - The SecretPostIssuancePolicyChain now also makes sure that the `cert-manager.io/common-name`, `cert-manager.io/alt-names`, ... annotations on Secrets are kept at their correct value. ([#​6176](cert-manager/cert-manager#6176), [@​inteon](https://github.com/inteon)) - The cmctl logging has been improved and support for json logging has been added. ([#​6247](cert-manager/cert-manager#6247), [@​inteon](https://github.com/inteon)) - Updates Kubernetes libraries to `v0.27.2`. ([#​6077](cert-manager/cert-manager#6077), [@​lucacome](https://github.com/lucacome)) - Updates Kubernetes libraries to `v0.27.4`. ([#​6227](cert-manager/cert-manager#6227), [@​lucacome](https://github.com/lucacome)) - We now only check that the issuer name, kind and group annotations on a Secret match in case those annotations are set. ([#​6152](cert-manager/cert-manager#6152), [@​inteon](https://github.com/inteon)) </details> --- ### Configuration 📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied. ♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this PR and you won't be reminded about this update again. --- - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box --- This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate). <!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNi4yMy4yIiwidXBkYXRlZEluVmVyIjoiMzYuMjMuMiIsInRhcmdldEJyYW5jaCI6Im1haW4ifQ==--> Reviewed-on: https://git.home/nrdufour/home-ops/pulls/84 Co-authored-by: Renovate <renovate@ptinem.io> Co-committed-by: Renovate <renovate@ptinem.io>
Pull Request Motivation
If you are running multiple cert managers, like in different clusters, they can race against each other when using Google CloudDNS dns01 issuer.
This is because cert-manager removes any previous txt records under _acme-challenge for the FQDN it's trying to issue. This causes a state where cert-manager will have successfully done the Present() call, but then never complete the challenge.
Log lines associated with this would be like:
To solve this, I have changed the behavior so instead of the issuer removing all existing _acme-challenge it does an atomic update to add the wanted rrdata (txt value) to the record, and during CleanUp() remove it again unless it's the last one, at which points the removes the record.
There can still be races, but instead of getting into a state where it's stuck in propagation check failed, the back-off allows the cert-manager's to quickly reach a state where all the records are added and then removed as needed, and certificate requests are fulfilled.
When you are doing GitOps, where a configuration change is automatically pushed to several clusters at the same time, it's quite common to hit this race condition. Without this PR, cluster admins had to manually intervene and resolve the race condition by removing the certificate and reapplying the kubernetes manifests or similar.
I tested by applying a certificate to three clusters simultaniously with
Kind
bug
Release Note