Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

issuer: acme: clouddns: Fix race condition #6088

Merged
merged 1 commit into from Jun 2, 2023

Conversation

cypres
Copy link
Contributor

@cypres cypres commented May 22, 2023

Pull Request Motivation

If you are running multiple cert managers, like in different clusters, they can race against each other when using Google CloudDNS dns01 issuer.

This is because cert-manager removes any previous txt records under _acme-challenge for the FQDN it's trying to issue. This causes a state where cert-manager will have successfully done the Present() call, but then never complete the challenge.

Log lines associated with this would be like:

cert-manager/challenges "msg"="propagation check failed" "error"="DNS record for "foo.example.com" not yet propagated" "dnsName"="foo.example.com" "resource_kind"="Challenge" "resource_name"="foo.example.com-755w2-153119475-769035721" "resource_namespace"="istio-system" "resource_version"="v1" "type"="DNS-01"

To solve this, I have changed the behavior so instead of the issuer removing all existing _acme-challenge it does an atomic update to add the wanted rrdata (txt value) to the record, and during CleanUp() remove it again unless it's the last one, at which points the removes the record.

There can still be races, but instead of getting into a state where it's stuck in propagation check failed, the back-off allows the cert-manager's to quickly reach a state where all the records are added and then removed as needed, and certificate requests are fulfilled.

When you are doing GitOps, where a configuration change is automatically pushed to several clusters at the same time, it's quite common to hit this race condition. Without this PR, cluster admins had to manually intervene and resolve the race condition by removing the certificate and reapplying the kubernetes manifests or similar.

I tested by applying a certificate to three clusters simultaniously with

kubectl apply -f c.yaml --context cluster0 & \
kubectl apply -f c.yaml --context cluster1 & \
kubectl apply -f c.yaml --context cluster2

Kind

bug

Release Note

Fix CloudDNS issuers stuck in propagation check, when multiple instances are issuing for the same FQDN

@jetstack-bot jetstack-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: yes Indicates that all commits in the pull request have the valid DCO sign-off message. area/acme Indicates a PR directly modifies the ACME Issuer code area/acme/dns01 Indicates a PR modifies ACME DNS01 provider code needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 22, 2023
@jetstack-bot
Copy link
Collaborator

Hi @cypres. Thanks for your PR.

I'm waiting for a cert-manager member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jetstack-bot jetstack-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label May 22, 2023
@irbekrm
Copy link
Collaborator

irbekrm commented May 23, 2023

/ok-to-test

@jetstack-bot jetstack-bot added ok-to-test and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 23, 2023
@irbekrm irbekrm self-requested a review May 23, 2023 08:30
Copy link
Collaborator

@irbekrm irbekrm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working on this @cypres .

I agree that this is an issue that we should fix.

I see that the change in this PR makes it so that adding a TXT record, deletes and re-applies any other found records with the same FQDN via the changes API and the cleanup does the same.
I don't know how the changes API is implemented internally, but I am concerned that this could in practice cause a 'downtime' for the records that are re-added (in between them being deleted and added again).
Do you think we could instead use ResourceRecordSets:patch to patch RRSets that contain records that we don't want to delete?

Looking at the docs we already request permissions for dns.resourceRecordSets.* so this should not be a breaking change https://cert-manager.io/docs/configuration/acme/dns01/google/#set-up-a-service-account

@cypres
Copy link
Contributor Author

cypres commented May 31, 2023

Thank you for the feedback, I replied below.

I see that the change in this PR makes it so that adding a TXT record, deletes and re-applies any other found records with the same FQDN via the changes API and the cleanup does the same. I don't know how the changes API is implemented internally, but I am concerned that this could in practice cause a 'downtime' for the records that are re-added (in between them being deleted and added again).

I understand your concern, but that is actually how the changes API is intended to be used. The changes are applied atomically, from the Google documentation;
A Change represents a set of ResourceRecordSet additions and deletions applied atomically to a ManagedZone. ResourceRecordSets within a ManagedZone are modified by creating a new Change element in the Changes collection. In turn the Changes collection also records the past modifications to the ResourceRecordSets in a ManagedZone. The current state of the ManagedZone is the sum effect of applying all Change elements in the Changes collection in sequence.

Do you think we could instead use ResourceRecordSets:patch to patch RRSets that contain records that we don't want to delete?

Sadly, google has no way to patch the rrdata for a TXT record to add or remove a single value.
You can try this yourself with just using the gcloud cli.

# Create a dummy record with value "foo"
$ gcloud dns record-sets create foo.myzone.tld --rrdatas="foo" --ttl=60 --type=TXT --zone=test
NAME                        TYPE  TTL  DATA
foo.myzone.tld.  TXT   60   "foo"
# Patch a partial update
$ gcloud dns record-sets update foo.myzone.tld --rrdatas="bar" --ttl=60 --type=TXT --zone=test
NAME                        TYPE  TTL  DATA
foo.myzone.tld.  TXT   60   "bar"
# Inspect the record
$ gcloud dns record-sets describe foo.myzone.tld --type=TXT --zone=test
NAME                        TYPE  TTL  DATA
foo.myzone.tld.  TXT   60   "bar"

@irbekrm
Copy link
Collaborator

irbekrm commented May 31, 2023

I understand your concern, but that is actually how the changes API is intended to be used. The changes are applied atomically, from the Google documentation;
A Change represents a set of ResourceRecordSet additions and deletions applied atomically to a ManagedZone. ResourceRecordSets within a ManagedZone are modified by creating a new Change element in the Changes collection. In turn the Changes collection also records the past modifications to the ResourceRecordSets in a ManagedZone. The current state of the ManagedZone is the sum effect of applying all Change elements in the Changes collection in sequence.

Thanks for the detailed explanation. Could we please add a comment along the lines of this explanation to the relevant bits of code to prevent confusion in the future?

If running multiple certmanagers they can race against each other

Signed-off-by: Hans Arnholm <hans@arnholm.dk>
@cypres
Copy link
Contributor Author

cypres commented Jun 1, 2023

Thanks for the detailed explanation. Could we please add a comment along the lines of this explanation to the relevant bits of code to prevent confusion in the future?

Thank you. I expanded the comments a bit, let me know if you want something more long-form.

Copy link
Collaborator

@irbekrm irbekrm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working on this @cypres

I tested it with two clusters with a CloudDNS issuer and a cert with the same DNS name for both clusters.
I could reproduce the error (in my case, both Challenges ended up in an invalid state as one of them overwrote the value of another after the self-check had already validated it, the first one then failed ACME validation and deleted the text record which caused the second one to fail.
I saw both issuances succeed with this fix, observed TXT records in the acme challenge record, both issuances succeeded and the record got cleaned up afterwards as expected.

I think that we will not backport this- it is arguably almost like a new feature (we didn't really promise multicluster concurrency before), but it would be good to cut a v1.13 alpha release soon-ish so interested folks could try it out.

(/cc @aww-aww as I imagine that you folks might have seen this issue too and might be interested in the fix)

/lgtm


// Look for existing records.
list, err := c.client.ResourceRecordSets.List(c.project, zone).Name(fqdn).Type("TXT").Do()
if err != nil {
return err
}
if len(list.Rrsets) > 0 {
// Attempt to delete the existing records when adding our new one.
// Merge the existing RR Data into the new one, requires a delete and an add operation, or it will fail.
// The operations are applied atomically to the zone, so there is no point in time where the entire TXT record is deleted.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding that 👍🏼

@jetstack-bot jetstack-bot added the lgtm Indicates that a PR is ready to be merged. label Jun 2, 2023
@jetstack-bot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cypres, irbekrm, tobiasbrodersen

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jetstack-bot jetstack-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 2, 2023
@cert-manager cert-manager deleted a comment from jetstack-bot Jun 2, 2023
@irbekrm irbekrm added the kind/bug Categorizes issue or PR as related to a bug. label Jun 2, 2023
@jetstack-bot jetstack-bot removed the needs-kind Indicates a PR lacks a `kind/foo` label and requires one. label Jun 2, 2023
@cert-manager cert-manager deleted a comment from jetstack-bot Jun 2, 2023
@jetstack-bot jetstack-bot merged commit 7f519b3 into cert-manager:master Jun 2, 2023
6 checks passed
nrdufour added a commit to nrdufour/home-ops that referenced this pull request Sep 12, 2023
This PR contains the following updates:

| Package | Update | Change |
|---|---|---|
| [cert-manager](https://github.com/cert-manager/cert-manager) | minor | `v1.12.4` -> `v1.13.0` |

---

### Release Notes

<details>
<summary>cert-manager/cert-manager (cert-manager)</summary>

### [`v1.13.0`](https://github.com/cert-manager/cert-manager/releases/tag/v1.13.0)

[Compare Source](cert-manager/cert-manager@v1.12.4...v1.13.0)

cert-manager is the easiest way to automatically manage certificates in Kubernetes and OpenShift clusters.

This is the 1.13 release of cert-manager!

#### Community

Welcome to these new cert-manager members (more info - cert-manager/cert-manager#6260):
[@&#8203;jsoref](https://github.com/jsoref)
[@&#8203;FlorianLiebhart](https://github.com/FlorianLiebhart)
[@&#8203;hawksight](https://github.com/hawksight)
[@&#8203;erikgb](https://github.com/erikgb)

Thanks again to all open-source contributors with commits in this release, including:
[@&#8203;AcidLeroy](https://github.com/AcidLeroy)
[@&#8203;FlorianLiebhart](https://github.com/FlorianLiebhart)
[@&#8203;lucacome](https://github.com/lucacome)
[@&#8203;cypres](https://github.com/cypres)
[@&#8203;erikgb](https://github.com/erikgb)
[@&#8203;ubergesundheit](https://github.com/ubergesundheit)
[@&#8203;jkroepke](https://github.com/jkroepke)
[@&#8203;jsoref](https://github.com/jsoref)
[@&#8203;gdvalle](https://github.com/gdvalle)
[@&#8203;rouke-broersma](https://github.com/rouke-broersma)
[@&#8203;schrodit](https://github.com/schrodit)
[@&#8203;zhangzhiqiangcs](https://github.com/zhangzhiqiangcs)
[@&#8203;arukiidou](https://github.com/arukiidou)
[@&#8203;hawksight](https://github.com/hawksight)
[@&#8203;Richardds](https://github.com/Richardds)
[@&#8203;kahirokunn](https://github.com/kahirokunn)

Thanks also to the following cert-manager maintainers for their contributions during this release:
[@&#8203;SgtCoDFish](https://github.com/SgtCoDFish)
[@&#8203;maelvls](https://github.com/maelvls)
[@&#8203;irbekrm](https://github.com/irbekrm)
[@&#8203;inteon](https://github.com/inteon)

Equally thanks to everyone who provided feedback, helped users and raised issues on Github and Slack and joined our meetings!

Special thanks to [@&#8203;AcidLeroy](https://github.com/AcidLeroy) for adding "load options from a versioned config file" support for the cert-manager controller! This has been on our wishlist for a very long time. (see cert-manager/cert-manager#5337)

Also, thanks a lot to [@&#8203;FlorianLiebhart](https://github.com/FlorianLiebhart) for adding support for DNS over HTTPS for the ACME DNS self-check. This is very useful in case all traffic must be HTTP(S) trafic, eg. when using a HTTPS_PROXY. (see cert-manager/cert-manager#5003)

Thanks also to the [CNCF](https://www.cncf.io/), which provides resources and support, and to the AWS open source team for being good community members and for their maintenance of the [PrivateCA Issuer](https://github.com/cert-manager/aws-privateca-issuer).

In addition, massive thanks to [Venafi](https://www.venafi.com/) for contributing developer time and resources towards the continued maintenance of cert-manager projects.

#### Changes since v1.12.0

##### Feature

-   Add support for logging options to webhook config file. ([#&#8203;6243](cert-manager/cert-manager#6243), [@&#8203;inteon](https://github.com/inteon))
-   Add view permissions to the well-known (Openshift) user-facing `cluster-reader` aggregated cluster role ([#&#8203;6241](cert-manager/cert-manager#6241), [@&#8203;erikgb](https://github.com/erikgb))
-   Certificate Shim: distinguish dns names and ip address in certificate ([#&#8203;6267](cert-manager/cert-manager#6267), [@&#8203;zhangzhiqiangcs](https://github.com/zhangzhiqiangcs))
-   Cmctl can now be imported by third parties. ([#&#8203;6049](cert-manager/cert-manager#6049), [@&#8203;SgtCoDFish](https://github.com/SgtCoDFish))
-   Make `enableServiceLinks` configurable for all Deployments and `startupapicheck` Job in Helm chart. ([#&#8203;6292](cert-manager/cert-manager#6292), [@&#8203;ubergesundheit](https://github.com/ubergesundheit))
-   Promoted the StableCertificateRequestName and SecretsFilteredCaching feature gates to Beta (enabled by default). ([#&#8203;6298](cert-manager/cert-manager#6298), [@&#8203;inteon](https://github.com/inteon))
-   The cert-manager controller options are now configurable using a configuration file. ([#&#8203;5337](cert-manager/cert-manager#5337), [@&#8203;AcidLeroy](https://github.com/AcidLeroy))
-   The pki CertificateTemplate functions now perform validation of the CSR blob, making sure we sign a Certificate that matches the IsCA and (Extended)KeyUsages that are defined in the CertificateRequest resource. ([#&#8203;6199](cert-manager/cert-manager#6199), [@&#8203;inteon](https://github.com/inteon))
-   \[helm] Add prometheus.servicemonitor.endpointAdditionalProperties to define additional properties on a ServiceMonitor endpoint, e.g. relabelings ([#&#8203;6110](cert-manager/cert-manager#6110), [@&#8203;jkroepke](https://github.com/jkroepke))

##### Design

-   DNS over HTTPS (DoH) is now possible for doing the self-checks during the ACME verification.
    The DNS check method to be used is controlled through the command line flag: `--dns01-recursive-nameservers-only=true` in combination with `--dns01-recursive-nameservers=https://<DoH-endpoint>` (e.g. `https://8.8.8.8/dns-query`). It keeps using DNS lookup as a default method. ([#&#8203;5003](cert-manager/cert-manager#5003), [@&#8203;FlorianLiebhart](https://github.com/FlorianLiebhart))

##### Bug or Regression

-   Allow overriding default pdb .minAvailable with .maxUnavailable without setting .minAvailable to null ([#&#8203;6087](cert-manager/cert-manager#6087), [@&#8203;rouke-broersma](https://github.com/rouke-broersma))
-   BUGFIX: `cmctl check api --wait 0` exited without output and exit code 1; we now make sure we perform the API check at least once and return with the correct error code ([#&#8203;6109](cert-manager/cert-manager#6109), [@&#8203;inteon](https://github.com/inteon))
-   BUGFIX: the issuer and certificate-name annotations on a Secret were incorrectly updated when other fields are changed. ([#&#8203;6147](cert-manager/cert-manager#6147), [@&#8203;inteon](https://github.com/inteon))
-   BUGFIX\[cainjector]: 1-character bug was causing invalid log messages and a memory leak ([#&#8203;6232](cert-manager/cert-manager#6232), [@&#8203;inteon](https://github.com/inteon))
-   Fix CloudDNS issuers stuck in propagation check, when multiple instances are issuing for the same FQDN ([#&#8203;6088](cert-manager/cert-manager#6088), [@&#8203;cypres](https://github.com/cypres))
-   Fix indentation of Webhook NetworkPolicy matchLabels in helm chart. ([#&#8203;6220](cert-manager/cert-manager#6220), [@&#8203;ubergesundheit](https://github.com/ubergesundheit))
-   Fixed Cloudflare DNS01 challenge provider race condition when validating multiple domains ([#&#8203;6191](cert-manager/cert-manager#6191), [@&#8203;Richardds](https://github.com/Richardds))
-   Fixes a bug where webhook was pulling in controller's feature gates.
    ⚠️  ⚠️ BREAKING ⚠️ ⚠️ : If you deploy cert-manager using helm and have `.featureGates` value set, the features defined there will no longer be passed to cert-manager webhook, only to cert-manager controller. Use `webhook.featureGates` field instead to define features to be enabled on webhook.
    **Potentially breaking**: If you were, for some reason, passing cert-manager controller's features to webhook's `--feature-gates` flag, this will now break (unless the webhook actually has a feature by that name). ([#&#8203;6093](cert-manager/cert-manager#6093), [@&#8203;irbekrm](https://github.com/irbekrm))
-   Fixes an issue where cert-manager would incorrectly reject two IP addresses as being unequal when they should have compared equal. This would be most noticeable when using an IPv6 address which doesn't match how Go's `net.IP.String()` function would have printed that address. ([#&#8203;6293](cert-manager/cert-manager#6293), [@&#8203;SgtCoDFish](https://github.com/SgtCoDFish))
-   We disabled the `enableServiceLinks` option for our ACME http solver pods, because the option caused the pod to be in a crash loop in a cluster with lot of services. ([#&#8203;6143](cert-manager/cert-manager#6143), [@&#8203;schrodit](https://github.com/schrodit))
-   ⚠️ possibly breaking: Webhook validation of CertificateRequest resources is stricter now: all KeyUsages and ExtendedKeyUsages must be defined directly in the CertificateRequest resource, the encoded CSR can never contain more usages that defined there. ([#&#8203;6182](cert-manager/cert-manager#6182), [@&#8203;inteon](https://github.com/inteon))

##### Other (Cleanup or Flake)

-   A subset of the klogs flags have been deprecated and will be removed in the future. ([#&#8203;5879](cert-manager/cert-manager#5879), [@&#8203;maelvls](https://github.com/maelvls))
-   All service links in helm chart deployments have been disabled. ([#&#8203;6144](cert-manager/cert-manager#6144), [@&#8203;schrodit](https://github.com/schrodit))
-   Cert-manager will now re-issue a certificate if the public key in the latest CertificateRequest resource linked to a Certificate resource does not match the public key of the key encoded in the Secret linked to that Certificate resource ([#&#8203;6168](cert-manager/cert-manager#6168), [@&#8203;inteon](https://github.com/inteon))
-   Chore: When hostNetwork is enabled, dnsPolicy is now set to ClusterFirstWithHostNet. ([#&#8203;6156](cert-manager/cert-manager#6156), [@&#8203;kahirokunn](https://github.com/kahirokunn))
-   Cleanup the controller configfile structure by introducing sub-structs. ([#&#8203;6242](cert-manager/cert-manager#6242), [@&#8203;inteon](https://github.com/inteon))
-   Don't run API Priority and Fairness controller in webhook's extension apiserver ([#&#8203;6085](cert-manager/cert-manager#6085), [@&#8203;irbekrm](https://github.com/irbekrm))
-   Helm: Add apache 2.0 license annotation ([#&#8203;6225](cert-manager/cert-manager#6225), [@&#8203;arukiidou](https://github.com/arukiidou))
-   Make apis/acme/v1/ACMEIssuer.PreferredChain optional in JSON serialization. ([#&#8203;6034](cert-manager/cert-manager#6034), [@&#8203;gdvalle](https://github.com/gdvalle))
-   The SecretPostIssuancePolicyChain now also makes sure that the `cert-manager.io/common-name`, `cert-manager.io/alt-names`, ... annotations on Secrets are kept at their correct value. ([#&#8203;6176](cert-manager/cert-manager#6176), [@&#8203;inteon](https://github.com/inteon))
-   The cmctl logging has been improved and support for json logging has been added. ([#&#8203;6247](cert-manager/cert-manager#6247), [@&#8203;inteon](https://github.com/inteon))
-   Updates Kubernetes libraries to `v0.27.2`. ([#&#8203;6077](cert-manager/cert-manager#6077), [@&#8203;lucacome](https://github.com/lucacome))
-   Updates Kubernetes libraries to `v0.27.4`. ([#&#8203;6227](cert-manager/cert-manager#6227), [@&#8203;lucacome](https://github.com/lucacome))
-   We now only check that the issuer name, kind and group annotations on a Secret match in case those annotations are set. ([#&#8203;6152](cert-manager/cert-manager#6152), [@&#8203;inteon](https://github.com/inteon))

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update again.

---

 - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box

---

This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).
<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNi4yMy4yIiwidXBkYXRlZEluVmVyIjoiMzYuMjMuMiIsInRhcmdldEJyYW5jaCI6Im1haW4ifQ==-->

Reviewed-on: https://git.home/nrdufour/home-ops/pulls/84
Co-authored-by: Renovate <renovate@ptinem.io>
Co-committed-by: Renovate <renovate@ptinem.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/acme/dns01 Indicates a PR modifies ACME DNS01 provider code area/acme Indicates a PR directly modifies the ACME Issuer code dco-signoff: yes Indicates that all commits in the pull request have the valid DCO sign-off message. kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. ok-to-test release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants