Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add health probe that detects skew between system clock and monotonic go process clock #6328

Merged
merged 3 commits into from
Sep 27, 2023

Conversation

inteon
Copy link
Member

@inteon inteon commented Sep 11, 2023

Pull Request Motivation

This PR adds a new LivenessProbe that makes sure the clock skew is not more than 5 minutes.
If we detect that the monotonic duration since the start of the controller differs more than 5 minutes from the system clock duration since the start of the project, we return HTTP error code 500 and force a restart.

This check will fix the following issues:

  • if the controller was started at time A & a timer for time B is created, hibernating for duration h will cause the timer to trigger at time B + h
  • if the controller was started at time A & a timer for time B is created, changing the system clock (eg. fixing a wrong delta using ntp) will still result in the timer triggering at the original time

See golang/go#35012 for the background on this issue in upstream Go.

fixes #6341

Kind

/kind feature

Release Note

Added a clock skew detector liveness probe that will force a restart in case we detect a skew between the internal monotonic clock and the system clock of more than 5 minutes.
Also, the controller's liveness probe is now enabled by default.

@jetstack-bot jetstack-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. dco-signoff: yes Indicates that all commits in the pull request have the valid DCO sign-off message. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 11, 2023
@inteon inteon force-pushed the add_clock_health branch 3 times, most recently from 525f980 to af38894 Compare September 13, 2023 14:09
@jetstack-bot jetstack-bot added the area/deploy Indicates a PR modifies deployment configuration label Sep 13, 2023
@inteon inteon changed the title Add health probe that detects skew between 'real' clock and 'monotonic' clock Add health probe that detects skew between system clock and monotonic go process clock Sep 14, 2023
…onotonic' internal clock

Signed-off-by: Tim Ramlot <42113979+inteon@users.noreply.github.com>
Copy link
Member

@SgtCoDFish SgtCoDFish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel like I've written a lot here, but I like the approach generally. I think there are a few subtle bits here which might need a little tweaking though - what do you think?

deploy/charts/cert-manager/values.yaml Show resolved Hide resolved
pkg/healthz/clock_health.go Outdated Show resolved Hide resolved
pkg/healthz/clock_health.go Show resolved Hide resolved
pkg/healthz/clock_health.go Outdated Show resolved Hide resolved
pkg/healthz/clock_health.go Outdated Show resolved Hide resolved
pkg/healthz/clock_health.go Outdated Show resolved Hide resolved
pkg/healthz/clock_health.go Outdated Show resolved Hide resolved
Signed-off-by: Tim Ramlot <42113979+inteon@users.noreply.github.com>
Comment on lines 29 to 46
// The clockHealthAdaptor implements the HealthChecker interface.
// It checks the system clock is in sync with the internal monotonic clock.
// This is important because the internal monotonic clock is used to trigger certificate
// reconciles for renewals. If the monotonic clock is out of sync with the system clock
// then renewals might not be triggered in time. Ideally we would trigger renewals based
// on the system clock, but this is not (yet) possible in Go.
// See https://github.com/golang/go/issues/35012
//
// A clock skew can be caused by:
// 1. The system clock being adjusted
// -> this eg. happens when ntp adjusts the system clock
// 2. Pausing the process (e.g. with SIGSTOP)
// -> the monotonic clock will stop, but the system clock will continue
// -> this eg. happens when you pause a VM/ hibernate a laptop
//
// Small clock skews of < 1m are allowed, because they can happen when the system clock is
// adjusted. However, we do compound the clock skew over time, so that if the clock skew
// is small but constant, it will eventually fail the health check.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wonderfully written!

Copy link
Member

@SgtCoDFish SgtCoDFish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve
/hold

I think this looks great, the comments you've added are fantastic. I'm adding a hold for someone else to review this, since someone else might have useful input on the fact this is a livenessProbe.

I don't have a problem with it being a livenessProbe, but I might have missed a reason why it shouldn't be implemented in this way. Still, I think this would be fine to merge - we can always revert before 1.14 if someone realises this is a problem!

@jetstack-bot jetstack-bot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm Indicates that a PR is ready to be merged. labels Sep 20, 2023
@jetstack-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: SgtCoDFish

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jetstack-bot jetstack-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 20, 2023
@maelvls
Copy link
Member

maelvls commented Sep 26, 2023

I couldn't find the GitHub issue for this PR, so I'll share my thought here.

We return 500 and force a restart.

Why do we bother restarting the whole process with a liveness probe when we could simply re-queue all the certificates and be done? It seems like the liveness probe is overkill.

Also, is 500 ms arbitrary or does it related to something we observed?

Copy link
Member

@wallrj wallrj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not agree with this approach but I'm not blocking it.

  1. I think liveness probes should be reserved for recovering from problems which cause cert-manager to stop reconciling in a production environment.

  2. I do not think there is sufficient justification here for enabling liveness probes by default. Are we sure that the existing liveness probe is and healthz endpoints are production ready? The plan had always been to get user feedback before enabling them by default. https://cert-manager.io/docs/installation/best-practice/#controller

image
  1. As far as I remember, this problem affects a user who is testing cert-manager on a laptop and using a certificates with a lifetime where the calculated renewBefore time happens to be shorter than the time that the laptop was suspended. I think the risks of introducing this new (untested) code outweigh the benefit of fixing that specific problem.

  2. I still want to know how other operators that use Go time.After address this issue and indeed Kubernetes its self? How does the Kind Kubernetes API server refresh a ServiceAccount token on time, after it resumes from suspend.

@inteon
Copy link
Member Author

inteon commented Sep 26, 2023

@maelvls and @wallrj: The reason for using a LivenessProbe to solve this problem is that this is by far the simplest solution to implement that fixes this problem. Also, this approach will provide feedback (in terms of the restart counter) that something went wrong with cert-manager. The Liveness probe indicates that something is wrong with the cert-manager process and that a restart will fix the problem. A skewed clock should be something that does not happen frequently, that is why I think that restarting the process is good enough.

@wallrj
1 & 3: As described in the PR description, this is not just a problem on laptops, it also happens on "paused" VMs and maybe on starving CPUs. Also, I think we should make sure that cert-manager works on laptops too, so developers don't get the impression that cert-manager fails to renew certs.

2: We already have a liveness probe for the webhook, so enabling a webhook for the controller is not a big deal I think.

3: The code has been manually tested (by multiple parties), but is very hard to test automatically (would require changing the system clock or altering the go runtime). The implementation is very simple, so there shouldn't be much room for errors.

4: I think apps that have long-running time.After delays are affected more by this bug (because you might get delays that trigger N days too late, where N = min(hibernate time, time.After time) ). Also, it would be great if we were not the first point of failure.

@maelvls
Copy link
Member

maelvls commented Sep 26, 2023

enabling the liveness probe for the controller is not a big deal

It seems like our liveness probe documentation thinks otherwise; it explains that the leader election liveness probe shouldn't be needed "in theory" but that you may want to use it "in practice":

The liveness probe should only be needed if there is a bug in this orderly shutdown process, or if there is a bug in one of the other threads which causes the process to deadlock and not shutdown. You may want to enable the liveness probe anyway, for defense against unforeseen bugs and deadlocks, but you will need to monitor the processes closely and, tweak the various liveness probe time settings and thresholds, if necessary.

If we are about to use livenessProbe: true for everyone, let's rephrase this paragraph to explain why the leader election liveness probe exists:

"Although we tried our best to have each goroutine gracefully stop after the leader election fails, the many possible code paths make it hard to be certain that cert-manager will exit. If cert-manager doesn't exit and silently becomes a zombie, there is a high chance that will cause an outage since the recommended number of replicas is 1 and Kubernetes won't do anything to restart cert-manager."

(I found #5889 where cainjector ended up in a zombie state while trying to stop gracefully; that could have been prevented by a liveness check)

Now that some time has passed since we released the leader election liveness probe in 1.12, are we now feeling more confident about switching this on for everyone? I have searched for "liveness probe" in Slack and on GitHub and I did not find anyone complaining, although I wouldn't be able to know if anyone has actually used it.

@inteon
Copy link
Member Author

inteon commented Sep 27, 2023

Ok, I think that in combination with updating the website documentation, this PR can be merged?

Signed-off-by: Tim Ramlot <42113979+inteon@users.noreply.github.com>
@jetstack-bot jetstack-bot removed the lgtm Indicates that a PR is ready to be merged. label Sep 27, 2023
@inteon inteon requested a review from wallrj September 27, 2023 09:23
@inteon
Copy link
Member Author

inteon commented Sep 27, 2023

I also updated the max skew to 5 minutes (instead of 1 minute) to reduce the risk of unwanted restarts on weird systems: 5049cd4.

@SgtCoDFish
Copy link
Member

/lgtm

@jetstack-bot jetstack-bot added the lgtm Indicates that a PR is ready to be merged. label Sep 27, 2023
@inteon
Copy link
Member Author

inteon commented Sep 27, 2023

/unhold

@jetstack-bot jetstack-bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 27, 2023
@jetstack-bot jetstack-bot merged commit 8aafddb into cert-manager:master Sep 27, 2023
6 checks passed
nrdufour added a commit to nrdufour/home-ops that referenced this pull request Feb 4, 2024
This PR contains the following updates:

| Package | Update | Change |
|---|---|---|
| [cert-manager](https://github.com/cert-manager/cert-manager) | minor | `v1.13.3` -> `v1.14.1` |

---

### Release Notes

<details>
<summary>cert-manager/cert-manager (cert-manager)</summary>

### [`v1.14.1`](https://github.com/cert-manager/cert-manager/releases/tag/v1.14.1)

[Compare Source](cert-manager/cert-manager@v1.14.0...v1.14.1)

cert-manager is the easiest way to automatically manage certificates in Kubernetes and OpenShift clusters.

cert-manager 1.14 brings a variety of features, security improvements and bug fixes, including: support for creating X.509 certificates with "Other Name" fields, and support for creating CA certificates with "Name Constraints" and "Authority Information Accessors" extensions.

> 📢 cert-manager `v1.14.1` fixes bugs found *during* the release of `v1.14.0`.
>
> When upgrading to cert-manager release 1.14, please skip `v1.14.0` and install this patch version instead.

##### Documentation

-   [Release notes](https://cert-manager.io/docs/releases/release-notes/release-notes-1.14)
-   [Upgrade notes](https://cert-manager.io/docs/releases/upgrading/upgrading-1.13-1.14)
-   [Installation instructions](https://cert-manager.io/docs/installation/)

##### Changes since `v1.14.0`

##### Bug or Regression

-   Fix broken cainjector image value in Helm chart ([#&#8203;6693](cert-manager/cert-manager#6693), [@&#8203;SgtCoDFish](https://github.com/SgtCoDFish))
-   Fix bug in cmctl namespace detection which prevented it being used as a startupapicheck image in namespaces other than cert-manager. ([#&#8203;6706](cert-manager/cert-manager#6706), [@&#8203;inteon](https://github.com/inteon))
-   Fix bug in cmctl which caused `cmctl experimental install` to panic. ([#&#8203;6706](cert-manager/cert-manager#6706), [@&#8203;inteon](https://github.com/inteon))

### [`v1.14.0`](https://github.com/cert-manager/cert-manager/releases/tag/v1.14.0)

[Compare Source](cert-manager/cert-manager@v1.13.3...v1.14.0)

cert-manager is the easiest way to automatically manage certificates in Kubernetes and OpenShift clusters.

cert-manager 1.14 brings a variety of features, security improvements and bug fixes, including: support for creating X.509 certificates with "Other Name" fields, and support for creating CA certificates with "Name Constraints" and "Authority Information Accessors" extensions.

> ⚠️ This version has known issues. Please install `v1.14.1` instead.
>
> During the release of `v1.14.0`, the Helm chart was found to use the wrong OCI image for the `cainjector` Deployment,
> which caused the Helm installation and the static manifest based installation to fail.
> Upon discovery of this bug, the release of `v1.14.0` was paused before the Helm chart or GitHub release were published;
> but the Git tag and the OCI images had already been published.
>
> The cert-manager team next fixed the Helm chart and two other bugs which are listed in the "Known Issues" section below,
> and then released `v1.14.1`, which is the version that users are strongly advised to install when they upgrade to 1.14.
>
> In order to complete the stalled `v1.14.0` release,
> the Helm chart and static YAML installation files were regenerated on a team member's laptop,
> using exactly the same build scripts as are used in the automated release process,
> and using the `v1.14.1` version of the code.
> The working  `v1.14.0` Helm chart was published,
> and the working versions of the static manifest files attached to the draft `v1.14.0` GitHub release,
> and that was then published.
>
> For these reasons, users are strongly advised to skip this version and install the `v1.14.1` Helm chart instead.

##### Known Issues

-   During the release of `v1.14.0`, the Helm chart for this version was found to use the wrong OCI image for the `cainjector` Deployment,
    which caused the Helm installation to fail.
    In order to complete the release, the cert-manager team have manually updated the Helm chart for this version,
    which contains all the Helm chart fixes which are in `v1.14.1`.
    But users are strongly advised to skip this version and install the `v1.14.1` Helm chart instead.
-   A bug in cmctl namespace detection prevents it being used as a `startupapicheck` image in namespaces other than cert-manager.
-   A bug in cmctl causes `cmctl experimental install` to panic.

##### Breaking Changes

The startupapicheck job uses a new OCI image called "startupapicheck", instead of the ctl image.
If you run in an environment in which images cannot be pulled, be sure to include the new image.

The KeyUsage and BasicConstraints extensions will now be encoded as critical in the CertificateRequest's CSR blob.

##### Major Themes

##### New X.509 Features

The cert-manager Certificate resource now allows you to configure a subset of "Other Name" SANs,
which are described in the [Subject Alternative Name section of RFC 5280](https://datatracker.ietf.org/doc/html/rfc5280#section-4.2.1.6) (on page 37).

We specifically support any `otherName` type with a `UTF-8` value, such as the [User Principal Name](https://docs.venafi.com/Docs/current/TopNav/Content/Certificates/r-UEP-support-SANs.php) or [`sAMAccountName`](https://learn.microsoft.com/en-us/windows/win32/ad/naming-properties).
These are useful when issuing unique certificates for authenticating with LDAP systems such as Microsoft Active Directory.
For example you can create certificates with this block in the spec:

      otherNames:
        - oid: 1.3.6.1.4.1.311.20.2.3 # UPN OID
          utf8Value: upn@domain.local

The feature is still in alpha stage and requires you to [enable the `OtherName` feature flag in the controller and webhook components](../../installation/configuring-components.md#feature-gates).

##### New CA certificate Features

You can now specify the X.509 v3 Authority Information Accessors extension,
with URLs for certificates issued by the CA issuer.

Users can now use name constraints in CA certificates.
To know more details on name constraints check out RFC section https://datatracker.ietf.org/doc/html/rfc5280#section-4.2.1.10

##### Security

An ongoing security audit of the cert-manager code revealed some weaknesses which we have addressed in this release,
such as using more secure default settings in the HTTP servers that serve metrics, healthz and pprof endpoints.
This will help mitigate denial-of-service attacks against those important services.

All the cert-manager containers are now configured with read only root file system by default,
to prevent unexpected changes to the file system of the OCI image.

And it is now possible to configure the metrics server to use HTTPS rather than HTTP,
so that clients can verify the identity of the metrics server.

##### Other

The liveness probe of the cert-manager controller Pod is now enabled by default.

There is a new option `.spec.keystores.pkcs12.algorithms` to specify encryption and MAC algorithms for PKCS.

##### Community

Thanks again to all open-source contributors with commits in this release, including:

-   [@&#8203;ABWassim](https://github.com/ABWassim)
-   [@&#8203;JoeNorth](https://github.com/JoeNorth)
-   [@&#8203;allenmunC1](https://github.com/allenmunC1)
-   [@&#8203;asapekia](https://github.com/asapekia)
-   [@&#8203;jeremycampbell](https://github.com/jeremycampbell)
-   [@&#8203;jkroepke](https://github.com/jkroepke)
-   [@&#8203;jsoref](https://github.com/jsoref)
-   [@&#8203;lauraseidler](https://github.com/lauraseidler)
-   [@&#8203;pevidex](https://github.com/pevidex)
-   [@&#8203;phillebaba](https://github.com/phillebaba)
-   [@&#8203;snorwin](https://github.com/snorwin)
-   [@&#8203;tanujd11](https://github.com/tanujd11)
-   [@&#8203;tberreis](https://github.com/tberreis)
-   [@&#8203;vinny](https://github.com/vinny)

Thanks also to the following cert-manager maintainers for their contributions during this release:

-   [@&#8203;SgtCoDFish](https://github.com/SgtCoDFish)
-   [@&#8203;SpectralHiss](https://github.com/SpectralHiss)
-   [@&#8203;ThatsMrTalbot](https://github.com/ThatsMrTalbot)
-   [@&#8203;hawksight](https://github.com/hawksight)
-   [@&#8203;inteon](https://github.com/inteon)
-   [@&#8203;maelvls](https://github.com/maelvls)
-   [@&#8203;wallrj](https://github.com/wallrj)

Equally thanks to everyone who provided feedback, helped users and raised issues on GitHub and Slack and joined our meetings!

Thanks also to the [CNCF](https://www.cncf.io/), which provides resources and support, and to the AWS open source team for being good community members and for their maintenance of the [PrivateCA Issuer](https://github.com/cert-manager/aws-privateca-issuer).

In addition, massive thanks to [Venafi](https://www.venafi.com/) for contributing developer time and resources towards the continued maintenance of cert-manager projects.

##### Changes

##### Feature

-   ACME challenge solver Pod for HTTP01 will get a default annotation of `"cluster-autoscaler.kubernetes.io/safe-to-evict": "true"`. You can provide an annotation of `"cluster-autoscaler.kubernetes.io/safe-to-evict": "false"` in your `podTemplate` if you don't like this. ([#&#8203;6349](cert-manager/cert-manager#6349), [@&#8203;jsoref](https://github.com/jsoref))
-   Added a clock skew detector liveness probe that will force a restart in case we detect a skew between the internal monotonic clock and the system clock of more than 5 minutes.
    Also, the controller's liveness probe is now enabled by default. ([#&#8203;6328](cert-manager/cert-manager#6328), [@&#8203;inteon](https://github.com/inteon))
-   Added a new flag (--dynamic-serving-leaf-duration) that can adjust the lifetime of the dynamic leaf certificates ([#&#8203;6552](cert-manager/cert-manager#6552), [@&#8203;allenmunC1](https://github.com/allenmunC1))
-   Added support for `otherName` SANS in Certificates ([#&#8203;6404](cert-manager/cert-manager#6404), [@&#8203;SpectralHiss](https://github.com/SpectralHiss))
-   Added the option to specify the  X.509 v3 Authority Information Accessors extension CA Issuers URLs for certificates issued by the CA issuer. ([#&#8203;6486](cert-manager/cert-manager#6486), [@&#8203;jeremycampbell](https://github.com/jeremycampbell-okta))
-   Adds cert-manager's new core infrastructure initiative badge! See more details on https://www.bestpractices.dev/projects/8079 ([#&#8203;6497](cert-manager/cert-manager#6497), [@&#8203;SgtCoDFish](https://github.com/SgtCoDFish))
-   All Pods are now configured with `readOnlyRootFilesystem` by default. ([#&#8203;6453](cert-manager/cert-manager#6453), [@&#8203;wallrj](https://github.com/wallrj))
-   MAYBE BREAKING: The startupapicheck job is now handled by an entirely new container called "startupapicheck". This replaces the previous ctl container. If you run in an environment in which images cannot be pulled, be sure to include the new container. ([#&#8203;6549](cert-manager/cert-manager#6549), [@&#8203;SgtCoDFish](https://github.com/SgtCoDFish))
-   New option `.spec.keystores.pkcs12.algorithms` to specify encryption and MAC algorithms for PKCS[#&#8203;12](cert-manager/cert-manager#12) keystores. Fixes issues [#&#8203;5957](cert-manager/cert-manager#5957) and [#&#8203;6523](cert-manager/cert-manager#6523). ([#&#8203;6548](cert-manager/cert-manager#6548), [@&#8203;snorwin](https://github.com/snorwin))
-   The ACME HTTP01 solver Pod is now configured with `readOnlyRootFilesystem: true` ([#&#8203;6462](cert-manager/cert-manager#6462), [@&#8203;wallrj](https://github.com/wallrj))
-   Updates the AWS SDK for Go to 1.48.7 to support Amazon EKS Pod Identity ([#&#8203;6519](cert-manager/cert-manager#6519), [@&#8203;JoeNorth](https://github.com/JoeNorth))
-   Users can now use name constraints in CA certificates. To know more details on name constraints check out RFC section https://datatracker.ietf.org/doc/html/rfc5280#section-4.2.1.10 ([#&#8203;6500](cert-manager/cert-manager#6500), [@&#8203;tanujd11](https://github.com/tanujd11))
-   ⚠️ potentially breaking ⚠️: The KeyUsage and BasicConstraints extensions will now be encoded as critical in the CertificateRequest's CSR blob. ([#&#8203;6053](cert-manager/cert-manager#6053), [@&#8203;inteon](https://github.com/inteon))
-   Add TLS support to the metrics endpoint through either a certificate file or through dynamically issued certificates ([#&#8203;6574](cert-manager/cert-manager#6574), [@&#8203;ThatsMrTalbot](https://github.com/ThatsMrTalbot))
-   Helm Chart: allow changing the default Deployment `revisionHistoryLimit` ([#&#8203;6248](cert-manager/cert-manager#6248), [@&#8203;tberreis](https://github.com/tberreis))
-   Security: Limit the size of the response body read from HTTP requests by cert-manager. ([#&#8203;6619](cert-manager/cert-manager#6619), [@&#8203;ThatsMrTalbot](https://github.com/ThatsMrTalbot))
-   Support custom `spec.namespaceSelector` for webhooks ([#&#8203;6638](cert-manager/cert-manager#6638), [@&#8203;jkroepke](https://github.com/jkroepke))

##### Bug or Regression

-   BUGFIX\[helm]: Fix issue where webhook feature gates were only set if controller feature gates are set. ([#&#8203;6380](cert-manager/cert-manager#6380), [@&#8203;asapekia](https://github.com/asapekia))
-   Controller ConfigMap is now created only if `.Values.config` is set. ([#&#8203;6357](cert-manager/cert-manager#6357), [@&#8203;ABWassim](https://github.com/ABWassim))
-   Fix runaway bug caused by multiple Certificate resources that point to the same Secret resource. ([#&#8203;6406](cert-manager/cert-manager#6406), [@&#8203;inteon](https://github.com/inteon))
-   Fix(helm): templating of required value in controller and webhook ConfigMap resources ([#&#8203;6435](cert-manager/cert-manager#6435), [@&#8203;ABWassim](https://github.com/ABWassim))
-   Fixed a webhook validation error message when the key algorithm was invalid. ([#&#8203;6571](cert-manager/cert-manager#6571), [@&#8203;pevidex](https://github.com/pevidex))
-   Fixed error messaging when setting up vault issuer ([#&#8203;6433](cert-manager/cert-manager#6433), [@&#8203;vinny](https://github.com/vinny-sabatini))
-   `GHSA-vgf6-pvf4-34rq`: The webhook server now returns HTTP error 413 (Content Too Large) for requests with body size `>= 3MiB`. This is to mitigate DoS attacks that attempt to crash the webhook process by sending large requests that exceed the available memory.
    The webhook server now returns HTTP error 400 (Bad Request) if the request contains an empty body.
    The webhook server now returns HTTP error 500 (Internal Server Error) rather than crashing, if the code panics while handling a request. ([#&#8203;6498](cert-manager/cert-manager#6498), [@&#8203;inteon](https://github.com/inteon))
-   Increase the default webhook timeout to its maximum value of 30 seconds, so that the underlying timeout error message has more chance of being returned to the end user. ([#&#8203;6488](cert-manager/cert-manager#6488), [@&#8203;wallrj](https://github.com/wallrj))
-   Listeners that do not support TLS on Gateway resources will now not raise `BadConfig` warnings anymore ([#&#8203;6347](cert-manager/cert-manager#6347), [@&#8203;lauraseidler](https://github.com/lauraseidler))
-   Mitigate potential Slowloris attacks by setting `ReadHeaderTimeout` in all `http.Server` instances ([#&#8203;6534](cert-manager/cert-manager#6534), [@&#8203;wallrj](https://github.com/wallrj))
-   The Venafi issuer now properly resets the certificate and should no longer get stuck with `WebSDK CertRequest Module Requested Certificate` or `This certificate cannot be processed while it is in an error state. Fix any errors, and then click Retry.`. ([#&#8203;6398](cert-manager/cert-manager#6398), [@&#8203;maelvls](https://github.com/maelvls))
-   Update experimental install and uninstall commands to have flag parity with the rest of the CLI ([#&#8203;6562](cert-manager/cert-manager#6562), [@&#8203;ThatsMrTalbot](https://github.com/ThatsMrTalbot))
-   Webhook ConfigMap if now created only if `.Values.webhook.config` is set. ([#&#8203;6360](cert-manager/cert-manager#6360), [@&#8203;ABWassim](https://github.com/ABWassim))
-   BUGFIX: Ensure `otherName` SAN changes in Certificate resources trigger re-issuance. ([#&#8203;6620](cert-manager/cert-manager#6620), [@&#8203;SpectralHiss](https://github.com/SpectralHiss))
-   Bugfix: Publish the `startupapicheck` image to `quay.io` ([#&#8203;6609](cert-manager/cert-manager#6609), [@&#8203;wallrj](https://github.com/wallrj))

##### Other (Cleanup or Flake)

-   Cert-manager is now built with Go 1.21.5 ([#&#8203;6545](cert-manager/cert-manager#6545), [@&#8203;wallrj](https://github.com/wallrj))
-   Bump Go to `1.21.3` to address `CVE-2023-39325`. Also bumps base images. ([#&#8203;6410](cert-manager/cert-manager#6410), [@&#8203;SgtCoDFish](https://github.com/SgtCoDFish))
-   Bump `golang.org/x/net v0.15.0 => v0.17.0` as part of addressing `CVE-2023-44487` / `CVE-2023-39325` ([#&#8203;6427](cert-manager/cert-manager#6427), [@&#8203;SgtCoDFish](https://github.com/SgtCoDFish))
-   Check code for unintended use of `crypto/md5`, a weak cryptographic primitive; using `golangci-lint` / `gosec` (G501). ([#&#8203;6581](cert-manager/cert-manager#6581), [@&#8203;wallrj](https://github.com/wallrj))
-   Check code for unintended use of `crypto/sha1`, a weak cryptographic primitive; using `golangci-lint` / `gosec` (G505). ([#&#8203;6579](cert-manager/cert-manager#6579), [@&#8203;wallrj](https://github.com/wallrj))
-   Check code for unintended use of weak random number generator (`math/rand` instead of `crypto/rand`); using `golangci-lint` / `gosec` (G404). ([#&#8203;6582](cert-manager/cert-manager#6582), [@&#8203;wallrj](https://github.com/wallrj))
-   Cleanup: Restrict MutatingWebhookConfiguration to only CertificateRequest resources ([#&#8203;6311](cert-manager/cert-manager#6311), [@&#8203;hawksight](https://github.com/hawksight))
-   Deprecated `pkg/util.RandStringRunes` and `pkg/controller/test.RandStringBytes`. Use `k8s.io/apimachinery/pkg/util/rand.String` instead. ([#&#8203;6585](cert-manager/cert-manager#6585), [@&#8203;wallrj](https://github.com/wallrj))
-   Enabled verbose logging in startupapicheck by default, so that if it fails, users can know exactly what caused the failure. ([#&#8203;6495](cert-manager/cert-manager#6495), [@&#8203;wallrj](https://github.com/wallrj))
-   Fix gosec G601: Implicit memory aliasing of items from a range statement ([#&#8203;6551](cert-manager/cert-manager#6551), [@&#8203;wallrj](https://github.com/wallrj))
-   Fix handling of serial numbers in literal certificate subjects. Previously a serial number could be specified in `subject.serialNumber` while using a literal certificate subject. This was a mistake and has been fixed. ([#&#8203;6533](cert-manager/cert-manager#6533), [@&#8203;inteon](https://github.com/inteon))
-   The end-to-end tests can now test the cert-manager Vault Issuer on an OpenShift cluster. ([#&#8203;6391](cert-manager/cert-manager#6391), [@&#8203;wallrj](https://github.com/wallrj))
-   Update cert-manager's distroless base images from Debian 11 to Debian 12. This should have no practical effects on users. ([#&#8203;6583](cert-manager/cert-manager#6583), [@&#8203;inteon](https://github.com/inteon))
-   Updated all code using GatewayAPI to use the now GA v1 APIs ([#&#8203;6559](cert-manager/cert-manager#6559), [@&#8203;ThatsMrTalbot](https://github.com/ThatsMrTalbot))
-   Upgrade Go from 1.20.7 to 1.20.8. ([#&#8203;6369](cert-manager/cert-manager#6369), [@&#8203;inteon](https://github.com/inteon))
-   Upgrade `github.com/emicklei/go-restful/v3` to `v3.11.0` because `v3.10.2` is labeled as "DO NOT USE". ([#&#8203;6366](cert-manager/cert-manager#6366), [@&#8203;inteon](https://github.com/inteon))
-   Use the new generic `sets.Set` type in place of the deprecated `sets.String`. ([#&#8203;6586](cert-manager/cert-manager#6586), [@&#8203;wallrj](https://github.com/wallrj))
-   cert-manager is now built with Go `v1.21.6` ([#&#8203;6628](cert-manager/cert-manager#6628), [@&#8203;SgtCoDFish](https://github.com/SgtCoDFish))
-   Update the Azure SDK and remove deprecated `autorest` dependency ([#&#8203;5452](cert-manager/cert-manager#5452), [@&#8203;phillebaba](https://github.com/phillebaba))
-   The cert-manager E2E tests can now be run on Kubernetes 1.29 ([#&#8203;6641](cert-manager/cert-manager#6641), [@&#8203;wallrj](https://github.com/wallrj))

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update again.

---

 - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box

---

This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).
<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNy4xNjUuMCIsInVwZGF0ZWRJblZlciI6IjM3LjE2NS4wIiwidGFyZ2V0QnJhbmNoIjoibWFpbiJ9-->

Reviewed-on: https://git.home/nrdufour/home-ops/pulls/358
Co-authored-by: Renovate <renovate@ptinem.io>
Co-committed-by: Renovate <renovate@ptinem.io>
@wallrj
Copy link
Member

wallrj commented Feb 14, 2024

I just experienced this feature for the first time, on a Kind cluster on my laptop which went to sleep overnight.

It seems to work well. Thanks @inteon.

[pod/cert-manager-5957b48cf5-t9gtf/cert-manager-controller] 2024-02-14T09:08:37.593525781Z I0214 09:08:37.593176       1 healthz.go:261] clockHealth check failed: livez
[pod/cert-manager-5957b48cf5-t9gtf/cert-manager-controller] 2024-02-14T09:08:37.593616907Z [-]clockHealth failed: the system clock is out of sync with the internal monotonic clock by 13h6m36.228644807s, which is more than the allowed 5m0s
[pod/cert-manager-5957b48cf5-t9gtf/cert-manager-controller] 2024-02-14T09:08:47.593866157Z I0214 09:08:47.593575       1 healthz.go:261] clockHealth check failed: livez
[pod/cert-manager-5957b48cf5-t9gtf/cert-manager-controller] 2024-02-14T09:08:47.593899671Z [-]clockHealth failed: the system clock is out of sync with the internal monotonic clock by 13h6m36.228644816s, which is more than the allowed 5m0s
[pod/cert-manager-5957b48cf5-t9gtf/cert-manager-controller] 2024-02-14T09:08:57.592977738Z I0214 09:08:57.592675       1 healthz.go:261] clockHealth check failed: livez
[pod/cert-manager-5957b48cf5-t9gtf/cert-manager-controller] 2024-02-14T09:08:57.593085064Z [-]clockHealth failed: the system clock is out of sync with the internal monotonic clock by 13h6m36.228644809s, which is more than the allowed 5m0s
[pod/cert-manager-5957b48cf5-t9gtf/cert-manager-controller] 2024-02-14T09:08:58.189880206Z W0214 09:08:58.189257       1 client_config.go:618] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
[pod/cert-manager-5957b48cf5-t9gtf/cert-manager-controller] 2024-02-14T09:08:58.203968628Z I0214 09:08:58.203730       1 options.go:248] "enabling all experimental certificatesigningrequest controllers" logger="cert-manager"
[pod/cert-manager-5957b48cf5-t9gtf/cert-manager-controller] 2024-02-14T09:08:58.203993138Z I0214 09:08:58.203770       1 options.go:253] "enabling the sig-network Gateway API certificate-shim and HTTP-01 solver" logger="cert-manager"
[pod/cert-manager-5957b48cf5-t9gtf/cert-manager-controller] 2024-02-14T09:08:58.203998248Z I0214 09:08:58.203833       1 controller.go:89] "enabled controllers: [certificaterequests-approver certificaterequests-issuer-acme certificaterequests-issuer-ca certificaterequests-issuer-selfsigned certificaterequests-issuer-vault certificaterequests-issuer-venafi certificates-issuing certificates-key-manager certificates-metrics certificates-readiness certificates-request-manager certificates-revision-manager certificates-trigger certificatesigningrequests-issuer-acme certificatesigningrequests-issuer-ca certificatesigningrequests-issuer-selfsigned certificatesigningrequests-issuer-vault certificatesigningrequests-issuer-venafi challenges clusterissuers gateway-shim ingress-shim issuers orders]" logger="cert-manager.controller"
[pod/cert-manager-5957b48cf5-t9gtf/cert-manager-controller] 2024-02-14T09:08:58.209794381Z I0214 09:08:58.209439       1 leaderelection.go:250] attempting to acquire leader lease kube-system/cert-manager-controller...
[pod/cert-manager-5957b48cf5-t9gtf/cert-manager-controller] 2024-02-14T09:08:58.216509192Z I0214 09:08:58.216126       1 leaderelection.go:260] successfully acquired lease kube-system/cert-manager-controller

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/deploy Indicates a PR modifies deployment configuration dco-signoff: yes Indicates that all commits in the pull request have the valid DCO sign-off message. kind/feature Categorizes issue or PR as related to a new feature. lgtm Indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cert-manager should reconcile expired certificate on resume from system sleep
5 participants