
Liveness and Readiness Probes Consistently Failing #824

Open
throwanexception opened this issue Jun 7, 2023 · 3 comments

Labels
bug Something isn't working

Comments

@throwanexception
Description
We're testing out the policy-controller, and the readiness and liveness probes for the cosign-policy-controller-webhook begin to fail after an extended period (~18-24 hours). Until then, the deployment appears to work correctly.

44m         Warning   Unhealthy          pod/cosign-policy-controller-webhook-bc7d858f6-49mz7   Readiness probe failed: Get "https://x:8443/readyz": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
7s          Warning   BackOff            pod/cosign-policy-controller-webhook-bc7d858f6-49mz7   Back-off restarting failed container
14m         Warning   Unhealthy          pod/cosign-policy-controller-webhook-bc7d858f6-49mz7   Liveness probe failed: Get "https://x:8443/healthz": read tcp x:56308->x:8443: read: connection reset by peer
40m         Warning   Unhealthy          pod/cosign-policy-controller-webhook-bc7d858f6-m8phs   Liveness probe failed: Get "https://x:8443/healthz": net/http: request canceled (Client.Timeout exceeded while awaiting headers)

After this, the webhook pods crash and restart every few minutes.
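One way to tell whether these restarts are liveness kills or OOM kills (a hypothetical check; the pod name is a placeholder) is to look at the last termination state the kubelet recorded for the container:

# Prints the reason the container last terminated; "OOMKilled" points at
# memory limits, while liveness-triggered kills typically show "Error"
# with exit code 137.
kubectl -n cosign-system get pod <webhook-pod> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'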

We've also noticed errors about the image digest:
'admission webhook 'policy.sigstore.dev' denied the request: validation failed: invalid value: (pods) must be an image digest: spec.template.spec.containers[0].image'.

Upon retry, it will (usually) resolve the image to a digest correctly.

Our setup uses IRSA to attach the WebIdentityToken to the pod. This is natively supported by go-containerregistry and seems to work correctly here, but we're unsure whether it's related. The images we're pulling are from ECR, so the IRSA WebIdentityToken provides the permissions to access them.
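As a sanity check on that credential path (a sketch, not something from this thread; the registry, repository, and tag are placeholders), the same tag-to-digest resolution the webhook performs can be exercised with crane from go-containerregistry, run in a pod bound to the same IRSA role. This assumes a crane build that includes the ECR credential helper, or docker-credential-ecr-login configured:

# Resolve a tag to its digest against ECR using the pod's ambient
# AWS credentials; repeated slow or failing resolutions here would
# mirror the webhook's "must be an image digest" errors.
crane digest 123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp:latest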

The image policy we're using is a single ECDSA P-256 public key to verify our images, so it seems unlikely to be related.

Our clusters are quite active, especially with the constant synthetic health checks we run, so images are pulled frequently for end-to-end testing. I enabled Knative debug logging by changing the ConfigMaps for the services, but the debug output has not been helpful so far.
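For reference, a minimal sketch of that ConfigMap change, assuming the controller ships the standard Knative config-logging ConfigMap in the cosign-system namespace and that the webhook's component key is loglevel.webhook (both are assumptions about this deployment):

# Raise the per-component log level; knative.dev/pkg watches
# loglevel.<component> keys in config-logging and applies them live.
kubectl -n cosign-system patch configmap config-logging \
  -p '{"data":{"loglevel.webhook":"debug"}}'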

Any guidance or help would be appreciated!

Version
v0.7.0 of the policy-controller

@throwanexception throwanexception added the bug Something isn't working label Jun 7, 2023
@throwanexception throwanexception changed the title from "Liveness and Readiness Probes Consistently Failing After Extended Time" to "Liveness and Readiness Probes Consistently Failing" Jun 7, 2023
@hectorj2f
Collaborator

@throwanexception I'd recommend upgrading to our latest version; we've simplified the deployment to use a single webhook. Could you verify whether you still see the crash there?

@throwanexception
Author

> @throwanexception I'd recommend upgrading to our latest version; we've simplified the deployment to use a single webhook. Could you verify whether you still see the crash there?

After about a week of constant usage with the v0.8.0 release on our clusters, we're seeing the same issue I reported for v0.7.0. The policy-controller begins to time out readiness / liveness probes and is restarted by the kubelet. We also see the same image-digest errors when this occurs. From what I can observe, memory usage is growing unbounded (possibly a leak?):

$ kubectl top pod -n cosign-system
NAME                                 CPU(cores)   MEMORY(bytes)
policy-controller-555465fd55-g67kc   1332m        1730Mi
policy-controller-555465fd55-mmtw4   1166m        1282Mi
policy-controller-555465fd55-qpcrt   948m         1511Mi
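
If the growth is a leak, a heap profile would show where the allocations are held. A sketch of one way to get it, assuming the policy-controller honors the knative.dev/pkg profiling switch in config-observability and serves pprof on port 8008 (both assumptions about this build):

# Turn on the Knative profiling server.
kubectl -n cosign-system patch configmap config-observability \
  -p '{"data":{"profiling.enable":"true"}}'

# Forward the profiling port from one pod and capture a heap profile.
kubectl -n cosign-system port-forward pod/policy-controller-555465fd55-g67kc 8008:8008 &
go tool pprof http://localhost:8008/debug/pprof/heap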

@hectorj2f
Collaborator

> 'admission webhook 'policy.sigstore.dev' denied the request: validation failed: invalid value: (pods) must be an image digest: spec.template.spec.containers[0].image'.

This error is expected whenever the image reference cannot be resolved to a digest.

Regarding the growing memory usage, I'd watch the logs to identify what is going on in the controller. We're using the policy-controller in our cluster and we haven't experienced this memory-growth behaviour.

> The policy-controller begins to time out readiness / liveness probes and is restarted by the kubelet

This is odd. I'd try relaxing the liveness / readiness probe values to see whether the failures are a symptom of the growing memory or CPU consumption.
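A minimal sketch of loosening the probes with a strategic merge patch; the deployment and container names and the values here are assumptions, not project defaults:

# Tolerate slower responses and more consecutive failures before
# the kubelet marks the pod unready or restarts it.
kubectl -n cosign-system patch deployment policy-controller -p '
spec:
  template:
    spec:
      containers:
      - name: policy-controller
        livenessProbe:
          timeoutSeconds: 10
          failureThreshold: 6
        readinessProbe:
          timeoutSeconds: 10
          failureThreshold: 6'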

Did you see this memory growth with v0.7.0 too?
