
Error for failed revision is not reported due to scaling to zero #14157

Open
skonto opened this issue Jul 10, 2023 · 5 comments · May be fixed by #14835
Labels
area/API, area/autoscale, kind/bug, triage/accepted

Comments

@skonto
Contributor

skonto commented Jul 10, 2023

In what area(s)?

/area autoscale
/area api

What version of Knative?

Any.

Expected Behavior

AFAIK, when a new revision is created the deployment's replica count is set to 1 by default, so that the revision scales up and its health can be checked (see discussion here). A customer has the following scenario:

  • Creates a revision that comes up without a problem and then scales down to zero as there is no traffic.
  • Makes the revision image private or removes it from their internal registry.
  • Issues a curl command that gets stuck (for some time), since no pod can come up due to the image pull error.
  • The revision scales down to zero, but no error is shown, or at least the user has no specific place to look (a minimal reproduction manifest is sketched right after this list).
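
For reference, a minimal Service manifest that could be used for this reproduction looks roughly like the sketch below; the name, namespace, and registry/image are hypothetical placeholders, and any image that is later made private or removed from the registry works:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello-os              # hypothetical name, matching the deployment name seen later in the logs
  namespace: default
spec:
  template:
    spec:
      containers:
        - image: registry.example.com/hello:latest   # hypothetical internal registry; later made private/removed

After the revision scales to zero, the image is made unavailable and the service URL is curled; the request hangs because no pod can come up.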

The UX should be much better here. Although we have a debugging guide that lists a number of places to look for an error, it does not really help: the error goes away and several resources look ready because of the scale-down to zero.

I expected the deployment not to progress and to get stuck with something similar to the following (or at least for that to be an option, since a normal deployment would stabilize in that state), or to report the error and then scale down:

MinimumReplicasUnavailable Deployment does not have minimum availability

Unfortunately, by default the code here cannot capture this because of the scale-down to zero, and it still occurs even if we lower the deadline to a reasonable value such as serving.knative.dev/progress-deadline: 30s (although it is up to the app owner to set this properly, which is a bit painful).
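
For clarity, lowering the deadline is done via the per-revision annotation on the Service's revision template, roughly as in this sketch (value taken from above):

spec:
  template:
    metadata:
      annotations:
        serving.knative.dev/progress-deadline: "30s"   # per-revision progress deadline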

I think we could increase autoscaling.knative.dev/scale-down-delay, set minScale to 1, or lower the request timeout, but ideally I would expect Knative to report that the ksvc is not healthy in this scenario, although I understand this could be a transient error (the image pull could have been fixed in the meantime). There is also a timeout here, activationTimeoutBuffer, that could be made configurable? I don't see any logs for the revision not being activated.
Maybe we should mark it as unhealthy and, when a new request comes in, re-evaluate the ksvc status.
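
For completeness, the workarounds mentioned above would be set via per-revision annotations, roughly like this sketch (the values are illustrative, not recommendations):

spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/scale-down-delay: "15m"   # keep the last pod around longer before scaling to zero
        autoscaling.knative.dev/min-scale: "1"            # or: never scale to zero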

Actual Behavior

The behavior of the deployment is captured in the attached files; the number prefix on each filename represents the time ordering. Everything seems to be fine, although the deployment never reached the desired number of replicas.

Steps to Reproduce the Problem

Described above.

cc @evankanderson @dprotaso @mattmoor @psschwei

@skonto skonto added the kind/bug Categorizes issue or PR as related to a bug. label Jul 10, 2023
@skonto skonto added the area/API API objects and controllers label Jul 10, 2023
@evankanderson
Member

Thanks, this has always bothered me, and the repro case is super-helpful!

Aggregating all Pods' errors into a Revision seems likely to cause a hotspot, but capturing errors on the one-to-zero case seems pretty safe, and even a bit clever.

Thinking further, we might be able to avoid spam on the Revision using a model a bit like Events where we rate-limit the updates and don't update if the error is the same as already reported.

@dprotaso
Member

@skonto can you confirm this is a dupe of #13677 and, if so, move your repro case there and close this issue?

@skonto
Contributor Author

skonto commented Jul 11, 2023

@dprotaso I don't think this is the same issue, as the pods are not sticking around; the image pull error does not block the pods from scaling down. I will double-check.

@skonto
Contributor Author

skonto commented Jul 11, 2023

Indeed, this is different AFAIK.

@ReToCode ReToCode added the triage/accepted Issues which should be fixed (post-triage) label Jul 24, 2023
@skonto
Contributor Author

skonto commented Aug 30, 2023

@evankanderson @dprotaso

We had some more runs with minScale=1 and progressDeadline=10 that exposed another problem.
I am trying to force the deployment's progress deadline to expire so that the failure is reported at the revision level via the current revision reconciler logic (with no scaling down). The configuration used is sketched below.
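
For reference, the revision template annotations for these runs would look roughly like this sketch (the "10s" value is assumed from the attached progress-deadline-10s-minscale-1.zip):

spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/min-scale: "1"
        serving.knative.dev/progress-deadline: "10s"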

Looking at the logs, we are actually hitting a K8s limitation where the deployment is stuck with these conditions:

                "conditions": [
                    {
                        "lastTransitionTime": "2023-07-19T15:01:37Z",
                        "lastUpdateTime": "2023-07-19T15:01:41Z",
                        "message": "ReplicaSet \"hello-os-00001-deployment-7f779f79bb\" has successfully progressed.",
                        "reason": "NewReplicaSetAvailable",
                        "status": "True",
                        "type": "Progressing"
                    },
                    {
                        "lastTransitionTime": "2023-07-19T15:03:27Z",
                        "lastUpdateTime": "2023-07-19T15:03:27Z",
                        "message": "Deployment does not have minimum availability.",
                        "reason": "MinimumReplicasUnavailable",
                        "status": "False",
                        "type": "Available"
                    }
                ],

but the progress deadline is not honored if we are outside of a rollout; see the K8s issue / reproducer here.

Now the revision consistently shows:

{
    "lastTransitionTime": "2023-07-19T15:03:27Z",
    "message": "Requests to the target are being buffered as resources are provisioned.",
    "reason": "Queued",
    "severity": "Info",
    "status": "Unknown",
    "type": "Active"
},

It also shows though:

                    {
                        "lastTransitionTime": "2023-07-19T15:01:41Z",
                        "status": "True",
                        "type": "Ready"
                    },

I am confused by the fact that we report Ready in case of failure.

The Knative reconciler is not getting triggered because the deployment does not change.

For data/logs see the attached files: progress-deadline-10s-minscale-1.zip

We also have a run with minScale=0 (progress-deadline-10s.zip); there the deployment status is the same, but the revision status is stuck at:

{
    "lastTransitionTime": "2023-08-30T13:06:31Z",
    "message": "The target is not receiving traffic.",
    "reason": "NoTraffic",
    "severity": "Info",
    "status": "False",
    "type": "Active"
},


My understanding is that this condition, in this scenario, hides the problem of the initial replica never coming up; see below.

In contrast with the above, assume we have a healthy revision with minScale=0.

When the first replica is up and running, and before scaling down, the revision conditions are:

"conditions": [
    {
        "lastTransitionTime": "2023-08-30T13:05:31Z",
        "severity": "Info",
        "status": "True",
        "type": "Active"
    },
    {
        "lastTransitionTime": "2023-08-30T13:05:31Z",
        "status": "True",
        "type": "ContainerHealthy"
    },
    {
        "lastTransitionTime": "2023-08-30T13:05:31Z",
        "status": "True",
        "type": "Ready"
    },
    {
        "lastTransitionTime": "2023-08-30T13:05:31Z",
        "status": "True",
        "type": "ResourcesAvailable"
    }
],

After the revision is scaled down:

"conditions": [
                   {
                       "lastTransitionTime": "2023-08-30T13:06:31Z",
                       "message": "The target is not receiving traffic.",
                       "reason": "NoTraffic",
                       "severity": "Info",
                       "status": "False",
                       "type": "Active"
                   },
                   {
                       "lastTransitionTime": "2023-08-30T13:05:31Z",
                       "status": "True",
                       "type": "ContainerHealthy"
                   },
                   {
                       "lastTransitionTime": "2023-08-30T13:05:31Z",
                       "status": "True",
                       "type": "Ready"
                   },
                   {
                       "lastTransitionTime": "2023-08-30T13:05:31Z",
                       "status": "True",
                       "type": "ResourcesAvailable"
                   }
               ],

The final conditions for a healthy revision and for a non-healthy one are the same in this case.
