[wip] If deployment is never available propagate the container msg #14835

Open

skonto wants to merge 6 commits into main from propagate_rev_not_available

Conversation

skonto
Contributor

@skonto skonto commented Jan 24, 2024

Fixes #14157

Proposed Changes

  • This propagates the message of the container when the deployment never reaches availability and keeps having the following condition (a rough sketch of the propagation is shown after this list):
        {
            "lastTransitionTime": "2024-01-23T20:05:38Z",
            "message": "Deployment does not have minimum availability.",
            "reason": "MinimumReplicasUnavailable",
            "status": "False",
            "type": "Ready"
        }

Then, when the deployment is scaled back to zero, the ksvc will have:


           "conditions": [
                    {
                        "lastTransitionTime": "2024-04-02T09:18:31Z",
                        "message": "Revision \"helloworld-go-00001\" failed with message: Back-off pulling image \"index.docker.io/skonto/helloworld-go@sha256:dd20d7659c16bdc58c09740a543ef3c36b7c04742a2b6b280a30c2a76dcf6c09\".",
                        "reason": "RevisionFailed",
                        "status": "False",
                        "type": "ConfigurationsReady"
                    },
                    {
                        "lastTransitionTime": "2024-04-02T09:14:44Z",
                        "message": "Revision \"helloworld-go-00001\" failed to become ready.",
                        "reason": "RevisionMissing",
                        "status": "False",
                        "type": "Ready"
                    },
                    {
                        "lastTransitionTime": "2024-04-02T09:14:44Z",
                        "message": "Revision \"helloworld-go-00001\" failed to become ready.",
                        "reason": "RevisionMissing",
                        "status": "False",
                        "type": "RoutesReady"
                    }
                ],


  • This changes the initial state from "Unknown" to "failed" even for normal ksvcs; however, once the service is up, this is cleared at the configuration level as well. The idea for this initial state comes from the fact that Kubernetes readiness probes likewise consider a probe failed until initialDelaySeconds has passed.
    Right now, when a ksvc is deployed, at the beginning of its lifecycle we have:
                "conditions": [
                    {
                        "lastTransitionTime": "2024-04-02T10:34:55Z",
                        "status": "Unknown",
                        "type": "ConfigurationsReady"
                    },
                    {
                        "lastTransitionTime": "2024-04-02T10:34:55Z",
                        "message": "Configuration \"helloworld-go\" is waiting for a Revision to become ready.",
                        "reason": "RevisionMissing",
                        "status": "Unknown",
                        "type": "Ready"
                    },
                    {
                        "lastTransitionTime": "2024-04-02T10:34:55Z",
                        "message": "Configuration \"helloworld-go\" is waiting for a Revision to become ready.",
                        "reason": "RevisionMissing",
                        "status": "Unknown",
                        "type": "RoutesReady"
                    }

With this PR we start with a failing status until it is cleared:

                "conditions": [
                    {
                        "lastTransitionTime": "2024-04-02T10:31:01Z",
                        "message": "Revision \"helloworld-go-00001\" failed with message: .",
                        "reason": "RevisionFailed",
                        "status": "False",
                        "type": "ConfigurationsReady"
                    },
                    {
                        "lastTransitionTime": "2024-04-02T10:31:01Z",
                        "message": "Configuration \"helloworld-go\" does not have any ready Revision.",
                        "reason": "RevisionMissing",
                        "status": "False",
                        "type": "Ready"
                    },
                    {
                        "lastTransitionTime": "2024-04-02T10:31:01Z",
                        "message": "Configuration \"helloworld-go\" does not have any ready Revision.",
                        "reason": "RevisionMissing",
                        "status": "False",
                        "type": "RoutesReady"
                    }
  • To reproduce the initial issue with minikube you can use the following:
    • apply a ksvc
    • wait for a pod to come up and then for the ksvc to scale to zero
    • run minikube image list and then minikube image rm so that no image is available within minikube for the user container
    • block any internet access so the image can't be pulled
    • issue a request via curl ...
    • observe the revision and deployment statuses
  • When the issue is resolved the next request will clear the status messages:
               "conditions": [
                   {
                       "lastTransitionTime": "2024-04-02T09:37:03Z",
                       "status": "True",
                       "type": "ConfigurationsReady"
                   },
                   {
                       "lastTransitionTime": "2024-04-02T09:37:03Z",
                       "status": "True",
                       "type": "Ready"
                   },
                   {
                       "lastTransitionTime": "2024-04-02T09:37:03Z",
                       "status": "True",
                       "type": "RoutesReady"
                   }

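For illustration, here is a minimal sketch of the propagation described in the first bullet, assuming a helper like the PropagateDeploymentAvailabilityStatusIfFalse named later in the diff; the body below is an assumption, not the exact PR code:

    import (
        appsv1 "k8s.io/api/apps/v1"
        corev1 "k8s.io/api/core/v1"
    )

    // Sketch: copy the Deployment's Available=False condition onto the Revision's
    // ResourcesAvailable condition, so the container-level message (e.g. an image
    // pull back-off) is kept even after the pods are gone.
    func (rs *RevisionStatus) PropagateDeploymentAvailabilityStatusIfFalse(ds *appsv1.DeploymentStatus) {
        for _, cond := range ds.Conditions {
            if cond.Type == appsv1.DeploymentAvailable && cond.Status == corev1.ConditionFalse {
                rs.MarkResourcesAvailableFalse(cond.Reason, cond.Message)
                return
            }
        }
    }
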
@knative-prow knative-prow bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 24, 2024
@knative-prow knative-prow bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. area/API API objects and controllers area/autoscale area/test-and-release It flags unit/e2e/conformance/perf test issues for product features labels Jan 24, 2024
@skonto skonto removed the request for review from evankanderson January 24, 2024 14:41

codecov bot commented Jan 24, 2024

Codecov Report

Attention: Patch coverage is 52.17391% with 11 lines in your changes missing coverage. Please review.

Project coverage is 85.22%. Comparing base (c2d0af1) to head (3b7524c).
Report is 119 commits behind head on main.

Files Patch % Lines
pkg/apis/serving/v1/revision_lifecycle.go 0.00% 11 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #14835      +/-   ##
==========================================
+ Coverage   84.11%   85.22%   +1.11%     
==========================================
  Files         213      213              
  Lines       16783    13322    -3461     
==========================================
- Hits        14117    11354    -2763     
+ Misses       2315     1622     -693     
+ Partials      351      346       -5     

☔ View full report in Codecov by Sentry.

@knative-prow knative-prow bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 25, 2024
@@ -52,13 +52,10 @@ func TestImagePullError(t *testing.T) {
 	cond := r.Status.GetCondition(v1.ConfigurationConditionReady)
 	if cond != nil && !cond.IsUnknown() {
 		if cond.IsFalse() {
-			if cond.Reason == wantCfgReason {
+			if cond.Reason == wantCfgReason && strings.Contains(cond.Message, "Back-off pulling image") {
Contributor Author
@skonto skonto Jan 25, 2024

Previously the configuration would go from status:

k get configuration  -n serving-tests 
NAME                        LATESTCREATED                     LATESTREADY   READY     REASON
image-pull-error-dagmxojy   image-pull-error-dagmxojy-00001                 Unknown 

to status failed after the progress deadline was exceeded (120s in tests).
Here instead, due to this patch, as soon as it sees no availability on the deployment side it will mark the revision as ready=false and the configuration will get:

                    {
                        "lastTransitionTime": "2024-01-25T13:44:18Z",
                        "message": "Revision \"image-pull-error-kbfvvcsg-00001\" failed with message: Deployment does not have minimum availability..",
                        "reason": "RevisionFailed",
                        "status": "False",
                        "type": "Ready"
                    }

Later on, after the progress deadline has passed, it will get the expected message here.
That is why we will have to wait.

                "conditions": [
                    {
                        "lastTransitionTime": "2024-01-25T13:36:08Z",
                        "message": "Revision \"image-pull-error-gnwactac-00001\" failed with message: Deployment does not have minimum availability..",
                        "reason": "RevisionFailed",
                        "status": "False",
                        "type": "Ready"
                    }

Contributor Author

Need to cover rpc error: code = NotFound desc = failed to pull and unpack image as well, as seen from the failures.

@@ -549,6 +549,30 @@ func TestReconcile(t *testing.T) {
Object: pa("foo", "pull-backoff", WithReachabilityUnreachable),
}},
Key: "foo/pull-backoff",
}, {
Name: "surface ImagePullBackoff when previously scaled ok but now image is missing",
Contributor Author
@skonto skonto Jan 25, 2024

This covers the case where we are not in a rollout, e.g. scaling from zero when the image is not there.
The goal is that in this scenario, when we scale down to zero, we will have:

        {
            "lastTransitionTime": "2024-01-23T20:16:43Z",
            "message": "The target is not receiving traffic.",
            "reason": "NoTraffic",
            "severity": "Info",
            "status": "False",
            "type": "Active"
        },
        {
            "lastTransitionTime": "2024-01-23T20:03:07Z",
            "status": "True",
            "type": "ContainerHealthy"
        },
        {
            "lastTransitionTime": "2024-01-23T20:05:38Z",
            "message": "Deployment does not have minimum availability.",
            "reason": "MinimumReplicasUnavailable",
            "status": "False",
            "type": "Ready"
        },
        {
            "lastTransitionTime": "2024-01-23T20:05:38Z",
            "message": "Deployment does not have minimum availability.",
            "reason": "MinimumReplicasUnavailable",
            "status": "False",
            "type": "ResourcesAvailable"
        }

instead of

    {
             "lastTransitionTime": "2024-01-23T20:03:07Z",
             "severity": "Info",
             "status": "True",
             "type": "Active"
         },
         {
             "lastTransitionTime": "2024-01-23T20:03:07Z",
             "status": "True",
             "type": "ContainerHealthy"
         },
         {
             "lastTransitionTime": "2024-01-23T20:03:07Z",
             "status": "True",
             "type": "Ready"
         },
         {
             "lastTransitionTime": "2024-01-23T20:03:07Z",
             "status": "True",
             "type": "ResourcesAvailable"
         }

so the user knows that something was not OK.

@dprotaso
Member

The earlier concern with using Available is that it changes while the revision is scaling; thus Available=False when scaling up, until all the pods are ready.

We don't want to mark the revision as ready=false while scaling is occurring. I haven't dug into the code changes in the PR yet, but how do we handle that scenario?

@skonto
Contributor Author

skonto commented Jan 25, 2024

We don't want to mark the revision as ready=false while scaling is occurring.

My goal is to only touch the status when the progress deadline does not seem to work (kubernetes/kubernetes#106054). That means only applying the change when *deployment.Spec.Replicas > 0 && deployment.Status.AvailableReplicas == 0 holds and some waiting is happening, which means that in that case nothing is actually ready. Also, I only set the revision availability to false if it is not false already; otherwise I don't touch it. A rough sketch of that guard is shown below.
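A minimal sketch of that guard, assuming it lives in the deployment reconciliation path and reuses the propagation helper from the diff (not the exact PR code):

    // Only step in when the deployment wants replicas but none are available,
    // i.e. the scale-from-zero case that ProgressDeadline does not cover.
    if deployment.Spec.Replicas != nil && *deployment.Spec.Replicas > 0 &&
        deployment.Status.AvailableReplicas == 0 {
        rev.Status.PropagateDeploymentAvailabilityStatusIfFalse(&deployment.Status)
    }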

@dprotaso
Member

Yeah, ProgressDeadline only seems to apply when doing a rollout from one ReplicaSet to another one.

@dprotaso
Member

I discovered that here
kubernetes/kubernetes#106697

@skonto skonto changed the title from "[WIP] If deployment is never available propagate the msg" to "If deployment is never available propagate the msg" on Feb 2, 2024
@knative-prow knative-prow bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 2, 2024
@skonto
Contributor Author

skonto commented Feb 2, 2024

/test istio-latest-no-mesh

@skonto
Contributor Author

skonto commented Feb 5, 2024

infra
/test istio-latest-no-mesh

@skonto
Contributor Author

skonto commented Feb 5, 2024

@dprotaso @ReToCode gentle ping

m := revisionCondSet.Manage(rs)
avCond := m.GetCondition(RevisionConditionResourcesAvailable)

// Skip if set for other reasons
Member

What other reasons? Why would we need to skip in that case?

Contributor Author
@skonto skonto Feb 5, 2024

I don't want to change the current state machine, see comment: #14835 (comment). So I am only targeting a specific case.
Now if, for example, the deployment faced an issue and the revision's availability condition is already set to false, I am not going to update it and set it to false again; I am just skipping the update and keeping things as they are.
In general it should not make a difference: if having the availability condition false was an intermediate state for some other reason (this happens when replicas are not ready), then it is going to be reset to true and later we will set it to false again anyway. A small sketch of the skip check is shown below.
I can try without this in a test PR, but I am wondering about side effects in general.
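A small sketch of that skip check, assuming the revisionCondSet and RevisionConditionResourcesAvailable names from the snippet above (the exact body is an assumption, not the PR code):

    avCond := revisionCondSet.Manage(rs).GetCondition(RevisionConditionResourcesAvailable)
    // Skip the update when ResourcesAvailable is already False for another
    // reason, so a more specific failure message is not overwritten.
    if avCond != nil && avCond.IsFalse() {
        return
    }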

Member

Ok, I think the comment was just not explaining this fully.

@@ -137,6 +137,12 @@ func MarkInactive(reason, message string) RevisionOption {
}
}

func MarkActiveUknown(reason, message string) RevisionOption {
Member

typo

@@ -92,3 +90,8 @@ func createLatestConfig(t *testing.T, clients *test.Clients, names test.Resource
c.Spec = *v1test.ConfigurationSpec(names.Image)
})
}

func hasPullErrorMsg(msg string) bool {
return strings.Contains(msg, "Back-off pulling image") ||
Member

where do these strings come from? How do we know it's all of them?

Contributor Author

It is not hard, by trial and error; see the previous PR here: #4348.
I know it is not that great, but I don't expect it to be much of a problem; a possible extension of the helper is sketched below.
Alternatively we can make it independent of the string, but then I would have to wait for the configuration to fail and then also wait for the revision to have the right status. The reason I have it this way is that if I don't check for the right configuration status with the right message, the test quickly moves on to check the revision, and since there is currently no wait at the revision check it fails immediately (at the revision check).
Note here that with this patch we change the initial status of the configuration to false, so the first configuration check in the test passes quickly.
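For illustration, a version of the helper that also covers the containerd message mentioned earlier in this thread might look like this (a sketch; which strings are sufficient is exactly the open question here, and a strings import is assumed):

    func hasPullErrorMsg(msg string) bool {
        // Best-effort list of pull-failure messages observed in practice.
        return strings.Contains(msg, "Back-off pulling image") ||
            strings.Contains(msg, "failed to pull and unpack image")
    }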

Comment on lines 229 to 231
func (pas *PodAutoscalerStatus) MarkNotReady(reason, mes string) {
	podCondSet.Manage(pas).MarkUnknown(PodAutoscalerConditionReady, reason, mes)
}
Member

FYI this isn't needed, because marking SKSReady=False will mark the PA Ready=False:

var podCondSet = apis.NewLivingConditionSet(
	PodAutoscalerConditionActive,
	PodAutoscalerConditionScaleTargetInitialized,
	PodAutoscalerConditionSKSReady,
)

Contributor Author

Ok I will check it.

Contributor Author
@skonto skonto Mar 28, 2024

This is only used in tests (table_test.go) to show the expected status for pa but probably can be removed.

logger.Infof("marking resources unavailable with: %s: %s", w.Reason, w.Message)
rev.Status.MarkResourcesAvailableFalse(w.Reason, w.Message)
} else {
rev.Status.PropagateDeploymentAvailabilityStatusIfFalse(&deployment.Status)
Member

The status message on minAvailable isn't very informative. In the reported issue I'm wondering if it's better to surface ErrImagePull/ImagePullBackOff from the Pod.

I believe that's what the original code intended, but hasDeploymentTimedOut is broken because it doesn't take into account a deployment that is not changing but is scaling from zero. It seems like that's what we should be addressing here.

Contributor Author
@skonto skonto Feb 21, 2024

I kept it generic just to show that availability was never reached. Let me check whether I can get the container status and how that works; a rough sketch of reading it is shown below.
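A sketch of reading the container state from the pod, using standard Kubernetes container status fields (this is not code from the PR; the function name is hypothetical):

    import corev1 "k8s.io/api/core/v1"

    // waitingPullError returns the reason/message of a container that is stuck
    // waiting on an image pull, if any.
    func waitingPullError(pod *corev1.Pod) (reason, message string, ok bool) {
        for _, cs := range pod.Status.ContainerStatuses {
            if w := cs.State.Waiting; w != nil &&
                (w.Reason == "ImagePullBackOff" || w.Reason == "ErrImagePull") {
                return w.Reason, w.Message, true
            }
        }
        return "", "", false
    }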

@dprotaso
Member

hey @skonto are you still working on this?

@knative-prow-robot knative-prow-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 27, 2024
@skonto
Contributor Author

skonto commented Mar 28, 2024

hey @skonto are you still working on this?

@dprotaso yes, I will take a look and update it based on your comment here: #14835 (comment).

@@ -72,7 +76,10 @@ func (c *Reconciler) reconcileDeployment(ctx context.Context, rev *v1.Revision)
return fmt.Errorf("failed to update deployment %q: %w", deploymentName, err)
}

rev.Status.PropagateDeploymentStatus(&deployment.Status)
// When we are scaling down we want to keep the error that we might have seen before
if *deployment.Spec.Replicas > 0 {
Contributor Author
@skonto skonto Apr 2, 2024

!rev.Status.IsActivationRequired() is not enough because we go back to the replicaset failure coverage issue.

@knative-prow knative-prow bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 2, 2024
@skonto
Contributor Author

skonto commented Apr 2, 2024

@dprotaso could you please take a look at the current fix? It is quite unfortunate that the progress deadline does not cover this, because the fix belongs in K8s, not here, imho. 🤷
Btw, it is hard to follow in general how resources are updated. The reason is that controllers often have more than one resource that they create/manage, and then we rely on triggering reconciliation on every resource that changes, hopefully reaching the ksvc status and updating it at some point. The graph is basically more or less like this (omitting SKS):

ksvc <-> config <->{route, revision}
revision <-> {pa, certificate, deployment}
I am wondering if we could simplify this further.


knative-prow bot commented Apr 3, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: skonto
Once this PR has been reviewed and has the lgtm label, please ask for approval from dprotaso. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@skonto skonto changed the title from "[wip] If deployment is never available propagate the container msg" to "If deployment is never available propagate the container msg" on Apr 3, 2024
@knative-prow knative-prow bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 3, 2024
@skonto skonto force-pushed the propagate_rev_not_available branch 2 times, most recently from b487995 to 31ced74 Compare April 11, 2024 11:35
@skonto skonto force-pushed the propagate_rev_not_available branch from 31ced74 to 3b7524c Compare April 11, 2024 11:36
@dprotaso
Member

closing and reopening to trigger new GitHub Actions runs

@dprotaso dprotaso closed this Apr 22, 2024
@dprotaso dprotaso reopened this Apr 22, 2024
@dprotaso
Member

/retest

@dprotaso
Member

Looks like the failures are legit in the api package

@dprotaso
Member

dprotaso commented Apr 23, 2024

/hold for release tomorrow (if prow works ;) )

@knative-prow knative-prow bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 23, 2024
@skonto
Contributor Author

skonto commented Apr 26, 2024

It needs more work.

@skonto skonto changed the title from "If deployment is never available propagate the container msg" to "[wip] If deployment is never available propagate the container msg" on Apr 26, 2024
@knative-prow knative-prow bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 26, 2024

knative-prow bot commented May 21, 2024

@skonto: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
istio-latest-no-mesh_serving_main 3b7524c link true /test istio-latest-no-mesh
certmanager-integration-tests_serving_main 3b7524c link true /test certmanager-integration-tests

Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels
area/API API objects and controllers area/autoscale area/test-and-release It flags unit/e2e/conformance/perf test issues for product features do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Error for failed revision is not reported due to scaling to zero
4 participants