[wip] If deployment is never available propagate the container msg #14835

Open

skonto wants to merge 6 commits into main from propagate_rev_not_available

Conversation

skonto
Contributor

@skonto skonto commented Jan 24, 2024

Fixes #14157

Proposed Changes

  • This propagates the message of the container when the deployment never reaches availability and keeps having the following condition (a rough sketch of the propagation is shown after this list):
        {
            "lastTransitionTime": "2024-01-23T20:05:38Z",
            "message": "Deployment does not have minimum availability.",
            "reason": "MinimumReplicasUnavailable",
            "status": "False",
            "type": "Ready"
        }

Then, when the deployment is scaled back to zero, the ksvc will have:


           "conditions": [
                    {
                        "lastTransitionTime": "2024-04-02T09:18:31Z",
                        "message": "Revision \"helloworld-go-00001\" failed with message: Back-off pulling image \"index.docker.io/skonto/helloworld-go@sha256:dd20d7659c16bdc58c09740a543ef3c36b7c04742a2b6b280a30c2a76dcf6c09\".",
                        "reason": "RevisionFailed",
                        "status": "False",
                        "type": "ConfigurationsReady"
                    },
                    {
                        "lastTransitionTime": "2024-04-02T09:14:44Z",
                        "message": "Revision \"helloworld-go-00001\" failed to become ready.",
                        "reason": "RevisionMissing",
                        "status": "False",
                        "type": "Ready"
                    },
                    {
                        "lastTransitionTime": "2024-04-02T09:14:44Z",
                        "message": "Revision \"helloworld-go-00001\" failed to become ready.",
                        "reason": "RevisionMissing",
                        "status": "False",
                        "type": "RoutesReady"
                    }
                ],


  • This changes the initial state from "Unknown" to "failed" even for normal ksvcs; however, once the service is up, this is cleared at the configuration level as well. The idea for this initial state comes from the fact that Kubernetes readiness probes likewise consider a probe failed until initialDelaySeconds has passed.
    Right now, when a ksvc is deployed, at the beginning of its lifecycle we have:
                "conditions": [
                    {
                        "lastTransitionTime": "2024-04-02T10:34:55Z",
                        "status": "Unknown",
                        "type": "ConfigurationsReady"
                    },
                    {
                        "lastTransitionTime": "2024-04-02T10:34:55Z",
                        "message": "Configuration \"helloworld-go\" is waiting for a Revision to become ready.",
                        "reason": "RevisionMissing",
                        "status": "Unknown",
                        "type": "Ready"
                    },
                    {
                        "lastTransitionTime": "2024-04-02T10:34:55Z",
                        "message": "Configuration \"helloworld-go\" is waiting for a Revision to become ready.",
                        "reason": "RevisionMissing",
                        "status": "Unknown",
                        "type": "RoutesReady"
                    }

With this PR we start with a failing status until it is cleared:

                "conditions": [
                    {
                        "lastTransitionTime": "2024-04-02T10:31:01Z",
                        "message": "Revision \"helloworld-go-00001\" failed with message: .",
                        "reason": "RevisionFailed",
                        "status": "False",
                        "type": "ConfigurationsReady"
                    },
                    {
                        "lastTransitionTime": "2024-04-02T10:31:01Z",
                        "message": "Configuration \"helloworld-go\" does not have any ready Revision.",
                        "reason": "RevisionMissing",
                        "status": "False",
                        "type": "Ready"
                    },
                    {
                        "lastTransitionTime": "2024-04-02T10:31:01Z",
                        "message": "Configuration \"helloworld-go\" does not have any ready Revision.",
                        "reason": "RevisionMissing",
                        "status": "False",
                        "type": "RoutesReady"
                    }
  • To reproduce the initial issue with minikube you can use the following:
    • apply a ksvc
    • wait for a pod to come up and then for the ksvc to scale to zero
    • run minikube image list and then minikube image rm so that no image is available within minikube for the user container
    • block any internet access so the image can't be pulled
    • issue a request via curl ...
    • observe the revision and deployment statuses
  • When the issue is resolved the next request will clear the status messages:
               "conditions": [
                   {
                       "lastTransitionTime": "2024-04-02T09:37:03Z",
                       "status": "True",
                       "type": "ConfigurationsReady"
                   },
                   {
                       "lastTransitionTime": "2024-04-02T09:37:03Z",
                       "status": "True",
                       "type": "Ready"
                   },
                   {
                       "lastTransitionTime": "2024-04-02T09:37:03Z",
                       "status": "True",
                       "type": "RoutesReady"
                   }

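For illustration, here is a minimal sketch of the propagation described in the first bullet, assuming a helper like the PropagateDeploymentAvailabilityStatusIfFalse named later in the diff; the body below is an assumption, not the exact PR code:

    import (
        appsv1 "k8s.io/api/apps/v1"
        corev1 "k8s.io/api/core/v1"
    )

    // Sketch: copy the Deployment's Available=False condition onto the Revision's
    // ResourcesAvailable condition, so the container-level message (e.g. an image
    // pull back-off) is kept even after the pods are gone.
    func (rs *RevisionStatus) PropagateDeploymentAvailabilityStatusIfFalse(ds *appsv1.DeploymentStatus) {
        for _, cond := range ds.Conditions {
            if cond.Type == appsv1.DeploymentAvailable && cond.Status == corev1.ConditionFalse {
                rs.MarkResourcesAvailableFalse(cond.Reason, cond.Message)
                return
            }
        }
    }
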
@knative-prow knative-prow bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 24, 2024
@knative-prow knative-prow bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. area/API API objects and controllers area/autoscale area/test-and-release It flags unit/e2e/conformance/perf test issues for product features labels Jan 24, 2024
@skonto skonto removed the request for review from evankanderson January 24, 2024 14:41

codecov bot commented Jan 24, 2024

Codecov Report

Attention: Patch coverage is 52.17391% with 11 lines in your changes missing coverage. Please review.

Project coverage is 85.22%. Comparing base (c2d0af1) to head (3b7524c).
Report is 119 commits behind head on main.

Files Patch % Lines
pkg/apis/serving/v1/revision_lifecycle.go 0.00% 11 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #14835      +/-   ##
==========================================
+ Coverage   84.11%   85.22%   +1.11%     
==========================================
  Files         213      213              
  Lines       16783    13322    -3461     
==========================================
- Hits        14117    11354    -2763     
+ Misses       2315     1622     -693     
+ Partials      351      346       -5     

☔ View full report in Codecov by Sentry.

@knative-prow knative-prow bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 25, 2024
@@ -52,13 +52,10 @@ func TestImagePullError(t *testing.T) {
 	cond := r.Status.GetCondition(v1.ConfigurationConditionReady)
 	if cond != nil && !cond.IsUnknown() {
 		if cond.IsFalse() {
-			if cond.Reason == wantCfgReason {
+			if cond.Reason == wantCfgReason && strings.Contains(cond.Message, "Back-off pulling image") {
Contributor Author
@skonto skonto Jan 25, 2024

Previously the configuration would go from status:

k get configuration  -n serving-tests 
NAME                        LATESTCREATED                     LATESTREADY   READY     REASON
image-pull-error-dagmxojy   image-pull-error-dagmxojy-00001                 Unknown 

to status failed after the progress deadline was exceeded (120s in tests).
Here instead, due to this patch, as soon as it sees no availability on the deployment side it will mark the revision as ready=false and the configuration will get:

                    {
                        "lastTransitionTime": "2024-01-25T13:44:18Z",
                        "message": "Revision \"image-pull-error-kbfvvcsg-00001\" failed with message: Deployment does not have minimum availability..",
                        "reason": "RevisionFailed",
                        "status": "False",
                        "type": "Ready"
                    }

Later on, after the progress deadline has passed, it will get the expected message here.
That is why we will have to wait.

                "conditions": [
                    {
                        "lastTransitionTime": "2024-01-25T13:36:08Z",
                        "message": "Revision \"image-pull-error-gnwactac-00001\" failed with message: Deployment does not have minimum availability..",
                        "reason": "RevisionFailed",
                        "status": "False",
                        "type": "Ready"
                    }

Contributor Author

Need to cover rpc error: code = NotFound desc = failed to pull and unpack image as well, as seen from the failures.

@@ -549,6 +549,30 @@ func TestReconcile(t *testing.T) {
Object: pa("foo", "pull-backoff", WithReachabilityUnreachable),
}},
Key: "foo/pull-backoff",
}, {
Name: "surface ImagePullBackoff when previously scaled ok but now image is missing",
Contributor Author
@skonto skonto Jan 25, 2024

This covers the case where we are not in a rollout, e.g. scaling from zero when the image is not there.
The goal is that in this scenario, when we scale down to zero, we will have:

        {
            "lastTransitionTime": "2024-01-23T20:16:43Z",
            "message": "The target is not receiving traffic.",
            "reason": "NoTraffic",
            "severity": "Info",
            "status": "False",
            "type": "Active"
        },
        {
            "lastTransitionTime": "2024-01-23T20:03:07Z",
            "status": "True",
            "type": "ContainerHealthy"
        },
        {
            "lastTransitionTime": "2024-01-23T20:05:38Z",
            "message": "Deployment does not have minimum availability.",
            "reason": "MinimumReplicasUnavailable",
            "status": "False",
            "type": "Ready"
        },
        {
            "lastTransitionTime": "2024-01-23T20:05:38Z",
            "message": "Deployment does not have minimum availability.",
            "reason": "MinimumReplicasUnavailable",
            "status": "False",
            "type": "ResourcesAvailable"
        }

instead of

    {
             "lastTransitionTime": "2024-01-23T20:03:07Z",
             "severity": "Info",
             "status": "True",
             "type": "Active"
         },
         {
             "lastTransitionTime": "2024-01-23T20:03:07Z",
             "status": "True",
             "type": "ContainerHealthy"
         },
         {
             "lastTransitionTime": "2024-01-23T20:03:07Z",
             "status": "True",
             "type": "Ready"
         },
         {
             "lastTransitionTime": "2024-01-23T20:03:07Z",
             "status": "True",
             "type": "ResourcesAvailable"
         }

so the user knows that something was not OK.

@dprotaso
Member

The earlier concern with using Available is that it changes while the revision is scaling; thus Available=False when scaling up, until all the pods are ready.

We don't want to mark the revision as ready=false while scaling is occurring. I haven't dug into the code changes in the PR yet, but how do we handle that scenario?

@skonto
Contributor Author

skonto commented Jan 25, 2024

We don't want to mark the revision as ready=false while scaling is occurring.

My goal is to only touch the status when the progress deadline does not seem to work (kubernetes/kubernetes#106054). That means only applying the change when *deployment.Spec.Replicas > 0 && deployment.Status.AvailableReplicas == 0 holds and some waiting is happening, which means that in that case nothing is actually ready. Also, I only set the revision availability to false if it is not false already; otherwise I don't touch it. A rough sketch of that guard is shown below.
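A minimal sketch of that guard, assuming it lives in the deployment reconciliation path and reuses the propagation helper from the diff (not the exact PR code):

    // Only step in when the deployment wants replicas but none are available,
    // i.e. the scale-from-zero case that ProgressDeadline does not cover.
    if deployment.Spec.Replicas != nil && *deployment.Spec.Replicas > 0 &&
        deployment.Status.AvailableReplicas == 0 {
        rev.Status.PropagateDeploymentAvailabilityStatusIfFalse(&deployment.Status)
    }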

@dprotaso
Member

Yeah, ProgressDeadline only seems to apply when doing a rollout from one ReplicaSet to another one.

@dprotaso
Member

I discovered that here
kubernetes/kubernetes#106697

@skonto skonto changed the title from "[WIP] If deployment is never available propagate the msg" to "If deployment is never available propagate the msg" on Feb 2, 2024
@knative-prow knative-prow bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 2, 2024
@skonto
Contributor Author

skonto commented Feb 2, 2024

/test istio-latest-no-mesh

@skonto
Contributor Author

skonto commented Feb 5, 2024

infra
/test istio-latest-no-mesh

@skonto
Contributor Author

skonto commented Feb 5, 2024

@dprotaso @ReToCode gentle ping

m := revisionCondSet.Manage(rs)
avCond := m.GetCondition(RevisionConditionResourcesAvailable)

// Skip if set for other reasons
Member

What other reasons? Why would we need to skip in that case?

Contributor Author
@skonto skonto Feb 5, 2024

I don't want to change the current state machine, see comment: #14835 (comment). So I am only targeting a specific case.
Now if, for example, the deployment faced an issue and the revision's availability condition is already set to false, I am not going to update it and set it to false again; I am just skipping the update and keeping things as they are.
In general it should not make a difference: if having the availability condition false was an intermediate state for some other reason (this happens when replicas are not ready), then it is going to be reset to true and later we will set it to false again anyway. A small sketch of the skip check is shown below.
I can try without this in a test PR, but I am wondering about side effects in general.
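A small sketch of that skip check, assuming the revisionCondSet and RevisionConditionResourcesAvailable names from the snippet above (the exact body is an assumption, not the PR code):

    avCond := revisionCondSet.Manage(rs).GetCondition(RevisionConditionResourcesAvailable)
    // Skip the update when ResourcesAvailable is already False for another
    // reason, so a more specific failure message is not overwritten.
    if avCond != nil && avCond.IsFalse() {
        return
    }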

Member

Ok, I think the comment was just not explaining this fully.

@@ -137,6 +137,12 @@ func MarkInactive(reason, message string) RevisionOption {
}
}

func MarkActiveUknown(reason, message string) RevisionOption {
Member

typo

@@ -92,3 +90,8 @@ func createLatestConfig(t *testing.T, clients *test.Clients, names test.Resource
c.Spec = *v1test.ConfigurationSpec(names.Image)
})
}

func hasPullErrorMsg(msg string) bool {
return strings.Contains(msg, "Back-off pulling image") ||
Member

where do these strings come from? How do we know it's all of them?

Contributor Author

It is not hard, by trial and error; see the previous PR here: #4348.
I know it is not that great, but I don't expect it to be much of a problem; a possible extension of the helper is sketched below.
Alternatively we can make it independent of the string, but then I would have to wait for the configuration to fail and then also wait for the revision to have the right status. The reason I have it this way is that if I don't check for the right configuration status with the right message, the test quickly moves on to check the revision, and since there is currently no wait at the revision check it fails immediately (at the revision check).
Note here that with this patch we change the initial status of the configuration to false, so the first configuration check in the test passes quickly.
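For illustration, a version of the helper that also covers the containerd message mentioned earlier in this thread might look like this (a sketch; which strings are sufficient is exactly the open question here, and a strings import is assumed):

    func hasPullErrorMsg(msg string) bool {
        // Best-effort list of pull-failure messages observed in practice.
        return strings.Contains(msg, "Back-off pulling image") ||
            strings.Contains(msg, "failed to pull and unpack image")
    }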

Comment on lines 229 to 231
func (pas *PodAutoscalerStatus) MarkNotReady(reason, mes string) {
	podCondSet.Manage(pas).MarkUnknown(PodAutoscalerConditionReady, reason, mes)
}
Member

FYI this isn't needed, because marking SKSReady=False will mark the PA Ready=False:

var podCondSet = apis.NewLivingConditionSet(
	PodAutoscalerConditionActive,
	PodAutoscalerConditionScaleTargetInitialized,
	PodAutoscalerConditionSKSReady,
)

Contributor Author

Ok I will check it.

Contributor Author
@skonto skonto Mar 28, 2024

This is only used in tests (table_test.go) to show the expected status for pa but probably can be removed.

logger.Infof("marking resources unavailable with: %s: %s", w.Reason, w.Message)
rev.Status.MarkResourcesAvailableFalse(w.Reason, w.Message)
} else {
rev.Status.PropagateDeploymentAvailabilityStatusIfFalse(&deployment.Status)
Member

The status message on minAvailable isn't very informative. In the reported issue I'm wondering if it's better to surface ErrImagePull/ImagePullBackOff from the Pod.

I believe that's what the original code intended, but hasDeploymentTimedOut is broken because it doesn't take into account a deployment that is not changing but is scaling from zero. It seems like that's what we should be addressing here.

Contributor Author
@skonto skonto Feb 21, 2024

I kept it generic just to show that availability was never reached. Let me check whether I can get the container status and how that works; a rough sketch of reading it is shown below.
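A sketch of reading the container state from the pod, using standard Kubernetes container status fields (this is not code from the PR; the function name is hypothetical):

    import corev1 "k8s.io/api/core/v1"

    // waitingPullError returns the reason/message of a container that is stuck
    // waiting on an image pull, if any.
    func waitingPullError(pod *corev1.Pod) (reason, message string, ok bool) {
        for _, cs := range pod.Status.ContainerStatuses {
            if w := cs.State.Waiting; w != nil &&
                (w.Reason == "ImagePullBackOff" || w.Reason == "ErrImagePull") {
                return w.Reason, w.Message, true
            }
        }
        return "", "", false
    }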

@dprotaso
Member

hey @skonto are you still working on this?

@knative-prow-robot knative-prow-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 27, 2024
@skonto
Contributor Author

skonto commented Mar 28, 2024

hey @skonto are you still working on this?

@dprotaso yes, I will take a look and update it based on your comment here: #14835 (comment).

@@ -72,7 +76,10 @@ func (c *Reconciler) reconcileDeployment(ctx context.Context, rev *v1.Revision)
return fmt.Errorf("failed to update deployment %q: %w", deploymentName, err)
}

rev.Status.PropagateDeploymentStatus(&deployment.Status)
// When we are scaling down we want to keep the error that we might have seen before
if *deployment.Spec.Replicas > 0 {
Contributor Author
@skonto skonto Apr 2, 2024

!rev.Status.IsActivationRequired() is not enough because we go back to the replicaset failure coverage issue.

@knative-prow knative-prow bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 2, 2024
@skonto
Contributor Author

skonto commented Apr 2, 2024

@dprotaso could you please take a look at the current fix? It is quite unfortunate that the progress deadline does not cover this, because the fix belongs in K8s, not here, imho. 🤷
Btw, it is hard to follow in general how resources are updated. The reason is that controllers often have more than one resource that they create/manage, and then we rely on triggering reconciliation on every resource that changes, hopefully reaching the ksvc status and updating it at some point. The graph is basically more or less like this (omitting SKS):

ksvc <-> config <->{route, revision}
revision <-> {pa, certificate, deployment}
I am wondering if we could simplify this further.


knative-prow bot commented Apr 3, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: skonto
Once this PR has been reviewed and has the lgtm label, please ask for approval from dprotaso. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@skonto skonto changed the title from "[wip] If deployment is never available propagate the container msg" to "If deployment is never available propagate the container msg" on Apr 3, 2024
@knative-prow knative-prow bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 3, 2024
@skonto skonto force-pushed the propagate_rev_not_available branch 2 times, most recently from b487995 to 31ced74 Compare April 11, 2024 11:35
@skonto skonto force-pushed the propagate_rev_not_available branch from 31ced74 to 3b7524c Compare April 11, 2024 11:36
@dprotaso
Member

closing and reopening to trigger new GitHub Actions runs

@dprotaso dprotaso closed this Apr 22, 2024
@dprotaso dprotaso reopened this Apr 22, 2024
@dprotaso
Member

/retest

@dprotaso
Member

Looks like the failures are legit in the api package

@dprotaso
Member

dprotaso commented Apr 23, 2024

/hold for release tomorrow (if prow works ;) )

@knative-prow knative-prow bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 23, 2024
@skonto
Contributor Author

skonto commented Apr 26, 2024

It needs more work.

@skonto skonto changed the title from "If deployment is never available propagate the container msg" to "[wip] If deployment is never available propagate the container msg" on Apr 26, 2024
@knative-prow knative-prow bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 26, 2024

knative-prow bot commented May 21, 2024

@skonto: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
istio-latest-no-mesh_serving_main 3b7524c link true /test istio-latest-no-mesh
certmanager-integration-tests_serving_main 3b7524c link true /test certmanager-integration-tests

Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels
area/API API objects and controllers area/autoscale area/test-and-release It flags unit/e2e/conformance/perf test issues for product features do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Error for failed revision is not reported due to scaling to zero
4 participants