
Workflow does not complete even though Pod completed in 3.5.5 #12733

Closed
3 of 4 tasks
panicboat opened this issue Mar 4, 2024 · 21 comments · Fixed by #12755
Labels
area/controller: Controller issues, panics
P1: High priority. All bugs with >=5 thumbs up that aren’t P0, plus: Any other bugs deemed high priority
solution/suggested: A solution to the bug has been suggested. Someone needs to implement it.
solution/workaround: There's a workaround, might not be great, but exists
type/bug
type/regression: Regression from previous behavior (a specific type of bug)

Comments

panicboat (Contributor) commented Mar 4, 2024

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what did you expect to happen?

A Workflow is submitted and waits for a while, but never finishes.
Every Workflow behaves the same way.
When I check the status of the pod, it says Completed and the logs appear to be complete.
Also, even though the pod no longer appears as running in the UI, its logs can still be retrieved in the UI.
It appears that we can no longer track the status of the Pod.

[screenshot attached]

01054855@CF0313 minikube % k get po -n develop-workflow                               
NAME                                                                READY   STATUS      RESTARTS   AGE
p3-deletion-policy-mask-or-delete-user-data-66xf6-main-2457154088   0/2     Completed   0          6h12m
shinkobetsu-check-expired-enrollments-1709510400-main-933938806     0/2     Completed   0          7h45m
wonderful-rhino-argosay-67614653                                    0/2     Completed   0          2m23s
01054855@CF0313 minikube % k logs -n develop-workflow wonderful-rhino-argosay-67614653
time="2024-03-04T07:43:36.342Z" level=info msg="capturing logs" argo=true
hello argo
time="2024-03-04T07:43:37.343Z" level=info msg="sub-process exited" argo=true error="<nil>"

Version

v3.5.5

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

metadata:
  name: wonderful-rhino
  namespace: develop-workflow
  labels:
    example: 'true'
spec:
  arguments:
    parameters:
      - name: message
        value: hello argo
  entrypoint: argosay
  templates:
    - name: argosay
      inputs:
        parameters:
          - name: message
            value: '{{workflow.parameters.message}}'
      container:
        name: main
        image: argoproj/argosay:v2
        command:
          - /argosay
        args:
          - echo
          - '{{inputs.parameters.message}}'
  ttlStrategy:
    secondsAfterCompletion: 300
  podGC:
    strategy: OnPodCompletion

Logs from the workflow controller

It's large, so I'm attaching it.

Logs from in your workflow's wait container

No resources found in argo namespace.
panicboat (Contributor, Author):

wonderful-rhino.log

@agilgur5 agilgur5 changed the title Workflow does not terminate. Workflow does not complete Mar 4, 2024
@agilgur5 agilgur5 changed the title Workflow does not complete Workflow does not complete even though Pod completed Mar 5, 2024
jswxstw (Member) commented Mar 5, 2024

time="2024-03-04T07:43:46.402Z" level=warning msg="workflow uses legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" namespace=develop-workflow workflow=wonderful-rhino

It seems like the wait container has no permission to access the TaskResult. @panicboat, can you retrieve the wait container logs and the AnnotationKeyReportOutputsCompleted annotation of the pod?
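For reference, both can be fetched with kubectl, e.g. (pod name and namespace taken from the report above; adjust to your environment):

# wait container logs
kubectl logs -n develop-workflow wonderful-rhino-argosay-67614653 -c wait

# value of the report-outputs-completed annotation on the pod
kubectl get pod -n develop-workflow wonderful-rhino-argosay-67614653 \
  -o jsonpath='{.metadata.annotations.workflows\.argoproj\.io/report-outputs-completed}'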

}, errorsutil.IsTransientErr, func() error {
    err := we.upsertTaskResult(ctx, wfv1.NodeResult{})
    if apierr.IsForbidden(err) {
        log.WithError(err).Warn("failed to patch task set, falling back to legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/latest/workflow-rbac/")
        // Only added as a backup in case LabelKeyReportOutputsCompleted could not be set
        err = we.AddAnnotation(ctx, common.AnnotationKeyReportOutputsCompleted, "false")
    }
    return err
})

}, errorsutil.IsTransientErr, func() error {
    err := we.patchTaskResultLabels(ctx, map[string]string{
        common.LabelKeyReportOutputsCompleted: "true",
    })
    if apierr.IsForbidden(err) {
        log.WithError(err).Warn("failed to patch task set, falling back to legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/latest/workflow-rbac/")
        // Only added as a backup in case LabelKeyReportOutputsCompleted could not be set
        err = we.AddAnnotation(ctx, common.AnnotationKeyReportOutputsCompleted, "true")
    }
    return err
})

time="2024-03-04T07:43:46.402Z" level=info msg="pod reconciliation didn't complete, will retry" namespace=develop-workflow workflow=wonderful-rhino

taskResultIncomplete is always true, so podReconciliation reports that it did not complete and retries indefinitely.

// If the task result is completed, set the state to true.
if result.Labels[common.LabelKeyReportOutputsCompleted] == "true" {
    woc.log.Debugf("Marking task result complete %s", resultName)
    woc.wf.Status.MarkTaskResultComplete(resultName)
} else {
    woc.log.Debugf("Marking task result incomplete %s", resultName)
    woc.wf.Status.MarkTaskResultIncomplete(resultName)
}

@shuangkun It seems like a regression related to #12537.

devcrops-official:

We were experiencing the same issue; we rolled back to version 3.5.4.

shuangkun (Member) commented Mar 5, 2024

Was the pod's common.AnnotationKeyReportOutputsCompleted annotation set to "true"? @panicboat, we need the wait container logs to debug the problem. Thanks!

jswxstw (Member) commented Mar 5, 2024

I reproduced this bug by modifying the executor RBAC as below:

rules:
- apiGroups:
  - argoproj.io
  resources:
  - workflowtaskresults
  verbs:
  - patch
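You can verify what the executor's ServiceAccount is allowed to do with kubectl auth can-i, e.g. (the namespace and SA here match my reproduction, not necessarily the reporter's setup):

kubectl auth can-i create workflowtaskresults.argoproj.io --as=system:serviceaccount:argo:default -n argo
kubectl auth can-i patch workflowtaskresults.argoproj.io --as=system:serviceaccount:argo:default -n argo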

logs of wait container:

time="2024-03-05T11:52:21.280Z" level=info msg="Starting Workflow Executor" version=untagged
time="2024-03-05T11:52:21.285Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2024-03-05T11:52:21.285Z" level=info msg="Executor initialized" deadline="2024-03-05 11:57:19 +0000 UTC" includeScriptOutput=false namespace=argo podName=wonderful-rhino templateName=argosay version="&Version{Version:untagged,BuildDate:2024-03-01T07:42:24Z,GitCommit:de0a271708cfa990e7232afdb83dbc8825933930,GitTag:untagged,GitTreeState:clean,GoVersion:go1.21.7,Compiler:gc,Platform:linux/amd64,}"
time="2024-03-05T11:52:21.290Z" level=warning msg="failed to patch task set, falling back to legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/latest/workflow-rbac/" error="workflowtaskresults.argoproj.io is forbidden: User \"system:serviceaccount:argo:default\" cannot create resource \"workflowtaskresults\" in API group \"argoproj.io\" in the namespace \"argo\""
time="2024-03-05T11:52:21.300Z" level=info msg="Starting deadline monitor"
time="2024-03-05T11:52:23.302Z" level=info msg="Main container completed" error="<nil>"
time="2024-03-05T11:52:23.302Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2024-03-05T11:52:23.302Z" level=info msg="No output parameters"
time="2024-03-05T11:52:23.302Z" level=info msg="No output artifacts"
time="2024-03-05T11:52:23.302Z" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: wonderful-rhino/wonderful-rhino/main.log"
time="2024-03-05T11:52:23.302Z" level=info msg="Creating minio client using static credentials" endpoint="minio:9000"
time="2024-03-05T11:52:23.302Z" level=info msg="Saving file to s3" bucket=my-bucket endpoint="minio:9000" key=wonderful-rhino/wonderful-rhino/main.log path=/tmp/argo/outputs/logs/main.log
time="2024-03-05T11:52:23.312Z" level=info msg="Save artifact" artifactName=main-logs duration=9.81889ms error="<nil>" key=wonderful-rhino/wonderful-rhino/main.log
time="2024-03-05T11:52:23.312Z" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/logs/main.log
time="2024-03-05T11:52:23.312Z" level=info msg="Successfully saved file: /tmp/argo/outputs/logs/main.log"
time="2024-03-05T11:52:23.313Z" level=warning msg="failed to patch task set, falling back to legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/latest/workflow-rbac/" error="workflowtaskresults.argoproj.io is forbidden: User \"system:serviceaccount:argo:default\" cannot create resource \"workflowtaskresults\" in API group \"argoproj.io\" in the namespace \"argo\""
time="2024-03-05T11:52:23.323Z" level=info msg="Alloc=6796 TotalAlloc=12606 Sys=20325 NumGC=5 Goroutines=10"
time="2024-03-05T11:52:23.325Z" level=warning msg="Non-transient error: workflowtaskresults.argoproj.io \"wonderful-rhino\" not found"
time="2024-03-05T11:52:23.325Z" level=error msg="executor error: workflowtaskresults.argoproj.io \"wonderful-rhino\" not found"

I think this issue may be an RBAC problem rather than a regression bug, but we need more information to confirm it.

shuangkun (Member) commented Mar 5, 2024

Yes, we had the same idea, and I just reproduced it using this method.

@shuangkun shuangkun added the problem/more information needed Not enough information has been provide to diagnose this issue. label Mar 5, 2024
panicboat (Contributor, Author) commented Mar 6, 2024

The environment from the original report has already been reverted to v3.5.4.
Can I use data captured in a different environment?

[screenshot attached]

% k logs wonderful-rhino-argosay-67614653
time="2024-03-06T02:52:44.379Z" level=info msg="capturing logs" argo=true
hello argo
time="2024-03-06T02:52:45.381Z" level=info msg="sub-process exited" argo=true error="<nil>"
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubectl.kubernetes.io/default-container: main
    workflows.argoproj.io/node-id: wonderful-rhino-67614653
    workflows.argoproj.io/node-name: wonderful-rhino(0)
    workflows.argoproj.io/outputs: '{"artifacts":[{"name":"main-logs","s3":{"key":"wonderful-rhino/wonderful-rhino-argosay-67614653/main.log"}}]}'
    workflows.argoproj.io/report-outputs-completed: "true"
% kubectl logs wonderful-rhino-argosay-67614653 --container wait
time="2024-03-06T02:52:35.289Z" level=info msg="Starting Workflow Executor" version=v3.5.5
time="2024-03-06T02:52:35.291Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2024-03-06T02:52:35.291Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=default podName=wonderful-rhino-argosay-67614653 templateName=argosay version="&Version{Version:v3.5.5,BuildDate:2024-02-29T21:00:43Z,GitCommit:c80b2e91ebd7e7f604e88442f45ec630380effa0,GitTag:v3.5.5,GitTreeState:clean,GoVersion:go1.21.7,Compiler:gc,Platform:linux/arm64,}"
time="2024-03-06T02:52:35.294Z" level=warning msg="failed to patch task set, falling back to legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" error="workflowtaskresults.argoproj.io is forbidden: User \"system:serviceaccount:default:argo\" cannot create resource \"workflowtaskresults\" in API group \"argoproj.io\" in the namespace \"default\""
time="2024-03-06T02:52:35.301Z" level=info msg="Starting deadline monitor"
time="2024-03-06T02:52:46.308Z" level=info msg="Main container completed" error="<nil>"
time="2024-03-06T02:52:46.308Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2024-03-06T02:52:46.308Z" level=info msg="No output parameters"
time="2024-03-06T02:52:46.308Z" level=info msg="No output artifacts"
time="2024-03-06T02:52:46.309Z" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: wonderful-rhino/wonderful-rhino-argosay-67614653/main.log"
time="2024-03-06T02:52:46.314Z" level=info msg="Creating minio client using static credentials" endpoint=s3.amazonaws.com
time="2024-03-06T02:52:46.314Z" level=info msg="Saving file to s3" bucket=panicboat-sandbox-723535945756 endpoint=s3.amazonaws.com key=wonderful-rhino/wonderful-rhino-argosay-67614653/main.log path=/tmp/argo/outputs/logs/main.log
time="2024-03-06T02:52:46.508Z" level=info msg="Save artifact" artifactName=main-logs duration=198.899084ms error="<nil>" key=wonderful-rhino/wonderful-rhino-argosay-67614653/main.log
time="2024-03-06T02:52:46.508Z" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/logs/main.log
time="2024-03-06T02:52:46.508Z" level=info msg="Successfully saved file: /tmp/argo/outputs/logs/main.log"
time="2024-03-06T02:52:46.510Z" level=warning msg="failed to patch task set, falling back to legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" error="workflowtaskresults.argoproj.io is forbidden: User \"system:serviceaccount:default:argo\" cannot create resource \"workflowtaskresults\" in API group \"argoproj.io\" in the namespace \"default\""
time="2024-03-06T02:52:46.522Z" level=info msg="Alloc=11144 TotalAlloc=17418 Sys=24421 NumGC=4 Goroutines=10"
time="2024-03-06T02:52:46.523Z" level=warning msg="failed to patch task set, falling back to legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" error="workflowtaskresults.argoproj.io \"wonderful-rhino-67614653\" is forbidden: User \"system:serviceaccount:default:argo\" cannot patch resource \"workflowtaskresults\" in API group \"argoproj.io\" in the namespace \"default\""
time="2024-03-06T02:52:46.530Z" level=info msg="Deadline monitor stopped"
time="2024-03-06T02:52:46.530Z" level=info msg="stopping progress monitor (context done)" error="context canceled"

shuangkun (Member):

Thanks. Does this workflow work correctly? I see the pod has the AnnotationKeyReportOutputsCompleted annotation. If the wait container is only missing the create RBAC, it should still work.

panicboat (Contributor, Author):

This workflow also did not complete.

jswxstw (Member) commented Mar 6, 2024

Thanks. Does this workflow work correctly? I see the pod has the AnnotationKeyReportOutputsCompleted annotation. If the wait container is only missing the create RBAC, it should still work.

We were investigating in the wrong direction. The main issue was that the workflow example provided in the issue was not the actual one being executed.
This bug can be reproduced by the following workflow:

metadata:
  name: wonderful-rhino
  labels:
    example: 'true'
spec:
  arguments:
    parameters:
      - name: message
        value: hello argo
  entrypoint: argosay
  templates:
    - name: argosay
      retryStrategy:
        affinity: {}
        limit: 3
        retryPolicy: Always
      inputs:
        parameters:
          - name: message
            value: '{{workflow.parameters.message}}'
      container:
        name: main
        image: argoproj/argosay:v2
        command:
          - /argosay
        args:
          - echo
          - '{{inputs.parameters.message}}'
  ttlStrategy:
    secondsAfterCompletion: 300

jswxstw (Member) commented Mar 6, 2024

@shuangkun pod.name is not equal to node.ID when PodNameV2 is used.

if newState.Succeeded() && woc.wf.Status.IsTaskResultIncomplete(node.ID) {

if x, ok := pod.Annotations[common.AnnotationKeyReportOutputsCompleted]; ok {
    woc.log.Warn("workflow uses legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/")
    resultName := pod.GetName()
    if x == "true" {
        woc.wf.Status.MarkTaskResultComplete(resultName)
    } else {
        woc.wf.Status.MarkTaskResultIncomplete(resultName)
    }
}

Why is this line missing in release v3.5.5?

resultName := woc.nodeID(pod)
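Putting the two together, release-3.5 presumably needs to key the result by node ID rather than pod name, roughly like this (a sketch of the corrected branch assembled from the snippets above, using woc.nodeID as quoted from main):

if x, ok := pod.Annotations[common.AnnotationKeyReportOutputsCompleted]; ok {
    woc.log.Warn("workflow uses legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/")
    // task results are keyed by node ID; with PodNameV2 the pod name is not the node ID
    resultName := woc.nodeID(pod)
    if x == "true" {
        woc.wf.Status.MarkTaskResultComplete(resultName)
    } else {
        woc.wf.Status.MarkTaskResultIncomplete(resultName)
    }
}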

shuangkun (Member) commented Mar 6, 2024

Yes, I found and fixed this problem before.

shuangkun (Member) commented Mar 6, 2024

@isubasinghe Can you have a look at this? I submitted PR #12537 before, and I found this line is correct (podName -> nodeID):
https://github.com/argoproj/argo-workflows/blob/v3.5.5/pkg/apis/workflow/v1alpha1/workflow_types.go#L1943
but this line is missing in 3.5.5:

resultName := woc.nodeID(pod)
Thanks.

@agilgur5 agilgur5 added type/regression Regression from previous behavior (a specific type of bug) area/controller Controller issues, panics P1 High priority. All bugs with >=5 thumbs up that aren’t P0, plus: Any other bugs deemed high priority and removed problem/more information needed Not enough information has been provide to diagnose this issue. labels Mar 6, 2024
agilgur5 (Member) commented Mar 6, 2024

Thanks for investigating @shuangkun and @jswxstw !

I ran a diff and, as suspected, there would've been a merge conflict on that line as there were other changes around it:

# git diff release-3.5 main -- workflow/controller/operator.go
# [...]
-               woc.log.Warn("workflow uses legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/latest/workflow-rbac/")
-               resultName := woc.nodeID(pod)
+               woc.log.Warn("workflow uses legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/")
+               resultName := pod.GetName()
# [...]

So the merge conflict probably got resolved incorrectly unfortunately. I also missed that when I was checking Isitha's work -- that one's definitely a hard one to manually detect while parsing through large diffs.

@shuangkun could you file a PR to the release-3.5 branch to fix this? I'm also wondering (as I did on Slack) if a test should've caught that bug

@agilgur5 agilgur5 added the solution/suggested A solution to the bug has been suggested. Someone needs to implement it. label Mar 6, 2024
shuangkun (Member) commented Mar 7, 2024

@shuangkun could you file a PR to the release-3.5 branch to fix this? I'm also wondering (as I did on Slack) if a test should've caught that bug

#12755. Thanks, I added a PR. Do you mean we should add a test to prevent this from happening? If so, I can add one.

Guillermogsjc commented Mar 7, 2024

On deployments using emissary, this is solved by updating the Role associated with the ServiceAccounts that Argo workflow pods run as, following https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/ (a requirement for >= v3.4).

Just adding this rule to the Role bound to every single SA used by Argo Workflows fixes everything:

  - apiGroups:
      - argoproj.io
    resources:
      - workflowtaskresults
    verbs:
      - create
      - patch
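For completeness, a full Role and RoleBinding might look like this (a minimal sketch; the names, namespace, and ServiceAccount are placeholders for whatever your Workflow pods actually run as):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: argo-executor            # hypothetical name
  namespace: develop-workflow
rules:
  - apiGroups:
      - argoproj.io
    resources:
      - workflowtaskresults
    verbs:
      - create
      - patch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: argo-executor            # hypothetical name
  namespace: develop-workflow
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: argo-executor
subjects:
  - kind: ServiceAccount
    name: default                # replace with the SA your Workflows use
    namespace: develop-workflow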

Regarding documentation, it looks like the need to include this rule everywhere should be spelled out, rather than rolling back code.

By the way, the default argo-role in https://github.com/argoproj/argo-workflows/releases/download/${ARGO_VERSION}/install.yaml also lacks this rule.

Probably a misalignment between the docs, the default installation manifest, and this change made since v3.4?

shuangkun (Member):

Probably a misalignment between the docs, the default installation manifest, and this change made since v3.4?

Yes, it is now possible to solve the problem by adding the permissions so that executors report outputs via workflowtaskresults instead of pod annotations.
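To confirm the executor is actually creating them, you can list the WorkflowTaskResults for the workflow, e.g. (namespace from the original report; the label selector assumes the usual workflows.argoproj.io/workflow label on the results):

kubectl get workflowtaskresults.argoproj.io -n develop-workflow -l workflows.argoproj.io/workflow=wonderful-rhino -o yaml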

jswxstw (Member) commented Mar 7, 2024

By the way, the default argo-role in https://github.com/argoproj/argo-workflows/releases/download/${ARGO_VERSION}/install.yaml, lacks too the rule.

This rule is in the executor Role if you use the namespace-install.
However, this issue shows that the cluster-install is used, so this rule is only in the admin ClusterRole and it needs to be bound.

agilgur5 (Member) commented Mar 12, 2024

more than rolling back code.

The PR doesn't roll back code, it fixes a merge conflict that was incorrectly resolved in 3.5.5.

Regarding documentation, it looks like the need to include this rule everywhere should be spelled out

There were some small changes to the docs coming out of #12391, specifically #12445 and then #12680. I hope that is enough, but perhaps it needs to be referenced in other places too. I am thinking of adding it to the FAQ.
Per #12391 (comment), the RBAC was always listed as needed in the docs, but prior to 3.5.2, some simple steps could bypass it, hence the recent confusion.

By the way, the default argo-role in https://github.com/argoproj/argo-workflows/releases/download/${ARGO_VERSION}/install.yaml also lacks this rule.

This rule is in the executor Role if you use the namespace-install. However, this issue shows that the cluster-install is used, so this rule is only in the admin ClusterRole and it needs to be bound.

I think there's a misunderstanding here, only the quick start manifests contain a Role for the Executor.

Neither the cluster-install (which is install.yaml) nor the namespace-install provide a Role for the Executor. That is up to the user to set up, for multiple reasons. The workflowtaskresults RBAC is the minimum required, but more could be necessary depending on your use-case (e.g. creating a k8s resource). In the quick start, the Role is bound to the default SA, but in a production install, default shouldn't be used and what SA a user uses is not known ahead of time by the manifests. (Note that the Helm charts, which can be configured, have an optional Executor SA)

The argo-role from the cluster-install is the Controller's Role -- it binds to the argo SA -- not the Executor's Role.

agilgur5 (Member) commented Mar 12, 2024

#12755. Thanks, I added a PR. Do you mean we should add a test to prevent this from happening? If so, I can add one.

@shuangkun yes I meant adding a regression test, specifically checking that the nodeID is being used. Does pod.GetName instead of nodeID not cause other issues?

Going to merge the PR cleanly for now as it is a backport, but if you could add a test in a separate PR to main that'd be great. Although I am now wondering... do we already have a test for this in main but not in the release branch? 🤔
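A regression test could be as small as checking that completion tracked under the pod name does not satisfy a check keyed by node ID, for example (a rough sketch only; it assumes the MarkTaskResult*/IsTaskResultIncomplete helpers quoted above can be exercised directly on a WorkflowStatus, and the test name and IDs are illustrative):

import (
    "testing"

    "github.com/stretchr/testify/assert"

    wfv1 "github.com/argoproj/argo-workflows/v3/pkg/apis/workflow/v1alpha1"
)

func TestTaskResultCompletionIsKeyedByNodeID(t *testing.T) {
    var status wfv1.WorkflowStatus
    nodeID := "wonderful-rhino-67614653"          // node.ID, what the controller checks
    podName := "wonderful-rhino-argosay-67614653" // PodNameV2 pod name, differs from node.ID

    // Completion recorded under the pod name must not clear the state keyed by node ID.
    status.MarkTaskResultIncomplete(nodeID)
    status.MarkTaskResultComplete(podName)
    assert.True(t, status.IsTaskResultIncomplete(nodeID))

    // Marking completion under the node ID should clear it.
    status.MarkTaskResultComplete(nodeID)
    assert.False(t, status.IsTaskResultIncomplete(nodeID))
}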

agilgur5 pushed a commit that referenced this issue Mar 12, 2024
@agilgur5 agilgur5 changed the title Workflow does not complete even though Pod completed Workflow does not complete even though Pod completed in 3.5.5 Mar 12, 2024
@agilgur5 agilgur5 added the solution/workaround There's a workaround, might not be great, but exists label Mar 12, 2024
@agilgur5 agilgur5 added this to the v3.5.x patches milestone Apr 3, 2024
agilgur5 (Member):

@shuangkun following up on the above regarding tests
