Workflow does not complete even though Pod completed in 3.5.5 #12733
Comments
It seems like argo-workflows/workflow/executor/executor.go Lines 820 to 828 in 7286d49
argo-workflows/workflow/executor/executor.go Lines 797 to 807 in 7286d49
argo-workflows/workflow/controller/taskresult.go Lines 66 to 73 in 7286d49
@shuangkun It seems like a regression bug related to #12537. |
We were experiencing the same issue, so we rolled back to version |
Was the pod set |
I reproduced this bug by modifying the rules:
- apiGroups:
  - argoproj.io
  resources:
  - workflowtaskresults
  verbs:
  - patch
logs of
time="2024-03-05T11:52:21.280Z" level=info msg="Starting Workflow Executor" version=untagged
time="2024-03-05T11:52:21.285Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2024-03-05T11:52:21.285Z" level=info msg="Executor initialized" deadline="2024-03-05 11:57:19 +0000 UTC" includeScriptOutput=false namespace=argo podName=wonderful-rhino templateName=argosay version="&Version{Version:untagged,BuildDate:2024-03-01T07:42:24Z,GitCommit:de0a271708cfa990e7232afdb83dbc8825933930,GitTag:untagged,GitTreeState:clean,GoVersion:go1.21.7,Compiler:gc,Platform:linux/amd64,}"
time="2024-03-05T11:52:21.290Z" level=warning msg="failed to patch task set, falling back to legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/latest/workflow-rbac/" error="workflowtaskresults.argoproj.io is forbidden: User \"system:serviceaccount:argo:default\" cannot create resource \"workflowtaskresults\" in API group \"argoproj.io\" in the namespace \"argo\""
time="2024-03-05T11:52:21.300Z" level=info msg="Starting deadline monitor"
time="2024-03-05T11:52:23.302Z" level=info msg="Main container completed" error="<nil>"
time="2024-03-05T11:52:23.302Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2024-03-05T11:52:23.302Z" level=info msg="No output parameters"
time="2024-03-05T11:52:23.302Z" level=info msg="No output artifacts"
time="2024-03-05T11:52:23.302Z" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: wonderful-rhino/wonderful-rhino/main.log"
time="2024-03-05T11:52:23.302Z" level=info msg="Creating minio client using static credentials" endpoint="minio:9000"
time="2024-03-05T11:52:23.302Z" level=info msg="Saving file to s3" bucket=my-bucket endpoint="minio:9000" key=wonderful-rhino/wonderful-rhino/main.log path=/tmp/argo/outputs/logs/main.log
time="2024-03-05T11:52:23.312Z" level=info msg="Save artifact" artifactName=main-logs duration=9.81889ms error="<nil>" key=wonderful-rhino/wonderful-rhino/main.log
time="2024-03-05T11:52:23.312Z" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/logs/main.log
time="2024-03-05T11:52:23.312Z" level=info msg="Successfully saved file: /tmp/argo/outputs/logs/main.log"
time="2024-03-05T11:52:23.313Z" level=warning msg="failed to patch task set, falling back to legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/latest/workflow-rbac/" error="workflowtaskresults.argoproj.io is forbidden: User \"system:serviceaccount:argo:default\" cannot create resource \"workflowtaskresults\" in API group \"argoproj.io\" in the namespace \"argo\""
time="2024-03-05T11:52:23.323Z" level=info msg="Alloc=6796 TotalAlloc=12606 Sys=20325 NumGC=5 Goroutines=10"
time="2024-03-05T11:52:23.325Z" level=warning msg="Non-transient error: workflowtaskresults.argoproj.io \"wonderful-rhino\" not found"
time="2024-03-05T11:52:23.325Z" level=error msg="executor error: workflowtaskresults.argoproj.io \"wonderful-rhino\" not found"
I think this issue may be |
Yes, we had the same idea, and I just reproduced it using the method. |
The environment at the time of the report has been reverted to v3.5.4.
% k logs wonderful-rhino-argosay-67614653
time="2024-03-06T02:52:44.379Z" level=info msg="capturing logs" argo=true
hello argo
time="2024-03-06T02:52:45.381Z" level=info msg="sub-process exited" argo=true error="<nil>"
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubectl.kubernetes.io/default-container: main
    workflows.argoproj.io/node-id: wonderful-rhino-67614653
    workflows.argoproj.io/node-name: wonderful-rhino(0)
    workflows.argoproj.io/outputs: '{"artifacts":[{"name":"main-logs","s3":{"key":"wonderful-rhino/wonderful-rhino-argosay-67614653/main.log"}}]}'
    workflows.argoproj.io/report-outputs-completed: "true"
% kubectl logs wonderful-rhino-argosay-67614653 --container wait
time="2024-03-06T02:52:35.289Z" level=info msg="Starting Workflow Executor" version=v3.5.5
time="2024-03-06T02:52:35.291Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2024-03-06T02:52:35.291Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=default podName=wonderful-rhino-argosay-67614653 templateName=argosay version="&Version{Version:v3.5.5,BuildDate:2024-02-29T21:00:43Z,GitCommit:c80b2e91ebd7e7f604e88442f45ec630380effa0,GitTag:v3.5.5,GitTreeState:clean,GoVersion:go1.21.7,Compiler:gc,Platform:linux/arm64,}"
time="2024-03-06T02:52:35.294Z" level=warning msg="failed to patch task set, falling back to legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" error="workflowtaskresults.argoproj.io is forbidden: User \"system:serviceaccount:default:argo\" cannot create resource \"workflowtaskresults\" in API group \"argoproj.io\" in the namespace \"default\""
time="2024-03-06T02:52:35.301Z" level=info msg="Starting deadline monitor"
time="2024-03-06T02:52:46.308Z" level=info msg="Main container completed" error="<nil>"
time="2024-03-06T02:52:46.308Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2024-03-06T02:52:46.308Z" level=info msg="No output parameters"
time="2024-03-06T02:52:46.308Z" level=info msg="No output artifacts"
time="2024-03-06T02:52:46.309Z" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: wonderful-rhino/wonderful-rhino-argosay-67614653/main.log"
time="2024-03-06T02:52:46.314Z" level=info msg="Creating minio client using static credentials" endpoint=s3.amazonaws.com
time="2024-03-06T02:52:46.314Z" level=info msg="Saving file to s3" bucket=panicboat-sandbox-723535945756 endpoint=s3.amazonaws.com key=wonderful-rhino/wonderful-rhino-argosay-67614653/main.log path=/tmp/argo/outputs/logs/main.log
time="2024-03-06T02:52:46.508Z" level=info msg="Save artifact" artifactName=main-logs duration=198.899084ms error="<nil>" key=wonderful-rhino/wonderful-rhino-argosay-67614653/main.log
time="2024-03-06T02:52:46.508Z" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/logs/main.log
time="2024-03-06T02:52:46.508Z" level=info msg="Successfully saved file: /tmp/argo/outputs/logs/main.log"
time="2024-03-06T02:52:46.510Z" level=warning msg="failed to patch task set, falling back to legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" error="workflowtaskresults.argoproj.io is forbidden: User \"system:serviceaccount:default:argo\" cannot create resource \"workflowtaskresults\" in API group \"argoproj.io\" in the namespace \"default\""
time="2024-03-06T02:52:46.522Z" level=info msg="Alloc=11144 TotalAlloc=17418 Sys=24421 NumGC=4 Goroutines=10"
time="2024-03-06T02:52:46.523Z" level=warning msg="failed to patch task set, falling back to legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" error="workflowtaskresults.argoproj.io \"wonderful-rhino-67614653\" is forbidden: User \"system:serviceaccount:default:argo\" cannot patch resource \"workflowtaskresults\" in API group \"argoproj.io\" in the namespace \"default\""
time="2024-03-06T02:52:46.530Z" level=info msg="Deadline monitor stopped"
time="2024-03-06T02:52:46.530Z" level=info msg="stopping progress monitor (context done)" error="context canceled" |
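For anyone triaging similar wait container logs, here is a minimal Go sketch that pulls the denied verbs out of the "failed to patch task set" warnings. It is not part of Argo; the function name `missingVerbs` and the regular expression are illustrative assumptions based only on the log text quoted above, and the sample lines in `main` are abbreviated copies of those warnings.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// deniedVerb matches the RBAC denial text seen in the warnings above, e.g.
//   cannot create resource \"workflowtaskresults\"
// The backslashes are optional so that unescaped log text also matches.
var deniedVerb = regexp.MustCompile(`cannot (\w+) resource \\?"workflowtaskresults\\?"`)

// missingVerbs (hypothetical helper) returns the sorted, de-duplicated verbs
// that were refused for workflowtaskresults in the given log text.
func missingVerbs(logs string) []string {
	seen := map[string]bool{}
	for _, line := range strings.Split(logs, "\n") {
		// Only the fallback warnings are relevant here.
		if !strings.Contains(line, "failed to patch task set") {
			continue
		}
		if m := deniedVerb.FindStringSubmatch(line); m != nil {
			seen[m[1]] = true
		}
	}
	verbs := make([]string, 0, len(seen))
	for v := range seen {
		verbs = append(verbs, v)
	}
	sort.Strings(verbs)
	return verbs
}

func main() {
	// Abbreviated sample lines modelled on the wait container logs in this thread.
	logs := `level=warning msg="failed to patch task set, falling back to legacy/insecure pod patch" error="... cannot create resource \"workflowtaskresults\" in API group \"argoproj.io\""
level=warning msg="failed to patch task set, falling back to legacy/insecure pod patch" error="... cannot patch resource \"workflowtaskresults\" in API group \"argoproj.io\""`
	fmt.Println(missingVerbs(logs)) // prints: [create patch]
}
```

For the v3.5.5 logs above this reports both `create` and `patch` as denied for the executor's service account.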
Thanks. Does this workflow work well? I see the pod has the annotation AnnotationKeyReportOutputsCompleted. Wait container only |
This workflow also did not complete. |
We were investigating in the wrong direction. The main issue was that the workflow example provided in the issue was not the actual one being executed.
metadata:
  name: wonderful-rhino
  labels:
    example: 'true'
spec:
  arguments:
    parameters:
    - name: message
      value: hello argo
  entrypoint: argosay
  templates:
  - name: argosay
    retryStrategy:
      affinity: {}
      limit: 3
      retryPolicy: Always
    inputs:
      parameters:
      - name: message
        value: '{{workflow.parameters.message}}'
    container:
      name: main
      image: argoproj/argosay:v2
      command:
      - /argosay
      args:
      - echo
      - '{{inputs.parameters.message}}'
  ttlStrategy:
    secondsAfterCompletion: 300 |
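To illustrate why the `retryStrategy` matters, here is a rough Go sketch of how the pod name and the node ID can diverge. The naming scheme is an assumption inferred from the values in this thread (node name `wonderful-rhino(0)`, node ID `wonderful-rhino-67614653`, pod `wonderful-rhino-argosay-67614653`, and `podName=wonderful-rhino` in the run without retries). The helpers `nodeID`, `podName` and `hashSuffix` are hypothetical, and the use of FNV-1a is also an assumption, so the numeric suffix is not claimed to match Argo's.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashSuffix returns a 32-bit FNV-1a hash of the node name as a decimal string.
// Assumption: the real hash function may differ; only the shape matters here.
func hashSuffix(nodeName string) string {
	h := fnv.New32a()
	_, _ = h.Write([]byte(nodeName))
	return fmt.Sprint(h.Sum32())
}

// nodeID (hypothetical) models "<workflow>-<hash>", except for the root node,
// whose name equals the workflow name.
func nodeID(workflowName, nodeName string) string {
	if nodeName == workflowName {
		return workflowName
	}
	return workflowName + "-" + hashSuffix(nodeName)
}

// podName (hypothetical) models "<workflow>-<template>-<hash>", with the same
// root-node exception.
func podName(workflowName, templateName, nodeName string) string {
	if nodeName == workflowName {
		return workflowName
	}
	return workflowName + "-" + templateName + "-" + hashSuffix(nodeName)
}

func main() {
	// Without retryStrategy the pod runs the root node: both names coincide.
	fmt.Println(podName("wonderful-rhino", "argosay", "wonderful-rhino") ==
		nodeID("wonderful-rhino", "wonderful-rhino")) // prints: true

	// With retryStrategy the pod runs the child node "wonderful-rhino(0)".
	fmt.Println(podName("wonderful-rhino", "argosay", "wonderful-rhino(0)") ==
		nodeID("wonderful-rhino", "wonderful-rhino(0)")) // prints: false
}
```

Under these assumptions, the workflow originally posted (no retries) would not expose any code path that confuses the pod name with the node ID, while the workflow above would.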
@shuangkun argo-workflows/workflow/controller/operator.go Line 1131 in c80b2e9
argo-workflows/workflow/controller/operator.go Lines 1388 to 1396 in c80b2e9
Why did release-v3.5.5 miss this line? argo-workflows/workflow/controller/operator.go Line 1394 in e00abd1
|
Yes, I found and fixed this problem before. |
@isubasinghe Can you take a look at this? I submitted PR #12537 earlier, and I find this line is correct there (podName -> nodeId): argo-workflows/workflow/controller/operator.go Line 1394 in e00abd1
|
Thanks for investigating @shuangkun and @jswxstw ! I ran a diff and, as suspected, there would've been a merge conflict on that line as there were other changes around it:
# git diff release-3.5 main -- workflow/controller/operator.go
# [...]
- woc.log.Warn("workflow uses legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/latest/workflow-rbac/")
- resultName := woc.nodeID(pod)
+ woc.log.Warn("workflow uses legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/")
+ resultName := pod.GetName()
# [...]
So the merge conflict was probably resolved incorrectly, unfortunately. I also missed that when I was checking Isitha's work; that one is definitely hard to detect manually while parsing through large diffs. @shuangkun could you file a PR to the |
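A deliberately simplified Go model of why the one-line difference in the diff above matters: if completion is recorded under the pod name but later looked up by node ID, the lookup never succeeds. This is only a sketch, not the controller's actual logic; `taskResults`, `markCompleted` and `outputsCompleted` are hypothetical names, and the two identifiers are the ones from the pod annotations earlier in this thread.

```go
package main

import "fmt"

// taskResults is a hypothetical stand-in for the controller's record of which
// task results have reported their outputs, keyed by result name.
type taskResults map[string]bool

// markCompleted records completion under the given result name.
func markCompleted(results taskResults, resultName string) {
	results[resultName] = true
}

// outputsCompleted looks the result up by node ID.
func outputsCompleted(results taskResults, nodeID string) bool {
	return results[nodeID]
}

func main() {
	const nodeID = "wonderful-rhino-67614653"
	const podName = "wonderful-rhino-argosay-67614653"

	// Models: resultName := pod.GetName()
	withPodName := taskResults{}
	markCompleted(withPodName, podName)
	fmt.Println(outputsCompleted(withPodName, nodeID)) // prints: false

	// Models: resultName := woc.nodeID(pod)
	withNodeID := taskResults{}
	markCompleted(withNodeID, nodeID)
	fmt.Println(outputsCompleted(withNodeID, nodeID)) // prints: true
}
```

In this model the first case never sees the node as completed, which is consistent with the symptom reported here: the pod is Completed but the workflow keeps waiting.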
Thanks. I opened a PR: #12755. Do you mean we should add a test to prevent this from happening? If so, I can add a test. |
On deployments using emissary, this is solved by updating the Role associated with the ServiceAccounts that are linked to Argo nodes, as described in https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/ (a requirement for >= v3.4). Just adding this rule to the Role linked to every single SA that is used by Argo Workflows fixes everything:
- apiGroups:
  - argoproj.io
  resources:
  - workflowtaskresults
  verbs:
  - create
  - patch
Regarding documentation, it looks like what is needed is to state everywhere that this rule must be included, rather than rolling back code. By the way, the default Probably a misalignment between the docs, the default installation manifest, and this upgrade having been made from v3.4? |
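As a quick way to reason about the rule above, here is a small self-contained Go sketch that checks whether a set of Role rules grants both `create` and `patch` on `workflowtaskresults`. The `policyRule` struct is a hypothetical stand-in with the same three fields as the YAML rule (it is not the Kubernetes client type), and `missingTaskResultVerbs` is an illustrative helper name. The `*` wildcard handling follows standard Kubernetes RBAC semantics.

```go
package main

import "fmt"

// policyRule is a hypothetical, minimal mirror of the YAML rule above.
type policyRule struct {
	APIGroups []string
	Resources []string
	Verbs     []string
}

// matches reports whether list contains want or the RBAC wildcard "*".
func matches(list []string, want string) bool {
	for _, item := range list {
		if item == want || item == "*" {
			return true
		}
	}
	return false
}

// allows reports whether any single rule grants verb on group/resource.
func allows(rules []policyRule, group, resource, verb string) bool {
	for _, r := range rules {
		if matches(r.APIGroups, group) && matches(r.Resources, resource) && matches(r.Verbs, verb) {
			return true
		}
	}
	return false
}

// missingTaskResultVerbs lists which of create/patch are not granted on
// workflowtaskresults in the argoproj.io API group.
func missingTaskResultVerbs(rules []policyRule) []string {
	var missing []string
	for _, verb := range []string{"create", "patch"} {
		if !allows(rules, "argoproj.io", "workflowtaskresults", verb) {
			missing = append(missing, verb)
		}
	}
	return missing
}

func main() {
	// The reduced rule used earlier in the thread to reproduce the bug.
	reproduction := []policyRule{{
		APIGroups: []string{"argoproj.io"},
		Resources: []string{"workflowtaskresults"},
		Verbs:     []string{"patch"},
	}}
	// The rule recommended in the comment above.
	recommended := []policyRule{{
		APIGroups: []string{"argoproj.io"},
		Resources: []string{"workflowtaskresults"},
		Verbs:     []string{"create", "patch"},
	}}
	fmt.Println(missingTaskResultVerbs(reproduction)) // prints: [create]
	fmt.Println(missingTaskResultVerbs(recommended))  // prints: []
}
```

On a live cluster, `kubectl auth can-i create workflowtaskresults.argoproj.io --as=system:serviceaccount:<namespace>:<sa>` answers the same question for a real service account.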
Yes, it is now possible to solve the problem by granting these permissions, so that executors use workflowtaskresults instead of pod annotations to report outputs. |
This rule is in |
The PR doesn't roll back code, it fixes a merge conflict that was incorrectly resolved in 3.5.5.
There were some small changes to the docs coming out of #12391, specifically #12445 and then #12680. I hope that is enough, but perhaps it needs to be referenced in other places too. I am thinking of adding it to the FAQ.
I think there's a misunderstanding here, only the quick start manifests contain a Role for the Executor. Neither the The |
@shuangkun yes I meant adding a regression test, specifically checking that the Going to merge the PR cleanly for now as it is a backport, but if you could add a test in a separate PR to |
@shuangkun following up on the above regarding tests |
Pre-requisites
:latest
What happened/what did you expect to happen?
Workflow is submitted and waits for a while, but does not finish.
Any Workflow looks the same.
When I check the status of the pod, it says Completed and the logs appear to be complete.
Also, even though the pod does not appear to be running in the UI, the logs can still be retrieved in the UI.
It appears that we can no longer track the status of the Pod.
Version
v3.5.5
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
Logs from the workflow controller
Logs from your workflow's wait container