Daemon pods keep running after workflow DAG fails when using failFast: true #10313

Open
2 of 3 tasks
igorcalabria opened this issue Jan 5, 2023 · 1 comment · May be fixed by #12871
Labels
P3 Low priority type/bug

igorcalabria commented Jan 5, 2023

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

I expected the daemon pod to be terminated when the workflow fails, but it is not. The workflow is correctly marked as failed, but the daemon pod keeps running until the workflow is deleted. I think the controller tries to delete the daemon, but the delete call gets a 404 response (controller logs):

time="2023-01-05T18:17:43.909Z" level=info msg="Checking daemoned children of " namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:43.914Z" level=info msg="cleaning up pod" action=deletePod key=argo/daemon-nginx-7m8fc-1340600742-agent/deletePod
time="2023-01-05T18:17:43.915Z" level=info msg="Delete pods 404"
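To see which pods the controller actually queued for deletion, the logs can be filtered for `action=deletePod`. The following is a minimal sketch, not part of Argo: `parse_logfmt` and `delete_targets` are hypothetical helper names, and the `<namespace>/<pod-name>/<action>` layout of `key` is an assumption inferred from the log lines in this report.

```python
import shlex


def parse_logfmt(line):
    """Parse one logfmt line (key=value pairs, values optionally double-quoted)."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def delete_targets(lines):
    """Return the distinct pod names that appear in action=deletePod entries."""
    pods = []
    for line in lines:
        fields = parse_logfmt(line)
        if fields.get("action") == "deletePod":
            # Assumed key layout: <namespace>/<pod-name>/<action>
            _namespace, pod, _action = fields["key"].split("/")
            if pod not in pods:
                pods.append(pod)
    return pods


# Lines copied from the controller logs in this report.
CONTROLLER_LOG = [
    'time="2023-01-05T18:17:03.886Z" level=info msg="Created pod: daemon-nginx-7m8fc.nginx-server (daemon-nginx-7m8fc-nginx-server-1217350964)" namespace=argo workflow=daemon-nginx-7m8fc',
    'time="2023-01-05T18:17:23.900Z" level=info msg="cleaning up pod" action=labelPodCompleted key=argo/daemon-nginx-7m8fc-nginx-client-3898481205/labelPodCompleted',
    'time="2023-01-05T18:17:33.901Z" level=info msg="cleaning up pod" action=deletePod key=argo/daemon-nginx-7m8fc-1340600742-agent/deletePod',
    'time="2023-01-05T18:17:43.914Z" level=info msg="cleaning up pod" action=deletePod key=argo/daemon-nginx-7m8fc-1340600742-agent/deletePod',
    'time="2023-01-05T18:17:43.915Z" level=info msg="Delete pods 404"',
]

DAEMON_POD = "daemon-nginx-7m8fc-nginx-server-1217350964"

targets = delete_targets(CONTROLLER_LOG)
print(targets)                 # ['daemon-nginx-7m8fc-1340600742-agent']
print(DAEMON_POD in targets)   # False
```

On this excerpt, the only `deletePod` target is the `-agent` pod. The nginx-server daemon pod never appears as a deletion target in the posted logs, so the 404 may belong to the agent pod cleanup rather than to the daemon. This is an observation about the logs, not a confirmed root cause.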

Some other notes:

  • Unrelated to this issue: should daemon containers count towards the DAG parallelism? In this case I wanted a parallelism of one, but with that setting the workflow gets stuck running just the daemon task.
  • Without failFast, the daemon pod is properly deleted.
  • Even with just one item in withParam, the daemon pod is not deleted when the client task fails.

Version

latest

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: daemon-nginx-
  namespace: argo
spec:
  entrypoint: daemon-nginx-example
  templates:
  - name: daemon-nginx-example
    failFast: true
    parallelism: 2
    dag:
      tasks:
      - name: nginx-server
        template: nginx-server
      - name: nginx-client
        template: nginx-client
        depends: "nginx-server"
        withParam: |
          ["one", "two"]
        arguments:
          parameters:
          - name: server-ip
            value: "{{tasks.nginx-server.ip}}"
  - name: nginx-server
    daemon: true
    container:
      image: nginx:1.13
      readinessProbe:
        httpGet:
          path: /
          port: 80
        initialDelaySeconds: 2
        timeoutSeconds: 1
  - name: nginx-client
    inputs:
      parameters:
      - name: server-ip
    container:
      image: appropriate/curl:latest
      command: ["/bin/sh", "-c"]
      # Fail on purpose: "aaaaaaaaaa" is not a valid command, so the shell exits with code 127
      args: ["aaaaaaaaaa"]

Logs from the workflow controller

time="2023-01-05T18:17:03.870Z" level=info msg="Processing workflow" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:03.880Z" level=info msg="Updated phase  -> Running" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:03.880Z" level=info msg="DAG node daemon-nginx-7m8fc initialized Running" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:03.880Z" level=info msg="All of node daemon-nginx-7m8fc.nginx-server dependencies [] completed" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:03.880Z" level=info msg="Pod node daemon-nginx-7m8fc-1217350964 initialized Pending" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:03.886Z" level=info msg="Created pod: daemon-nginx-7m8fc.nginx-server (daemon-nginx-7m8fc-nginx-server-1217350964)" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:03.886Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:03.886Z" level=info msg=reconcileAgentPod namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:03.890Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=807268 workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:13.886Z" level=info msg="Processing workflow" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:13.887Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:13.887Z" level=info msg="Node became daemoned" namespace=argo nodeId=daemon-nginx-7m8fc-1217350964 workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:13.887Z" level=info msg="node changed" namespace=argo new.message= new.phase=Running new.progress=0/1 nodeID=daemon-nginx-7m8fc-1217350964 old.message= old.phase=Pending old.progress=0/1 workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:13.887Z" level=info msg="TaskGroup node daemon-nginx-7m8fc-3902071824 initialized Running (message: )" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:13.887Z" level=info msg="All of node daemon-nginx-7m8fc.nginx-client(0:one) dependencies [nginx-server] completed" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:13.887Z" level=info msg="Pod node daemon-nginx-7m8fc-3898481205 initialized Pending" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:13.890Z" level=info msg="Created pod: daemon-nginx-7m8fc.nginx-client(0:one) (daemon-nginx-7m8fc-nginx-client-3898481205)" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:13.890Z" level=info msg="All of node daemon-nginx-7m8fc.nginx-client(1:two) dependencies [nginx-server] completed" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:13.890Z" level=info msg="template (node daemon-nginx-7m8fc) active children parallelism exceeded 2" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:13.890Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:13.890Z" level=info msg=reconcileAgentPod namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:13.899Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=807303 workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:23.891Z" level=info msg="Processing workflow" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:23.891Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:23.891Z" level=info msg="node changed" namespace=argo new.message="Error (exit code 127)" new.phase=Failed new.progress=0/1 nodeID=daemon-nginx-7m8fc-3898481205 old.message= old.phase=Pending old.progress=0/1 workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:23.891Z" level=info msg="node unchanged" namespace=argo nodeID=daemon-nginx-7m8fc-1217350964 workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:23.891Z" level=info msg="node daemon-nginx-7m8fc phase Running -> Failed" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:23.891Z" level=info msg="node daemon-nginx-7m8fc message: template has failed or errored children and failFast enabled" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:23.891Z" level=info msg="node daemon-nginx-7m8fc finished: 2023-01-05 18:17:23.891758459 +0000 UTC" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:23.891Z" level=error msg="error in entry template execution" error="Max parallelism reached" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:23.895Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=807339 workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:23.900Z" level=info msg="cleaning up pod" action=labelPodCompleted key=argo/daemon-nginx-7m8fc-nginx-client-3898481205/labelPodCompleted
time="2023-01-05T18:17:33.895Z" level=info msg="Processing workflow" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:33.895Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:33.895Z" level=info msg="node unchanged" namespace=argo nodeID=daemon-nginx-7m8fc-1217350964 workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:33.895Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:33.895Z" level=info msg=reconcileAgentPod namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:33.895Z" level=info msg="Updated phase Running -> Failed" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:33.895Z" level=info msg="Updated message  -> template has failed or errored children and failFast enabled" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:33.895Z" level=info msg="Checking daemoned children of " namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:33.901Z" level=info msg="cleaning up pod" action=deletePod key=argo/daemon-nginx-7m8fc-1340600742-agent/deletePod
time="2023-01-05T18:17:33.908Z" level=info msg="Workflow update successful" namespace=argo phase=Failed resourceVersion=807359 workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:43.909Z" level=info msg="Processing workflow" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:43.909Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:43.909Z" level=info msg="node unchanged" namespace=argo nodeID=daemon-nginx-7m8fc-1217350964 workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:43.909Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:43.909Z" level=info msg=reconcileAgentPod namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:43.909Z" level=info msg="Checking daemoned children of " namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:43.914Z" level=info msg="cleaning up pod" action=deletePod key=argo/daemon-nginx-7m8fc-1340600742-agent/deletePod
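The controller logs above show the DAG node failing one reconciliation before the workflow phase itself changes to Failed. A minimal sketch for measuring that gap follows; `find_event` is a hypothetical helper, and the regular expression assumes `time` and `msg` are both double-quoted, which is not true for every line (for example `msg=reconcileAgentPod`), so such lines are simply skipped.

```python
import re
from datetime import datetime

# Matches the quoted time and msg fields of a controller log line.
LINE = re.compile(r'time="(?P<time>[^"]+)".*?msg="(?P<msg>[^"]*)"')


def find_event(lines, needle):
    """Return the timestamp of the first line whose msg contains `needle`, or None."""
    for line in lines:
        match = LINE.search(line)
        if match and needle in match.group("msg"):
            return datetime.strptime(match.group("time"), "%Y-%m-%dT%H:%M:%S.%fZ")
    return None


# Lines copied from the controller logs in this report.
LOG = [
    'time="2023-01-05T18:17:23.891Z" level=info msg="node daemon-nginx-7m8fc phase Running -> Failed" namespace=argo workflow=daemon-nginx-7m8fc',
    'time="2023-01-05T18:17:23.891Z" level=error msg="error in entry template execution" error="Max parallelism reached" namespace=argo workflow=daemon-nginx-7m8fc',
    'time="2023-01-05T18:17:23.895Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=807339 workflow=daemon-nginx-7m8fc',
    'time="2023-01-05T18:17:33.895Z" level=info msg="Updated phase Running -> Failed" namespace=argo workflow=daemon-nginx-7m8fc',
]

node_failed = find_event(LOG, "node daemon-nginx-7m8fc phase Running -> Failed")
workflow_failed = find_event(LOG, "Updated phase Running -> Failed")
gap = (workflow_failed - node_failed).total_seconds()
print(f"{gap:.3f} s between DAG node failure and workflow phase update")
```

In this run the gap is about ten seconds. In between, the controller logged the "Max parallelism reached" error and persisted the workflow with phase=Running. Whether that error path is related to the daemon pod being left behind is not established by the logs alone.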

Logs from your workflow's wait container

time="2023-01-05T18:17:17.185Z" level=info msg="Starting Workflow Executor" version=untagged
time="2023-01-05T18:17:17.188Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2023-01-05T18:17:17.188Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=argo podName=daemon-nginx-7m8fc-nginx-client-3898481205 template="{\"name\":\"nginx-client\",\"inputs\":{\"parameters\":[{\"name\":\"server-ip\",\"value\":\"10.244.0.13\"}]},\"outputs\":{},\"metadata\":{},\"container\":{\"name\":\"\",\"image\":\"appropriate/curl:latest\",\"command\":[\"/bin/sh\",\"-c\"],\"args\":[\"aaaaaaaaaa\"],\"resources\":{}}}" version="&Version{Version:untagged,BuildDate:2023-01-05T16:21:00Z,GitCommit:0f58387c79728b84037aa96221d1c97a974402a4,GitTag:untagged,GitTreeState:clean,GoVersion:go1.18.9,Compiler:gc,Platform:linux/amd64,}"
time="2023-01-05T18:17:17.188Z" level=info msg="Starting deadline monitor"
time="2023-01-05T18:17:20.190Z" level=info msg="Main container completed" error="<nil>"
time="2023-01-05T18:17:20.190Z" level=info msg="Deadline monitor stopped"
time="2023-01-05T18:17:20.190Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2023-01-05T18:17:20.190Z" level=info msg="No output parameters"
time="2023-01-05T18:17:20.190Z" level=info msg="No output artifacts"
time="2023-01-05T18:17:20.190Z" level=info msg="Alloc=6340 TotalAlloc=12280 Sys=19666 NumGC=4 Goroutines=5"
time="2023-01-05T18:17:06.942Z" level=info msg="Starting Workflow Executor" version=untagged
time="2023-01-05T18:17:06.944Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2023-01-05T18:17:06.944Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=argo podName=daemon-nginx-7m8fc-nginx-server-1217350964 template="{\"name\":\"nginx-server\",\"inputs\":{},\"outputs\":{},\"metadata\":{},\"daemon\":true,\"container\":{\"name\":\"\",\"image\":\"nginx:1.13\",\"resources\":{},\"readinessProbe\":{\"httpGet\":{\"path\":\"/\",\"port\":80},\"initialDelaySeconds\":2,\"timeoutSeconds\":1}}}" version="&Version{Version:untagged,BuildDate:2023-01-05T16:21:00Z,GitCommit:0f58387c79728b84037aa96221d1c97a974402a4,GitTag:untagged,GitTreeState:clean,GoVersion:go1.18.9,Compiler:gc,Platform:linux/amd64,}"
time="2023-01-05T18:17:06.944Z" level=info msg="Starting deadline monitor"
sarabala1979 added the P3 Low priority label Jan 13, 2023

stale bot commented Sep 17, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

stale bot added the problem/stale label Sep 17, 2023
terrytangyuan removed the problem/stale label Sep 20, 2023
shuangkun added a commit to shuangkun/argo-workflows that referenced this issue Apr 2, 2024
…enabled. Fixes:argoproj#10313

Signed-off-by: shuangkun <tsk2013uestc@163.com>