Activities get stuck on "Created" if WORKER_GROUP doesn't exist / not running #3706

Open
johnkm516 opened this issue May 13, 2024 · 2 comments
Labels: bug

@johnkm516

Describe the issue

Activities that use a workerGroup in Kestra Enterprise stay in the "Created" status indefinitely if the worker group does not exist or is not running. The activity ignores all timeouts, and the flow stays in the "Running" status until a user kills it.

Example:

id: worker_group_test
namespace: dev

labels:
  env: dev

tasks:

  - id: wait
    type: io.kestra.plugin.scripts.shell.Commands
    commands:
      - sleep 10
    docker: {}
    runner: PROCESS
    timeout: 1
    workerGroup:
      key: NONEXISTANT_WORKER_GROUP
  - id: print_status
    type: io.kestra.core.tasks.log.Log
    message: hello

Expected behavior:

The activity should fail automatically if it stays in the "Created" status for a set amount of time, or it should respect its own timeout.

Environment

  • Kestra Version: 0.16.6
  • Operating System (OS/Docker/Kubernetes): Docker
@johnkm516 added the bug label on May 13, 2024
@loicmathieu
Member

The timeout is handled by the Worker. Because the worker group doesn't exist, no worker will ever pick up the task, so the timeout can never be hit.

We have an open issue in our internal repository about this, but I'll keep this public one open so that you can get feedback.
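
To illustrate the explanation above, here is a toy model of the behavior, not Kestra's actual code. The `TaskRun` class, the `poll` function, and the state names are hypothetical; the only point is that the timeout clock starts when a worker picks up the task, so a task routed to a missing worker group never leaves "Created".

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class TaskRun:
    """Hypothetical stand-in for a task run; not a Kestra class."""
    worker_group: str
    timeout_s: float
    state: str = "CREATED"
    started_at: Optional[float] = None


def poll(task: TaskRun, live_worker_groups: Set[str], now: float) -> str:
    """Simulate one scheduling tick and return the task state afterwards."""
    if task.state == "CREATED":
        # Only a worker from the requested group can pick the task up.
        if task.worker_group in live_worker_groups:
            task.state = "RUNNING"
            task.started_at = now  # the timeout clock starts here
    elif task.state == "RUNNING":
        # The timeout is enforced on the worker side, after pickup.
        if now - task.started_at >= task.timeout_s:
            task.state = "FAILED"
    return task.state
```

With `live_worker_groups={"default"}`, a task whose `worker_group` is `"NONEXISTANT_WORKER_GROUP"` stays in `"CREATED"` no matter how many ticks pass, while a task on `"default"` starts running and then fails once its timeout elapses.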

@loicmathieu added this to the v0.18.0 milestone on May 13, 2024
@johnkm516
Author

Hi @loicmathieu,
Thank you for your response.

Since worker groups can run on different server racks, I think the executor should enforce some kind of timeout outside of the task execution, so that the flow fails if the task cannot be executed on the worker group. If the flow doesn't fail and instead continues indefinitely, it becomes difficult to monitor, and hard to tell whether a flow is failing because of issues on a different VM or server rack where the worker group is located.
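
A minimal sketch of the kind of executor-side check I have in mind, under the assumption that the executor can list tasks still waiting in "Created" along with their creation time. The `CreatedTask` type, `find_stale_tasks`, and the `max_wait_s` setting are all hypothetical names for illustration, not existing Kestra APIs or options.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class CreatedTask:
    """Hypothetical record of a task still waiting for a worker."""
    task_id: str
    worker_group: str
    created_at: float  # seconds since some epoch


def find_stale_tasks(
    tasks: Iterable[CreatedTask], now: float, max_wait_s: float
) -> List[str]:
    """Return ids of tasks that have waited in "Created" longer than max_wait_s.

    The caller (the executor, in this sketch) would mark these as failed so
    the flow stops instead of staying in "Running" forever.
    """
    return [t.task_id for t in tasks if now - t.created_at > max_wait_s]
```

The deadline here is measured from task creation rather than from pickup, so it does not depend on any worker being alive in the target group.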
