Artifact GC still acting up for failed Workflows #12845
13 comments · 24 replies
-
I haven't used or worked on Artifact GC, but can try to help
No, I'm reading that as: the Artifact GC Pod failed, so the Workflow has not been cleaned up properly and has that message on it. That's also why it didn't finish deleting (as it failed to clean up), if I'm understanding correctly, as you don't have [...]
I think you are correct -- would you like to submit a PR to fix that?
The Artifact GC Pod log message is the same log message you get when you try to add [...]
-
I would, but I couldn't find the correct location yet.
That is correct. I might consider enabling this once this kind of behavior is an exception and doesn't happen for every failed workflow. Otherwise, my artifact repository fills up with orphaned artifacts.
Are you sure? I mean, look at point 6 again. The logs contain all the necessary details for the artifact repository. The only place this can come from is the controller-level artifact repository configuration. If that component/container were unable to get that config (somehow), it wouldn't be able to write those logs. Based on the fact that the container can write those logs, I would assume the config is accessible, but there is another problem hidden behind that generic error message. And I can't fix it if I don't know what it is.
Argo runs in the [...]
-
Yes, I'm aware :) The [...]
It should, yes. Otherwise even the successful workflows wouldn't be able to delete artifacts. But I was thinking about that as well. I still have that on my todo list to verify if there are any errors in the audit logs of MinIO while that happens. It's one of the very few ways to potentially make whatever-this-is visible.
I looked at the code and I don't see a reason why this would suddenly fail 🤔 Also: what is the [...]
What could possibly make the GC work for one artifact but not for another? 🤔
What does that mean exactly? I can check when I know what I am looking for. Thanks, btw, for all the answers. Hopefully I can fix this some day. It's driving me crazy. Since I can easily implement a workaround for successful workflows (exit handler), failed workflows are the main(!) reason why I need this feature.
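The exit-handler workaround isn't shown anywhere in this thread, so for readers who want the idea, here is a minimal sketch of what such a cleanup could look like. It reuses the bucket, endpoint, and secret names from the configmap posted later in this thread; the `mc`-based deletion and the `{{workflow.name}}` key prefix are assumptions, not something confirmed here.

```yaml
# Hypothetical exit-handler cleanup sketch. Note that onExit runs for
# both successful and failed workflows; the bucket layout is assumed.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: cleanup-example-
spec:
  entrypoint: main
  onExit: cleanup-artifacts
  templates:
    - name: main
      container:
        image: alpine:3.19
        command: [sh, -c, "echo doing work"]
    - name: cleanup-artifacts
      container:
        image: minio/mc:latest
        command: [sh, -c]
        args:
          - >-
            mc alias set repo http://s3.storage:9999 "$ACCESS_KEY" "$SECRET_KEY" &&
            mc rm --recursive --force "repo/argo-artifacts/{{workflow.name}}"
        env:
          - name: ACCESS_KEY
            valueFrom:
              secretKeyRef: {name: artifact-repository, key: USERNAME}
          - name: SECRET_KEY
            valueFrom:
              secretKeyRef: {name: artifact-repository, key: PASSWORD}
```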
-
Just an idea: is it possible that, for failed workflows, additional Kubernetes resources are somehow being created/used that my workflow service account can't access, and that's why the artifact repository seems to be not configured? This is the role used by the workflow service account. By any chance, is there some important permission missing?

```yaml
apiVersion: "rbac.authorization.k8s.io/v1"
kind: "Role"
metadata:
name: "workflow-argo"
rules:
# See https://argoproj.github.io/argo-workflows/workflow-rbac/
- apiGroups:
- "argoproj.io"
resources:
- "workflowtaskresults"
verbs:
- "create"
- "patch"
- apiGroups:
- "argoproj.io"
resources:
- "workflows"
verbs:
# Permissions to submit a workflow
- "list"
- "create"
# Permissions to resubmit, retry, resume, suspend a workflow
- "get"
- "update"
# See https://argoproj.github.io/argo-workflows/walk-through/artifacts/#service-accounts-and-annotations
- apiGroups:
- "argoproj.io"
resources:
- "workflowartifactgctasks"
verbs:
- "list"
- "watch"
# See https://argoproj.github.io/argo-workflows/walk-through/artifacts/#service-accounts-and-annotations
- apiGroups:
- "argoproj.io"
resources:
- "workflowartifactgctasks/status"
verbs:
- "patch"
```
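For completeness (this part is not from the thread): the Role above only takes effect once it is bound to the workflow service account. A standard binding would look like this sketch, where the service account name `workflow` is an assumption:

```yaml
# Hypothetical RoleBinding for the Role above; names are assumptions.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: workflow-argo
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: workflow-argo
subjects:
  - kind: ServiceAccount
    name: workflow   # assumed name of the workflow service account
```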
-
@static-moonlight can you confirm that the artifact still exists in the repository? Have you done a manual inspection of it? I assume the artifactgc finalizer doesn't get removed, right?
-
@static-moonlight I just experienced an artifactgc failure for one of my failed workflows. Will investigate and report back.
-
@static-moonlight is your workflow controller configmap syntax correct? Notably:
https://argo-workflows.readthedocs.io/en/latest/workflow-controller-configmap/
-
@static-moonlight I've added a test specific to failed workflow artifact garbage collection. It seems to be working. #12904
-
It should. Otherwise, my editor would most likely show me syntax errors, and nothing would work in the cluster. I don't like the inline yaml config ([...]), so I generate it from a file instead:

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: argo
# [...]
configMapGenerator:
- name: workflow-controller-configmap
behavior: replace
files:
- config=config.yaml
# [...]
```

config.yaml:

```yaml
# [...]
artifactRepository:
s3:
bucket: argo-artifacts
endpoint: s3.storage:9999
insecure: true
accessKeySecret:
name: artifact-repository
key: USERNAME
secretKeySecret:
name: artifact-repository
key: PASSWORD
# [...]
```

It works nicely though for normal operation. Workflows are being executed. The artifact repository only contains artifacts for active workflows, which means it works ok. Because of our latest stabilization, workflow errors are kind of rare now. I'm currently working on a dedicated setup to forcefully fail a few workflow runs, to hopefully find the missing puzzle piece that is causing all this trouble ...
-
I have a suspicion that it has something to do with this: "Pod was active on the node longer than the specified deadline". I have collected roughly 200 defective input files (which will cause the workflow to fail) and threw them at the cluster. Most of those workflows failed, but not because they couldn't handle the input: they took too long and hit the configured "activeDeadlineSeconds" of the pod. And for all those failed workflows, I see the "Artifact garbage collection failed" message again. ... I'm doing more tests on this one.
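For readers who want to reproduce this kind of failure: a deadline like the one described can be set per template. A minimal sketch, where the value and the sleeping container are illustrative assumptions:

```yaml
# Hypothetical reproduction sketch: the pod exceeds its deadline and fails.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: deadline-example-
spec:
  entrypoint: main
  templates:
    - name: main
      activeDeadlineSeconds: 60         # pod is killed after 60s
      container:
        image: alpine:3.19
        command: [sh, -c, "sleep 3600"] # runs longer than the deadline
```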
-
Well, that doesn't seem to be it either. When I try to specifically fail a single workflow by setting a pod's [...]

So far it seems that there is a random factor involved. Sometimes it works and sometimes it doesn't. In high-load situations I could produce lots of artifact-gc errors. During normal operation I almost never have them. I'm a little lost here. I can't seem to find the magic recipe to reliably (re)produce this behavior. But apparently it still happens sometimes. Any more ideas?
-
I did the same test again. This time I limited the number of parallel workflows with a semaphore. The result: green across the board. The workflows failed, as expected, and they were removed successfully, including artifact gc. This amplifies my suspicion that workload has something to do with the issue. I would even throw the theory of a potential race condition into the ring. All I can do now is incremental tests, increasing the number of parallel workflows each time, and see when it starts to break ...

EDIT: The intended limit (for parallel workflows) is set to 5. I increased it to 10, 15 and now 20. With the limit of 20, things start to break. I have a couple of leftovers from this test run ... with "Artifact garbage collection failed" errors again. I assume the number of artifact-gc errors increases when I put even more load on the cluster. Meaning: load is obviously a factor. Can someone confirm this?
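The semaphore setup isn't shown in the thread; a minimal sketch of the usual pattern follows. The ConfigMap name and key are assumptions, though the limit of 5 matches the one mentioned in the EDIT above.

```yaml
# Hypothetical semaphore limiting concurrent workflows to 5.
apiVersion: v1
kind: ConfigMap
metadata:
  name: semaphore-config
data:
  workflow: "5"
---
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: limited-
spec:
  entrypoint: main
  synchronization:
    semaphore:
      configMapKeyRef:
        name: semaphore-config
        key: workflow
  templates:
    - name: main
      container:
        image: alpine:3.19
        command: [echo, hello]
```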
-
@juliev0 @shuangkun thoughts?
-
I still have problems with the artifact GC with Argo Workflows 3.5.5.

1. The `artifact-repository` [...]
2. This is working fine, as long as there are no errors. I can see artifacts being created and automatically removed in that bucket (for successful workflows).
3. [...] "Artifact garbage collection failed" [...]
4. [...] "ArtifactGCError: Artifact Garbage Collection failed for strategy [...], pod OnWorkflowDeletion exited with non-zero exit code: check pod logs for more information".
5. The way this reads is that the GC failed because the workflow failed?! (BTW: I think this error message is messed up as well: "[...] failed for strategy {pod-name}, pod {strategy} [...]". I think the placeholders are swapped.)
6. In the `[...]-artgc-[...]` container I found this: [...]

So, in conclusion: the artifact GC failed, for an unknown reason. The `[...]-artgc-[...]` container doesn't log anything specific, but it tells me to configure an artifact storage, which I already did (see point 1), and doesn't log any errors whatsoever. And remember: the artifact repository works absolutely fine for successful workflows. Which means: the S3 service and bucket are accessible, the credentials are correct, and the required S3 permissions work as well.

So what is the problem here? What am I missing? Why is the GC somehow not working for failed workflows?
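For context, the `OnWorkflowDeletion` strategy named in the error message is enabled per workflow via `spec.artifactGC`. A minimal sketch of such a workflow, where the image, output path, and artifact name are illustrative assumptions:

```yaml
# Minimal workflow with artifact GC on deletion; details are assumptions.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: artgc-example-
spec:
  entrypoint: main
  artifactGC:
    strategy: OnWorkflowDeletion   # the strategy named in the error above
  templates:
    - name: main
      container:
        image: alpine:3.19
        command: [sh, -c, "echo hi > /tmp/out.txt"]
      outputs:
        artifacts:
          - name: out
            path: /tmp/out.txt
```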