Argo fails to upload larger output artifacts with sidecar service mesh like LinkerD #12919

Open · 3 of 4 tasks
HasithaAthukorala opened this issue Apr 10, 2024 · 7 comments
Labels: area/artifacts (S3/GCP/OSS/Git/HDFS etc), area/executor, P3 (low priority), solution/workaround (there's a workaround, might not be great, but exists), type/bug, type/regression (regression from previous behavior)


HasithaAthukorala commented Apr 10, 2024

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what did you expect to happen?

Bug:

  • I have added a pod annotation to kill the linkerd-proxy container: workflows.argoproj.io/kill-cmd-linkerd-proxy: ["/usr/lib/linkerd/linkerd-await", "--shutdown", "--", "sleep", "--help"] (see the fragment after this list)
  • The above command is executed just after the main container completes, so the wait container only has the termination grace period to finish uploading its output artifacts to S3
  • When the output artifacts are large, or there are many of them, the wait container gets stuck mid-upload once the grace period expires, because the linkerd-proxy container has already been killed
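
A minimal fragment showing where this annotation sits, assuming it is set through the Workflow spec's podMetadata field (illustrative, not copied from my pipelines):

spec:
  podMetadata:
    annotations:
      # tells the Argo executor how to shut down the injected linkerd-proxy container
      workflows.argoproj.io/kill-cmd-linkerd-proxy: '["/usr/lib/linkerd/linkerd-await", "--shutdown", "--", "sleep", "--help"]'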

Ideal Solution:

  • The command to kill the linkerd-proxy container should only be executed after the wait container has completed

Version

v3.5.5 (latest)

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

I used the Golang SDK to publish pipelines

  • You can use any workflow for this (a hedged example follows this list)
  • LinkerD should be installed in the cluster
  • The output artifacts should be large enough that uploading them takes longer than the termination grace period
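
Since my actual pipelines are published via the Golang SDK, here is a hedged, minimal sketch of a workflow that should reproduce it. The image, file size, and annotation placement are illustrative, and it assumes an artifact repository (e.g. S3) is already configured:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: large-artifact-linkerd-
spec:
  entrypoint: produce
  podMetadata:
    annotations:
      # opt the pod into the LinkerD mesh and tell Argo how to kill the injected proxy
      linkerd.io/inject: enabled
      workflows.argoproj.io/kill-cmd-linkerd-proxy: '["/usr/lib/linkerd/linkerd-await", "--shutdown", "--", "sleep", "--help"]'
  templates:
    - name: produce
      container:
        image: alpine:3.19
        command: [sh, -c]
        # write ~2 GiB of output so the artifact upload outlasts the termination grace period
        args: ["dd if=/dev/urandom of=/tmp/large.bin bs=1M count=2048"]
      outputs:
        artifacts:
          - name: large-output
            path: /tmp/large.bin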

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

Logs from in your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded

HasithaAthukorala commented Apr 10, 2024

I tried this with the latest available version (v3.5.5), but the issue is still there, hence this report.
However, I was able to work around it using sidecars: https://argo-workflows.readthedocs.io/en/latest/walk-through/sidecars/

Below is the hack I used to work around this (hopefully it helps someone hitting the same issue):

I implemented a sidecar that does the following (a sketch of how it is attached appears after the note below):

  1. Modify the pod annotation for LinkerD - I have configured the workflow to add a pod annotation that kills the linkerd-proxy container (workflows.argoproj.io/kill-cmd-linkerd-proxy: ["/usr/lib/linkerd/linkerd-await", "--shutdown", "--", "sleep", "--help"]). I can't remove this annotation from the workflow, because it applies not only to the template pod where this new sidecar runs but also to the finalizer pod; if I removed it, the workflow would keep running because the linkerd-proxy container would never die. So I added logic to the sidecar to modify the annotation only on the template pod
  2. Modify the pod annotation for this new sidecar - Sidecars are killed just after the main container completes, but that kill command can be overridden with another pod annotation. This is the annotation I add: workflows.argoproj.io/kill-cmd-{name of the sidecar}: ["echo"]
  3. Watch the status of the wait container
  4. Call the shutdown hook of linkerd-proxy once the wait container has completed: curl -X POST http://localhost:4191/shutdown
  5. Exit, so the pod itself can complete

NOTE: You can take a look at the code here - https://github.com/HasithaAthukorala/argo-sidecar
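
For illustration, a hedged sketch of how such a sidecar could be attached to a template. The sidecar name and image below are placeholders, and the actual watching/shutdown logic lives in the repository linked above:

  templates:
    - name: main
      metadata:
        annotations:
          # override Argo's kill command for our own sidecar so it outlives the main container
          workflows.argoproj.io/kill-cmd-linkerd-killer: '["echo"]'
      sidecars:
        # placeholder name and image; the real sidecar watches the wait container and then
        # calls `curl -X POST http://localhost:4191/shutdown` against linkerd-proxy
        - name: linkerd-killer
          image: ghcr.io/example/argo-sidecar:latest
      container:
        image: alpine:3.19
        command: [sh, -c]
        args: ["echo main work happens here"]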

@shuangkun shuangkun added the area/artifacts S3/GCP/OSS/Git/HDFS etc label Apr 10, 2024
@agilgur5 agilgur5 added the solution/workaround There's a workaround, might not be great, but exists label Apr 10, 2024
@agilgur5 agilgur5 changed the title [BUG] Argo fails to upload output artifacts (when the artifacts are heavier) when there is a service mesh like LinkerD Argo fails to upload larger output artifacts with service mesh like LinkerD Apr 10, 2024
@agilgur5 agilgur5 added P3 Low priority area/executor labels Apr 10, 2024

agilgur5 commented Apr 10, 2024

Noting that this is a follow-up to this Slack thread

  • The above command is executed just after the main container completes, so the wait container only has the termination grace period to finish uploading its output artifacts to S3

I tried root causing this in the thread (see links to the source code there), but couldn't quite find where this happens.

@agilgur5 agilgur5 changed the title Argo fails to upload larger output artifacts with service mesh like LinkerD Argo fails to upload larger output artifacts with sidecar service mesh like LinkerD Apr 10, 2024
@HasithaAthukorala

@agilgur5 Is this related to that? I think this is the place where all the custom kill-cmd pod annotations are executed (killing all the sidecars):

go wait.UntilWithContext(ctx, wfc.runPodCleanup, time.Second)
}


I mean, at a really high level yes, but you have to go down several layers of the call stack to get to SignalContainer, which actually runs the kill command.

The question is why that is occurring, as normally containers are only killed when the Workflow has been stopped or cancelled, or the step has exceeded its deadline. In the codebase, that is when Pods are queued up with the action terminateContainers or killContainers.

I suspect it is deeper than that; there seems to be an assumption somewhere that when the main container finishes, all containers should be killed and that wait executes after main is killed.

The solution to that seems non-trivial: the first kill should only take out main (and maybe others if you have sidecars or a containerSet?), and then there needs to be a second kill after wait finishes to take out the remaining containers.


agilgur5 commented Apr 25, 2024

I suspect it is deeper than that; there seems to be an assumption somewhere that when the main container finishes, all containers should be killed and that wait executes after main is killed.

OK, I found that assumption in the codebase; the comments basically say as much:

func podHasContainerNeedingTermination(pod *apiv1.Pod, tmpl wfv1.Template) bool {

This originates from the injected sidecar support PR #5383, which did check for the wait container specifically. Then a check for the main container was added in #6033, and the wait check was moved in #8478.

That code still exists, so this is possibly a regression from #8478? I also wonder whether the changes from #11484 make that no longer necessary.

cc @alexec and @terrytangyuan -- this has gone through a lot of iterations, so I am hesitant to change it myself in case that would fix one regression while causing another. This needs a really careful touch, and both of you have already been involved with it.

@agilgur5 agilgur5 added the type/regression Regression from previous behavior (a specific type of bug) label Apr 25, 2024
@HasithaAthukorala

Yes, I believe this is happening due to the Sidecar Injection feature: https://argo-workflows.readthedocs.io/en/latest/sidecar-injection/

To solve this issue, I used the above feature as a hack to inject another sidecar to kill the linkerd-proxy container. I could see that all the sidecars are killed just after the main container completes. That is also mentioned in the documentation: https://argo-workflows.readthedocs.io/en/latest/sidecar-injection/#:~:text=It%20can%20be%20impossible%20for%20the%20wait%20container%20to%20terminate%20the%20sidecar.


agilgur5 commented Apr 25, 2024

Your comment is not quite correct, so I will correct it below. It's not related to my root-cause analysis above either; it's just a misunderstanding of the features.

Yes, I believe this is happening due to the feature, Sidecar Injection - https://argo-workflows.readthedocs.io/en/latest/sidecar-injection/

Er, that's the feature that makes using Sidecar Injection with Argo even work in the first place. Without it, Pods hang.

There's just a bug in it.

To solve this issue, I used above feature as a hack to inject another sidecar

As you wrote previously, you used regular Sidecars: https://argo-workflows.readthedocs.io/en/latest/walk-through/sidecars/.
"Injected" means that they're added by a different controller, LinkerD in your case, and not Argo. You didn't "inject" anything other than LinkerD, you just added a regular sidecar.

The "injection" is what makes service mesh sidecars cause bugs with lots of different tools, like Argo. The tools don't know about the injected sidecar as they didn't add them themselves, something else "injected" them. Often that means the tool now has to support them somehow in a wacky way, which is what Argo tries to do. Injected sidecars are a hack in and of themselves, and so often require more hacks to get working. That's why Istio's Ambient Mesh is a game changer, as it doesn't require injected sidecars, resolving a mountain of issues like these.

You did use the kill command feature, which is normally for injected sidecars, as a hack to allow your new sidecar to keep running. That is a clever workaround.

all the sidecars are killed just after the main container is completed. That is also mentioned in the documentation as well - https://argo-workflows.readthedocs.io/en/latest/sidecar-injection/#:~:text=It%20can%20be%20impossible%20for%20the%20wait%20container%20to%20terminate%20the%20sidecar.

That's not what that quote says. It says that the wait container itself cannot necessarily kill other containers. Your LinkerD sidecar isn't being killed by the wait container; it's being killed by the Argo Controller.

The interesting thing here is that your workaround shows a sidecar can kill it in this case, since it's reachable over the local network.
The wait container is itself a sidecar and could potentially do something similar.
