Design decision - correct approach for costly pod starts #11719

RonaldGalea · 2023-08-30T13:13:36Z

RonaldGalea
Aug 30, 2023

Hello,

I wish to better understand the design decisions behind Argo. Specifically, the trade-offs between creating a completely separate pod instance to run each task as opposed to having a warm (but still dynamically scalable) pool of worker pods ready to accept requests (in some form).

Here are the points I can see for each approach.

Pod per task

Pros:

simple implementation
simple monitoring & error handling (watch what happens to the pod running the task)
strong isolation (no two tasks can interfere with each other)
very generic (a task just needs a container image and command to execute)

Cons:

start-up overheads: the pod start up time must be paid for each and every task

Pod worker pool

To keep things simple, let's assume single-threaded workers, so task concurrency per worker is just 1.

Pros:

warm instances are available, smaller latencies, better scalability

Cons:

complex implementation
monitoring and error handling becomes more complicated (clients probably will have to have some logic to support this)
though less likely when concurrency=1, tasks could still interfere with one another in some cases
harder to make things generic, one might need to implement clients in multiple languages.

My particular use case

I have a number of services that I need to chain together in some logical way which fits a DAG, so a Workflow Management tool like Argo seems to fit my use case well. However, some of these services have a relatively high start-up time (large container images, preparing machine learning models, etc.), and for these it would be really painful to tear down the warm environment and recreate it constantly.

Argo, as well as other Workflow Management systems that I've seen only support the Pod per task approach - but surely I'm not the only one with a use case that doesn't quite fit that well due to costly start-ups. So I have the following questions:

Why does Argo not support pod worker pools? Is this just not common enough to make it worth the implementation complexity or is there some fundamental anti-pattern that makes it unfeasible?
What is the correct system design in the case where some services are too expensive to recreate the environment for? One idea that comes to mind is to replace the actual service in the workflow with a "shell"/"dummy" task that calls the actual service which is backed by a pool of workers elsewhere. However, I'm not sure if such gymnastics are a clean system design.

Edit: The idea of a "warm pool of workers" is very general and well-known, so I find it counter-intuitive that it's just not there for Kubernetes pods. For instance, there could be a lightweight client that listens for requests or messages and calls a user-defined callback. Is there some fundamental reason/limitation why this is not done?

I would be very thankful for insights regarding the above highlighted considerations.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Design decision - correct approach for costly pod starts #11719

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 0 comments

Select a reply

Design decision - correct approach for costly pod starts #11719

RonaldGalea Aug 30, 2023

Pod per task

Pod worker pool

My particular use case

Replies: 0 comments

RonaldGalea
Aug 30, 2023