New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Workflow] Make workflow engine configurable (and other improvements) #7090
Conversation
Signed-off-by: Chris Gillum <cgillum@microsoft.com>
Signed-off-by: Chris Gillum <cgillum@microsoft.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding explanatory comments
@@ -130,8 +138,6 @@ func (a *activityActor) InvokeReminder(ctx context.Context, actorID string, remi | |||
} | |||
} | |||
|
|||
// TODO: Purge actor state based on some data retention policy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed this TODO since this work was already done in v1.11.
ScheduleWorkflow(ctx context.Context, wi *backend.OrchestrationWorkItem) error | ||
ScheduleActivity(ctx context.Context, wi *backend.ActivityWorkItem) error | ||
// actorsBackendConfig is the configuration for the workflow engine's actors backend | ||
type actorsBackendConfig struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This struct and the functions that follow were moved from wfengine.go to here since they are specific to the workflow backend. No changes were made to these.
// workflowScheduler is an interface for pushing work items into the backend | ||
type workflowScheduler interface { | ||
ScheduleWorkflow(ctx context.Context, wi *backend.OrchestrationWorkItem) error | ||
ScheduleActivity(ctx context.Context, wi *backend.ActivityWorkItem) error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I replaced the workflowScheduler
interface with two separate func definitions: workflowScheduler
and activityScheduler
, defined in workflow.go and activity.go respectively. This change made it easier to refactor some of the configuration. No changes were made to the required method signatures.
@@ -210,12 +257,18 @@ func (*actorBackend) DeleteTaskHub(context.Context) error { | |||
|
|||
// GetActivityWorkItem implements backend.Backend | |||
func (be *actorBackend) GetActivityWorkItem(ctx context.Context) (*backend.ActivityWorkItem, error) { | |||
// Wait for the workflow actor to signal us with some work to do | |||
// Wait for the activity actor to signal us with some work to do | |||
wfLogger.Debug("Actor backend is waiting for an activity actor to schedule an invocation.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These debug logs help make it easier to understand what's going on if a workflow or activity doesn't run when we expect it to (a problem that we've often seen but lacked clarity on what's actually stuck).
pkg/runtime/wfengine/wfengine.go
Outdated
wfe.backend, | ||
wfe.executor, | ||
wfBackendLogger, | ||
backend.WithMaxParallelism(wfe.spec.MaxConcurrentWorkflows)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is where we actually reference the new max concurrency configuration settings. Previously these values were hardcoded to 100 (see the deleted lines above here).
Signed-off-by: Chris Gillum <cgillum@microsoft.com>
Adding configuration options are useful only if developers know when they should tweak them. For workflow concurrency, what is the guidelines for developers? When should they tweak the number below and above 100, and what are the tradeoffs of tweaking that number? |
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## master #7090 +/- ##
==========================================
- Coverage 64.91% 64.87% -0.04%
==========================================
Files 221 221
Lines 21004 21056 +52
==========================================
+ Hits 13634 13661 +27
- Misses 6213 6239 +26
+ Partials 1157 1156 -1 ☔ View full report in Codecov by Sentry. |
@yaron2 Does the linked docs PR help answer this question? I can add more guidance if necessary but may need direction in terms of where the best place to document such guidance is. |
Additional context needs to be added, and I'll make comments on the docs PR based on the discussion here. What can a user expect when they schedule a workflow or activity that passes the threshold? Is it rejected or queued until a workflow/activity complete? |
A request that passes the threshold will be queued until a workflow/activity completes. Note that this is primarily intended to be used as a safety mechanism to prevent runaway execution by workflows. For example, if a workflow were to schedule 1,000 concurrent CPU-intensive activities, this feature can be used to help ensure that we don't try to execute all of that work at the same time on any given machine. |
Below 100 when the developer/operator needs to run workflows in resource-limited environments, or if it is known in advance that activity/workflow execution can be very CPU/memory intensive. Above 100 when that value is too restrictive given the amount of compute resources, limiting their overall workflow throughput.
Increasing the value allows for greater concurrency (throughput) at the cost of more resource consumption. Lowering the value leads to more conservative resource consumption at the cost of reduced throughput. Note that we can debate whether 100 is the right number. I'm beginning to think setting the default higher (perhaps 1,000) might be reasonable. However, as of 1.12, the limit is hardcoded to 100. |
Signed-off-by: Chris Gillum <cgillum@microsoft.com>
Signed-off-by: Chris Gillum <cgillum@microsoft.com>
Signed-off-by: Chris Gillum <cgillum@microsoft.com>
Signed-off-by: Chris Gillum <cgillum@microsoft.com>
Signed-off-by: Chris Gillum <cgillum@microsoft.com>
Signed-off-by: Chris Gillum <cgillum@microsoft.com>
Signed-off-by: Chris Gillum <cgillum@microsoft.com>
But what is resource-limited? It's very subjective. I'm extremely worried about using magic numbers with no proper guidance. The fact you wrote this could be made 1000 instead of 100 strengthens this.
This might be a proper way to describe this to users, so they know to observe and tweak if needed. Please carry out this language to the associated docs issue. |
When queued, will users get an ACK from the Dapr runtime immediately after queuing or will the client hang until its picked up? |
Can you clarify what specifically you're extremely worried about? People configuring it wrong in a particular way? Backing up to talk a little bit more about the motivation for this work: In my experience, the most common resource-related issue that users run into is running out of memory. It's really easy to run out of memory because the workflow programming model makes it really easy to schedule tons of work to run in parallel, which is obviously a double-edged sword. If those activities need to allocate a non-trivial amount of memory, then apps may start getting ugly OOM failures that create massive app stability issues. Next up would be overwhelming some downstream dependency, like a database, with too many connections because you have too many activities running concurrently. High CPU is another one, which should be self-explanatory. These are all app-specific issues which will apply to some workloads but not to others. The fastest and most effective way to mitigate most of these problems when they impact production is to make a config change to reduce concurrency. Dapr Workflow users today have no such capability since we currently hardcode these throttles - hence this PR.
Yes, this is absolutely an "observe and tweak if needed" configuration knob. I can make this clearer in the documentation.
Regardless of whether you're above or below the threshold, the client always gets an immediate ACK. Clients are never blocked on workflow operations. Similarly, workflows are never blocked on scheduling activity calls. These existing throttles are expected only to add latency between the time you schedule some work and the time when that work starts executing. |
Signed-off-by: Chris Gillum <cgillum@microsoft.com>
pkg/runtime/wfengine/workflow.go
Outdated
if !wf.cachingDisabled { | ||
// update cached state | ||
wf.states.Store(actorID, state) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need the same check at the start of the function, at line 514?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also in saveInternalState
the cache is being updated even if the TransactionStateOperation fails lines 539 and 549?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not strictly necessary for 514 since the cache should be empty.
For 539, 549, yes this needs to be fixed. That issue is why some Cosmos DB compatibility problems weren’t more quickly detected. Could do it now or in another PR.
/ok-to-test |
/test-version-skew |
Dapr E2E testCommit ref: e32647e ✅ Build succeeded for linux/amd64
✅ Infrastructure deployed
✅ Build succeeded for windows/amd64
❌ Tests failed on windows/amd64Please check the logs for details on the error. ❌ Tests failed on linux/amd64Please check the logs for details on the error. |
Dapr Version Skew test (dapr-sidecar-master - 1.12.0)Commit ref: e32647e ❌ Version Skew tests failedPlease check the logs for details on the error. |
Dapr Version Skew test (control-plane-master - 1.12.0)Commit ref: e32647e ✅ Version Skew tests passed |
Signed-off-by: Chris Gillum <cgillum@microsoft.com>
Dapr Version Skew test (dapr-sidecar-master - 1.12.2)Commit ref: e32647e ❌ Version Skew tests failedPlease check the logs for details on the error. |
/ok-to-test |
Dapr E2E testCommit ref: bf86875 ✅ Build succeeded for linux/amd64
✅ Infrastructure deployed
✅ Build succeeded for windows/amd64
✅ Tests succeeded on windows/amd64
✅ Tests succeeded on linux/amd64
|
Tests are looking good. Are we good to merge? @mukundansundar @artursouza @yaron2 I see that there's an |
Description
Until now, there was no way to configure the workflow engine. Some values, such as maximum concurrency thresholds, had been hardcoded. With this PR, however, it's now possible to configure the workflow engine. In particular, this PR enables configuring the maximum workflow and activity execution concurrency (defaults to 100, as it did previously).
This PR also contains other misc. fixes and code refactoring. Specifically:
I've added comments to the PR directly to help explain some of the specific changes.
Issue reference
#7089
Checklist
Specification has been updated / Created issue in the https://github.com/dapr/docs/ repo: dapr/docs#_[issue number]Provided sample for the feature / Created issue in the https://github.com/dapr/docs/ repo: dapr/docs#_[issue number]