[Workflow] Make workflow engine configurable (and other improvements) #7090

cgillum · 2023-10-24T03:20:08Z

Description

Until now, there was no way to configure the workflow engine. Some values, such as maximum concurrency thresholds, had been hardcoded. With this PR, however, it's now possible to configure the workflow engine. In particular, this PR enables configuring the maximum workflow and activity execution concurrency (defaults to 100, as it did previously).

This PR also contains other misc. fixes and code refactoring. Specifically:

- Reorganized some of the workflow and activity configuration to separate it more cleanly from the workflow engine configuration.
- Added debug logging based on issues I hit while debugging my changes.
- Updated the workflow test code to output debug logs (previously it output only info logs).

I've added comments to the PR directly to help explain some of the specific changes.

Issue reference

#7089

Checklist

Code compiles correctly
Created/updated tests
Unit tests passing
End-to-end tests passing
Extended the documentation / Created issue in the https://github.com/dapr/docs/ repo: Workflow configuration docs#3844
~~Specification has been updated / Created issue in the https://github.com/dapr/docs/ repo: dapr/docs#_[issue number]~~
~~Provided sample for the feature / Created issue in the https://github.com/dapr/docs/ repo: dapr/docs#_[issue number]~~

Signed-off-by: Chris Gillum <cgillum@microsoft.com>

cgillum

Adding explanatory comments

cgillum · 2023-10-24T03:21:09Z

pkg/runtime/wfengine/activity.go

@@ -130,8 +138,6 @@ func (a *activityActor) InvokeReminder(ctx context.Context, actorID string, remi
 		}
 	}

-	// TODO: Purge actor state based on some data retention policy


Removed this TODO since this work was already done in v1.11.

cgillum · 2023-10-24T03:22:49Z

pkg/runtime/wfengine/backend.go

-	ScheduleWorkflow(ctx context.Context, wi *backend.OrchestrationWorkItem) error
-	ScheduleActivity(ctx context.Context, wi *backend.ActivityWorkItem) error
+// actorsBackendConfig is the configuration for the workflow engine's actors backend
+type actorsBackendConfig struct {


This struct and the functions that follow were moved from wfengine.go to here since they are specific to the workflow backend. No changes were made to these.

cgillum · 2023-10-24T03:30:59Z

pkg/runtime/wfengine/backend.go

-// workflowScheduler is an interface for pushing work items into the backend
-type workflowScheduler interface {
-	ScheduleWorkflow(ctx context.Context, wi *backend.OrchestrationWorkItem) error
-	ScheduleActivity(ctx context.Context, wi *backend.ActivityWorkItem) error


I replaced the workflowScheduler interface with two separate func definitions: workflowScheduler and activityScheduler, defined in workflow.go and activity.go respectively. This change made it easier to refactor some of the configuration. No changes were made to the required method signatures.
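
As a rough illustration of the refactoring described above, a two-method interface can be split into two independent func types so that each actor only depends on the scheduler it needs. This is a minimal sketch, not the code from this PR: `orchestrationWorkItem`, `activityWorkItem`, the actor structs, and `scheduleBoth` are hypothetical stand-ins for the real `backend` types and wiring.

```go
package main

import (
	"context"
	"fmt"
)

// Hypothetical stand-ins for backend.OrchestrationWorkItem and backend.ActivityWorkItem.
type orchestrationWorkItem struct{ InstanceID string }
type activityWorkItem struct{ TaskID int }

// Two func types replace the single two-method interface.
// The parameter and return shapes mirror the removed interface methods.
type workflowScheduler func(ctx context.Context, wi *orchestrationWorkItem) error
type activityScheduler func(ctx context.Context, wi *activityWorkItem) error

// Hypothetical actors: each one holds only the scheduler it actually uses.
type workflowActor struct{ scheduler workflowScheduler }
type activityActor struct{ scheduler activityScheduler }

// scheduleBoth wires up both schedulers with closures and records what was scheduled.
func scheduleBoth() []string {
	var scheduled []string
	wf := workflowActor{scheduler: func(_ context.Context, wi *orchestrationWorkItem) error {
		scheduled = append(scheduled, "workflow:"+wi.InstanceID)
		return nil
	}}
	act := activityActor{scheduler: func(_ context.Context, wi *activityWorkItem) error {
		scheduled = append(scheduled, fmt.Sprintf("activity:%d", wi.TaskID))
		return nil
	}}
	ctx := context.Background()
	_ = wf.scheduler(ctx, &orchestrationWorkItem{InstanceID: "abc"})
	_ = act.scheduler(ctx, &activityWorkItem{TaskID: 7})
	return scheduled
}

func main() {
	fmt.Println(scheduleBoth())
}
```

One practical effect of this shape is that a test can pass a plain closure as the scheduler instead of implementing a full interface.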

cgillum · 2023-10-24T03:32:32Z

pkg/runtime/wfengine/backend.go

@@ -210,12 +257,18 @@ func (*actorBackend) DeleteTaskHub(context.Context) error {

 // GetActivityWorkItem implements backend.Backend
 func (be *actorBackend) GetActivityWorkItem(ctx context.Context) (*backend.ActivityWorkItem, error) {
-	// Wait for the workflow actor to signal us with some work to do
+	// Wait for the activity actor to signal us with some work to do
+	wfLogger.Debug("Actor backend is waiting for an activity actor to schedule an invocation.")


These debug logs help make it easier to understand what's going on if a workflow or activity doesn't run when we expect it to (a problem that we've often seen but lacked clarity on what's actually stuck).

cgillum · 2023-10-24T03:34:08Z

pkg/runtime/wfengine/wfengine.go

+		wfe.backend,
+		wfe.executor,
+		wfBackendLogger,
+		backend.WithMaxParallelism(wfe.spec.MaxConcurrentWorkflows))


This is where we actually reference the new max concurrency configuration settings. Previously these values were hardcoded to 100 (see the deleted lines above here).
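
For readers who want a picture of how a configurable limit with a fallback default might look, the following is a sketch under stated assumptions. The `workflowSpec` struct, its field names, and the getter names are hypothetical (later commits in this PR rename the properties), and the default of 100 is taken from the PR description.

```go
package main

import "fmt"

// defaultMaxConcurrency is the previously hardcoded limit cited in the PR description.
const defaultMaxConcurrency int32 = 100

// workflowSpec is a hypothetical configuration section; the field names are
// illustrative and are not the final property names from this PR.
type workflowSpec struct {
	MaxConcurrentWorkflows  int32
	MaxConcurrentActivities int32
}

// GetMaxConcurrentWorkflows returns the configured limit, or the default when
// the spec is missing or the value is unset (zero or negative).
func (s *workflowSpec) GetMaxConcurrentWorkflows() int32 {
	if s == nil || s.MaxConcurrentWorkflows <= 0 {
		return defaultMaxConcurrency
	}
	return s.MaxConcurrentWorkflows
}

// GetMaxConcurrentActivities applies the same fallback rule to activities.
func (s *workflowSpec) GetMaxConcurrentActivities() int32 {
	if s == nil || s.MaxConcurrentActivities <= 0 {
		return defaultMaxConcurrency
	}
	return s.MaxConcurrentActivities
}

func main() {
	var unset *workflowSpec // no workflow section configured
	custom := &workflowSpec{MaxConcurrentWorkflows: 25}

	fmt.Println(unset.GetMaxConcurrentWorkflows())
	fmt.Println(custom.GetMaxConcurrentWorkflows())
	fmt.Println(custom.GetMaxConcurrentActivities())
}
```

Nil-safe getters let call sites such as the one in the hunk above read a limit without checking whether a workflow section was provided at all.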

Signed-off-by: Chris Gillum <cgillum@microsoft.com>

yaron2 · 2023-10-24T04:00:07Z

Adding configuration options is useful only if developers know when they should tweak them. For workflow concurrency, what are the guidelines for developers? When should they tweak the number below and above 100, and what are the tradeoffs of tweaking that number?

codecov · 2023-10-24T04:08:12Z

Codecov Report

Attention: 29 lines in your changes are missing coverage. Please review.

Comparison is base (a86f9d6) 64.91% compared to head (f05ffc7) 64.87%.

| Files | Patch % | Lines |
|---|---|---|
| pkg/runtime/wfengine/backend.go | 78.43% | 11 Missing ⚠️ |
| pkg/config/configuration.go | 33.33% | 9 Missing and 1 partial ⚠️ |
| pkg/runtime/wfengine/activity.go | 41.66% | 6 Missing and 1 partial ⚠️ |
| pkg/runtime/runtime.go | 50.00% | 1 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #7090      +/-   ##
==========================================
- Coverage   64.91%   64.87%   -0.04%     
==========================================
  Files         221      221              
  Lines       21004    21056      +52     
==========================================
+ Hits        13634    13661      +27     
- Misses       6213     6239      +26     
+ Partials     1157     1156       -1

cgillum · 2023-10-24T05:35:36Z

> Adding configuration options is useful only if developers know when they should tweak them. For workflow concurrency, what are the guidelines for developers? When should they tweak the number below and above 100, and what are the tradeoffs of tweaking that number?

@yaron2 Does the linked docs PR help answer this question? I can add more guidance if necessary but may need direction in terms of where the best place to document such guidance is.

yaron2 · 2023-10-24T05:44:31Z

> > Adding configuration options is useful only if developers know when they should tweak them. For workflow concurrency, what are the guidelines for developers? When should they tweak the number below and above 100, and what are the tradeoffs of tweaking that number?
>
> @yaron2 Does the linked docs PR help answer this question? I can add more guidance if necessary but may need direction in terms of where the best place to document such guidance is.

Additional context needs to be added, and I'll make comments on the docs PR based on the discussion here.

What can a user expect when they schedule a workflow or activity that passes the threshold? Is it rejected, or queued until a workflow/activity completes?

cgillum · 2023-10-24T16:48:55Z

> What can a user expect when they schedule a workflow or activity that passes the threshold? Is it rejected, or queued until a workflow/activity completes?

A request that passes the threshold will be queued until a workflow/activity completes.

Note that this is primarily intended to be used as a safety mechanism to prevent runaway execution by workflows. For example, if a workflow were to schedule 1,000 concurrent CPU-intensive activities, this feature can be used to help ensure that we don't try to execute all of that work at the same time on any given machine.

cgillum · 2023-10-24T16:54:41Z

> When should they tweak the number below and above 100

Below 100 when the developer/operator needs to run workflows in resource-limited environments, or if it is known in advance that activity/workflow execution can be very CPU/memory intensive.

Above 100 when that value is too restrictive given the amount of compute resources, limiting their overall workflow throughput.

> and what are the tradeoffs of tweaking that number?

Increasing the value allows for greater concurrency (throughput) at the cost of more resource consumption. Lowering the value leads to more conservative resource consumption at the cost of reduced throughput.

Note that we can debate whether 100 is the right number. I'm beginning to think setting the default higher (perhaps 1,000) might be reasonable. However, as of 1.12, the limit is hardcoded to 100.

Signed-off-by: Chris Gillum <cgillum@microsoft.com>

yaron2 · 2023-10-31T18:35:03Z

> Below 100 when the developer/operator needs to run workflows in resource-limited environments, or if it is known in advance that activity/workflow execution can be very CPU/memory intensive.

But what is resource-limited? It's very subjective. I'm extremely worried about using magic numbers with no proper guidance. The fact you wrote this could be made 1000 instead of 100 strengthens this.

> Increasing the value allows for greater concurrency (throughput) at the cost of more resource consumption. Lowering the value leads to more conservative resource consumption at the cost of reduced throughput.

This might be a proper way to describe this to users, so they know to observe and tweak if needed. Please carry this language over to the associated docs issue.

yaron2 · 2023-10-31T18:36:29Z

> A request that passes the threshold will be queued until a workflow/activity completes

When queued, will users get an ACK from the Dapr runtime immediately after queuing, or will the client hang until it's picked up?

cgillum · 2023-10-31T20:16:45Z

> But what is resource-limited? It's very subjective. I'm extremely worried about using magic numbers with no proper guidance.

Can you clarify what specifically you're extremely worried about? People configuring it wrong in a particular way?

Backing up to talk a little bit more about the motivation for this work: In my experience, the most common resource-related issue that users run into is running out of memory. It's really easy to run out of memory because the workflow programming model makes it really easy to schedule tons of work to run in parallel, which is obviously a double-edged sword. If those activities need to allocate a non-trivial amount of memory, then apps may start getting ugly OOM failures that create massive app stability issues. Next up would be overwhelming some downstream dependency, like a database, with too many connections because you have too many activities running concurrently. High CPU is another one, which should be self-explanatory. These are all app-specific issues which will apply to some workloads but not to others. The fastest and most effective way to mitigate most of these problems when they impact production is to make a config change to reduce concurrency. Dapr Workflow users today have no such capability since we currently hardcode these throttles - hence this PR.

> This might be a proper way to describe this to users, so they know to observe and tweak if needed. Please carry this language over to the associated docs issue.

Yes, this is absolutely an "observe and tweak if needed" configuration knob. I can make this clearer in the documentation.

> When queued, will users get an ACK from the Dapr runtime immediately after queuing, or will the client hang until it's picked up?

Regardless of whether you're above or below the threshold, the client always gets an immediate ACK. Clients are never blocked on workflow operations. Similarly, workflows are never blocked on scheduling activity calls. These existing throttles are expected only to add latency between the time you schedule some work and the time when that work starts executing.
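
The "immediate ACK, delayed start" behavior can be pictured with a small in-memory sketch. Everything here is hypothetical (`engine`, `Schedule`, `demoImmediateAck`), and the buffered channel merely stands in for whatever queues the work; the only point is that scheduling returns before the work begins when the concurrency limit is already reached.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// engine is a hypothetical scheduler: Schedule enqueues and returns, while a
// fixed number of workers (the concurrency limit) drain the queue.
type engine struct {
	queue chan func()
	wg    sync.WaitGroup
}

func newEngine(maxConcurrent, queueSize int) *engine {
	e := &engine{queue: make(chan func(), queueSize)}
	for i := 0; i < maxConcurrent; i++ {
		go func() {
			for job := range e.queue {
				job()
				e.wg.Done()
			}
		}()
	}
	return e
}

// Schedule acknowledges right away; it does not wait for the job to start.
// This sketch assumes the queue is large enough that enqueueing never blocks.
func (e *engine) Schedule(job func()) {
	e.wg.Add(1)
	e.queue <- job
}

// Wait blocks until every scheduled job has finished, then stops the workers.
func (e *engine) Wait() {
	e.wg.Wait()
	close(e.queue)
}

// demoImmediateAck schedules 5 jobs with a limit of 1 and reports how many
// had started once all Schedule calls returned, and how many finished overall.
func demoImmediateAck() (startedAtAck, completed int32) {
	e := newEngine(1, 16)
	release := make(chan struct{})
	var started, finished int32
	for i := 0; i < 5; i++ {
		e.Schedule(func() {
			atomic.AddInt32(&started, 1)
			<-release // hold the single slot until released
			atomic.AddInt32(&finished, 1)
		})
	}
	startedAtAck = atomic.LoadInt32(&started) // every Schedule call has already returned
	close(release)
	e.Wait()
	return startedAtAck, atomic.LoadInt32(&finished)
}

func main() {
	startedAtAck, completed := demoImmediateAck()
	fmt.Println("started when scheduling returned:", startedAtAck)
	fmt.Println("completed in total:", completed)
}
```

With a limit of 1, at most one job has started by the time all five scheduling calls return, yet all five complete once slots free up. The throttle adds latency, not rejection.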

mukundansundar · 2023-11-15T09:31:08Z

pkg/runtime/wfengine/workflow.go

+	if !wf.cachingDisabled {
+		// update cached state
+		wf.states.Store(actorID, state)
+	}


Do we need the same check at the start of the function, at line 514?

Also, in saveInternalState, is the cache being updated even if the TransactionStateOperation fails (lines 539 and 549)?

Not strictly necessary for 514 since the cache should be empty.

For 539, 549, yes this needs to be fixed. That issue is why some Cosmos DB compatibility problems weren’t more quickly detected. Could do it now or in another PR.
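
The fix discussed here, updating the cache only after the state store write succeeds, could look roughly like the sketch below. It borrows the names `saveInternalState`, `states`, and `cachingDisabled` from the thread, but the `stateStore` interface, the fake stores, and the method signature are assumptions for illustration and do not reflect the actual actor state APIs.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Hypothetical stand-in for the workflow's persisted state.
type workflowState struct{ Generation int }

// stateStore is a hypothetical abstraction over the transactional state write.
type stateStore interface {
	Save(actorID string, s *workflowState) error
}

type workflowActor struct {
	store           stateStore
	states          sync.Map // actorID -> *workflowState cache
	cachingDisabled bool
}

// saveInternalState writes to the store first and touches the cache only if
// the write succeeded, so a failed write can never leave a stale cache entry.
func (wf *workflowActor) saveInternalState(actorID string, s *workflowState) error {
	if err := wf.store.Save(actorID, s); err != nil {
		return fmt.Errorf("saving state for %s: %w", actorID, err)
	}
	if !wf.cachingDisabled {
		wf.states.Store(actorID, s)
	}
	return nil
}

// Fake stores used only to exercise both paths.
type okStore struct{}

func (okStore) Save(string, *workflowState) error { return nil }

type failingStore struct{}

func (failingStore) Save(string, *workflowState) error { return errors.New("transaction failed") }

func main() {
	failing := &workflowActor{store: failingStore{}}
	err := failing.saveInternalState("wf-1", &workflowState{Generation: 1})
	_, cached := failing.states.Load("wf-1")
	fmt.Println("error:", err != nil, "cached:", cached)

	ok := &workflowActor{store: okStore{}}
	err = ok.saveInternalState("wf-1", &workflowState{Generation: 1})
	_, cached = ok.states.Load("wf-1")
	fmt.Println("error:", err != nil, "cached:", cached)
}
```

Ordering the cache update after the write also means store incompatibilities surface as errors instead of being masked by cached state.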

mukundansundar · 2023-11-15T09:40:31Z

/ok-to-test

mukundansundar · 2023-11-15T09:40:54Z

/test-version-skew

dapr-bot · 2023-11-15T09:40:58Z

Dapr E2E test

🔗 Link to Action run

Commit ref: e32647e

✅ Build succeeded for linux/amd64

Image tag: dapre2e5f2fa5fcd1l
Test image tag: dapre2e5f2fa5fcd1l

✅ Infrastructure deployed

| Cluster | Resource group name | Azure region |
|---|---|---|
| Linux | `Dapr-E2E-dapre2e5f2fa5fcd1l` | westus3 |
| Windows | `Dapr-E2E-dapre2e5f2fa5fcd1w` | westus3 |
| Linux/arm64 | `Dapr-E2E-dapre2e5f2fa5fcd1la` | eastus |

✅ Build succeeded for windows/amd64

Image tag: dapre2e5f2fa5fcd1w
Test image tag: dapre2e5f2fa5fcd1w

❌ Tests failed on windows/amd64

Please check the logs for details on the error.

❌ Tests failed on linux/amd64

Please check the logs for details on the error.

dapr-bot · 2023-11-15T09:41:23Z

Dapr Version Skew test (dapr-sidecar-master - 1.12.0)

🔗 Link to Action run

Commit ref: e32647e

❌ Version Skew tests failed

Please check the logs for details on the error.

dapr-bot · 2023-11-15T09:41:24Z

Dapr Version Skew test (control-plane-master - 1.12.0)

🔗 Link to Action run

Commit ref: e32647e

✅ Version Skew tests passed

Signed-off-by: Chris Gillum <cgillum@microsoft.com>

dapr-bot · 2023-11-21T04:33:08Z

Dapr Version Skew test (dapr-sidecar-master - 1.12.2)

🔗 Link to Action run

Commit ref: e32647e

❌ Version Skew tests failed

Please check the logs for details on the error.

mukundansundar · 2023-11-21T12:57:44Z

/ok-to-test

dapr-bot · 2023-11-21T12:58:12Z

Dapr E2E test

🔗 Link to Action run

Commit ref: bf86875

✅ Build succeeded for linux/amd64

Image tag: dapre2e26b9f10ea9l
Test image tag: dapre2e26b9f10ea9l

✅ Infrastructure deployed

| Cluster | Resource group name | Azure region |
|---|---|---|
| Linux | `Dapr-E2E-dapre2e26b9f10ea9l` | westus3 |
| Windows | `Dapr-E2E-dapre2e26b9f10ea9w` | westus3 |
| Linux/arm64 | `Dapr-E2E-dapre2e26b9f10ea9la` | eastus |

✅ Build succeeded for windows/amd64

Image tag: dapre2e26b9f10ea9w
Test image tag: dapre2e26b9f10ea9w

✅ Tests succeeded on windows/amd64

Image tag: dapre2e26b9f10ea9w
Test image tag: dapre2e26b9f10ea9w

✅ Tests succeeded on linux/amd64

Image tag: dapre2e26b9f10ea9l
Test image tag: dapre2e26b9f10ea9l

cgillum · 2023-11-21T21:56:08Z

Tests are looking good. Are we good to merge? @mukundansundar @artursouza @yaron2

I see that there's an automerge label added, but I'm not sure what the criteria for that is.

cgillum added 2 commits October 24, 2023 00:52

Workflow configuration support + code cleanup

708af2c

Signed-off-by: Chris Gillum <cgillum@microsoft.com>

Testing and logging improvements + fixes

e29ec26

Signed-off-by: Chris Gillum <cgillum@microsoft.com>

cgillum commented Oct 24, 2023

Adding configuration tests

60401b1

Signed-off-by: Chris Gillum <cgillum@microsoft.com>

cgillum marked this pull request as ready for review October 24, 2023 03:50

cgillum requested review from a team as code owners October 24, 2023 03:50

ItalyPaleAle requested changes Oct 24, 2023

pkg/config/configuration.go (resolved)

pkg/config/configuration.go (outdated, resolved)

pkg/runtime/wfengine/activity.go (outdated, resolved)

pkg/runtime/wfengine/backend.go (resolved)

cgillum added 4 commits October 26, 2023 19:30

PR feedback and renamed properties to be a bit more specific

8edf866

Signed-off-by: Chris Gillum <cgillum@microsoft.com>

Merge branch 'master' into workflow-config

abaaf67

Signed-off-by: Chris Gillum <cgillum@microsoft.com>

PR feedback: if/else --> switch

3583dfe

Signed-off-by: Chris Gillum <cgillum@microsoft.com>

Merge branch 'master' into workflow-config

d773e3b

Signed-off-by: Chris Gillum <cgillum@microsoft.com>

cgillum requested a review from ItalyPaleAle October 30, 2023 23:09

ItalyPaleAle requested changes Oct 31, 2023

pkg/config/configuration.go (resolved)

cgillum added 2 commits October 31, 2023 01:34

More PR feedback + increase default values

77f5a19

Signed-off-by: Chris Gillum <cgillum@microsoft.com>

Merge branch 'master' into workflow-config

cf7d32d

Signed-off-by: Chris Gillum <cgillum@microsoft.com>

ItalyPaleAle previously approved these changes Oct 31, 2023

Use new getters

5e54218

Signed-off-by: Chris Gillum <cgillum@microsoft.com>

cgillum dismissed ItalyPaleAle’s stale review via 5e54218 October 31, 2023 01:36

ItalyPaleAle previously approved these changes Oct 31, 2023

mukundansundar reviewed Nov 2, 2023

pkg/config/configuration.go (outdated, resolved)

artursouza previously approved these changes Nov 14, 2023

Merge branch 'master' into workflow-config

050d036

DeepanshuA reviewed Nov 14, 2023

pkg/apis/configuration/v1alpha1/types.go (outdated, resolved)

pkg/apis/configuration/v1alpha1/types.go (outdated, resolved)

pkg/config/configuration.go (outdated, resolved)

pkg/config/configuration.go (outdated, resolved)

Fixing code comments

6a8913c

Signed-off-by: Chris Gillum <cgillum@microsoft.com>

cgillum dismissed artursouza’s stale review via 6a8913c November 14, 2023 21:05

Merge branch 'master' into workflow-config

6c797b5

mukundansundar reviewed Nov 15, 2023

Merge branch 'master' into workflow-config

e32647e

cgillum added 2 commits November 15, 2023 21:23

PR feedback and fix config unit test

9c01990

Signed-off-by: Chris Gillum <cgillum@microsoft.com>

Merge branch 'master' into workflow-config

6b77b7f

ItalyPaleAle approved these changes Nov 20, 2023

Merge branch 'master' into workflow-config

9ed79ee

Merge branch 'master' into workflow-config

bf86875

Merge branch 'master' into workflow-config

dfeca70

mukundansundar approved these changes Nov 21, 2023

mukundansundar added the automerge and autoupdate labels Nov 21, 2023

Merge branch 'master' into workflow-config

f05ffc7

artursouza merged commit bef0ca2 into dapr:master Nov 21, 2023
31 of 32 checks passed

cgillum mentioned this pull request Dec 23, 2023

[Workflow] Make the Dapr workflow engine configurable #7089

Closed

JoshVanL added this to the v1.13 milestone Feb 12, 2024
