Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Workflow] Make workflow engine configurable (and other improvements) #7090

Merged
merged 22 commits into from Nov 21, 2023

Conversation

cgillum
Copy link
Contributor

@cgillum cgillum commented Oct 24, 2023

Description

Until now, there was no way to configure the workflow engine. Some values, such as maximum concurrency thresholds, had been hardcoded. With this PR, however, it's now possible to configure the workflow engine. In particular, this PR enables configuring the maximum workflow and activity execution concurrency (defaults to 100, as it did previously).

This PR also contains other misc. fixes and code refactoring. Specifically:

  • Reorganized some of the workflow and activity configuration to make it more cleanly separated from workflow engine configuration.
  • Added some additional debug logging based on issues I had while debugging my changes.
  • Updated workflow test code to output debug logs (previously was only outputting info logs)

I've added comments to the PR directly to help explain some of the specific changes.

Issue reference

#7089

Checklist

Signed-off-by: Chris Gillum <cgillum@microsoft.com>
Signed-off-by: Chris Gillum <cgillum@microsoft.com>
Copy link
Contributor Author

@cgillum cgillum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding explanatory comments

@@ -130,8 +138,6 @@ func (a *activityActor) InvokeReminder(ctx context.Context, actorID string, remi
}
}

// TODO: Purge actor state based on some data retention policy
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed this TODO since this work was already done in v1.11.

ScheduleWorkflow(ctx context.Context, wi *backend.OrchestrationWorkItem) error
ScheduleActivity(ctx context.Context, wi *backend.ActivityWorkItem) error
// actorsBackendConfig is the configuration for the workflow engine's actors backend
type actorsBackendConfig struct {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This struct and the functions that follow were moved from wfengine.go to here since they are specific to the workflow backend. No changes were made to these.

// workflowScheduler is an interface for pushing work items into the backend
type workflowScheduler interface {
ScheduleWorkflow(ctx context.Context, wi *backend.OrchestrationWorkItem) error
ScheduleActivity(ctx context.Context, wi *backend.ActivityWorkItem) error
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I replaced the workflowScheduler interface with two separate func definitions: workflowScheduler and activityScheduler, defined in workflow.go and activity.go respectively. This change made it easier to refactor some of the configuration. No changes were made to the required method signatures.

@@ -210,12 +257,18 @@ func (*actorBackend) DeleteTaskHub(context.Context) error {

// GetActivityWorkItem implements backend.Backend
func (be *actorBackend) GetActivityWorkItem(ctx context.Context) (*backend.ActivityWorkItem, error) {
// Wait for the workflow actor to signal us with some work to do
// Wait for the activity actor to signal us with some work to do
wfLogger.Debug("Actor backend is waiting for an activity actor to schedule an invocation.")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These debug logs help make it easier to understand what's going on if a workflow or activity doesn't run when we expect it to (a problem that we've often seen but lacked clarity on what's actually stuck).

wfe.backend,
wfe.executor,
wfBackendLogger,
backend.WithMaxParallelism(wfe.spec.MaxConcurrentWorkflows))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where we actually reference the new max concurrency configuration settings. Previously these values were hardcoded to 100 (see the deleted lines above here).

Signed-off-by: Chris Gillum <cgillum@microsoft.com>
@cgillum cgillum marked this pull request as ready for review October 24, 2023 03:50
@cgillum cgillum requested review from a team as code owners October 24, 2023 03:50
@yaron2
Copy link
Member

yaron2 commented Oct 24, 2023

Adding configuration options are useful only if developers know when they should tweak them. For workflow concurrency, what is the guidelines for developers? When should they tweak the number below and above 100, and what are the tradeoffs of tweaking that number?

@codecov
Copy link

codecov bot commented Oct 24, 2023

Codecov Report

Attention: 29 lines in your changes are missing coverage. Please review.

Comparison is base (a86f9d6) 64.91% compared to head (f05ffc7) 64.87%.

Files Patch % Lines
pkg/runtime/wfengine/backend.go 78.43% 11 Missing ⚠️
pkg/config/configuration.go 33.33% 9 Missing and 1 partial ⚠️
pkg/runtime/wfengine/activity.go 41.66% 6 Missing and 1 partial ⚠️
pkg/runtime/runtime.go 50.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7090      +/-   ##
==========================================
- Coverage   64.91%   64.87%   -0.04%     
==========================================
  Files         221      221              
  Lines       21004    21056      +52     
==========================================
+ Hits        13634    13661      +27     
- Misses       6213     6239      +26     
+ Partials     1157     1156       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

pkg/config/configuration.go Show resolved Hide resolved
pkg/config/configuration.go Outdated Show resolved Hide resolved
pkg/runtime/wfengine/activity.go Outdated Show resolved Hide resolved
pkg/runtime/wfengine/backend.go Show resolved Hide resolved
@cgillum
Copy link
Contributor Author

cgillum commented Oct 24, 2023

Adding configuration options are useful only if developers know when they should tweak them. For workflow concurrency, what is the guidelines for developers? When should they tweak the number below and above 100, and what are the tradeoffs of tweaking that number?

@yaron2 Does the linked docs PR help answer this question? I can add more guidance if necessary but may need direction in terms of where the best place to document such guidance is.

@yaron2
Copy link
Member

yaron2 commented Oct 24, 2023

Adding configuration options are useful only if developers know when they should tweak them. For workflow concurrency, what is the guidelines for developers? When should they tweak the number below and above 100, and what are the tradeoffs of tweaking that number?

@yaron2 Does the linked docs PR help answer this question? I can add more guidance if necessary but may need direction in terms of where the best place to document such guidance is.

Additional context needs to be added, and I'll make comments on the docs PR based on the discussion here.

What can a user expect when they schedule a workflow or activity that passes the threshold? Is it rejected or queued until a workflow/activity complete?

@cgillum
Copy link
Contributor Author

cgillum commented Oct 24, 2023

What can a user expect when they schedule a workflow or activity that passes the threshold? Is it rejected or queued until a workflow/activity complete?

A request that passes the threshold will be queued until a workflow/activity completes.

Note that this is primarily intended to be used as a safety mechanism to prevent runaway execution by workflows. For example, if a workflow were to schedule 1,000 concurrent CPU-intensive activities, this feature can be used to help ensure that we don't try to execute all of that work at the same time on any given machine.

@cgillum
Copy link
Contributor Author

cgillum commented Oct 24, 2023

When should they tweak the number below and above 100

Below 100 when the developer/operator needs to run workflows in resource-limited environments, or if it is known in advance that activity/workflow execution can be very CPU/memory intensive.

Above 100 when that value is too restrictive given the amount of compute resources, limiting their overall workflow throughput.

and what are the tradeoffs of tweaking that number?

Increasing the value allows for greater concurrency (throughput) at the cost of more resource consumption. Lowering the value leads to more conservative resource consumption at the cost of reduced throughput.

Note that we can debate whether 100 is the right number. I'm beginning to think setting the default higher (perhaps 1,000) might be reasonable. However, as of 1.12, the limit is hardcoded to 100.

Signed-off-by: Chris Gillum <cgillum@microsoft.com>
Signed-off-by: Chris Gillum <cgillum@microsoft.com>
Signed-off-by: Chris Gillum <cgillum@microsoft.com>
Signed-off-by: Chris Gillum <cgillum@microsoft.com>
Signed-off-by: Chris Gillum <cgillum@microsoft.com>
Signed-off-by: Chris Gillum <cgillum@microsoft.com>
ItalyPaleAle
ItalyPaleAle previously approved these changes Oct 31, 2023
Signed-off-by: Chris Gillum <cgillum@microsoft.com>
ItalyPaleAle
ItalyPaleAle previously approved these changes Oct 31, 2023
@yaron2
Copy link
Member

yaron2 commented Oct 31, 2023

Below 100 when the developer/operator needs to run workflows in resource-limited environments, or if it is known in advance that activity/workflow execution can be very CPU/memory intensive.

But what is resource-limited? It's very subjective. I'm extremely worried about using magic numbers with no proper guidance. The fact you wrote this could be made 1000 instead of 100 strengthens this.

Increasing the value allows for greater concurrency (throughput) at the cost of more resource consumption. Lowering the value leads to more conservative resource consumption at the cost of reduced throughput.

This might be a proper way to describe this to users, so they know to observe and tweak if needed. Please carry out this language to the associated docs issue.

@yaron2
Copy link
Member

yaron2 commented Oct 31, 2023

A request that passes the threshold will be queued until a workflow/activity completes

When queued, will users get an ACK from the Dapr runtime immediately after queuing or will the client hang until its picked up?

@cgillum
Copy link
Contributor Author

cgillum commented Oct 31, 2023

But what is resource-limited? It's very subjective. I'm extremely worried about using magic numbers with no proper guidance.

Can you clarify what specifically you're extremely worried about? People configuring it wrong in a particular way?

Backing up to talk a little bit more about the motivation for this work: In my experience, the most common resource-related issue that users run into is running out of memory. It's really easy to run out of memory because the workflow programming model makes it really easy to schedule tons of work to run in parallel, which is obviously a double-edged sword. If those activities need to allocate a non-trivial amount of memory, then apps may start getting ugly OOM failures that create massive app stability issues. Next up would be overwhelming some downstream dependency, like a database, with too many connections because you have too many activities running concurrently. High CPU is another one, which should be self-explanatory. These are all app-specific issues which will apply to some workloads but not to others. The fastest and most effective way to mitigate most of these problems when they impact production is to make a config change to reduce concurrency. Dapr Workflow users today have no such capability since we currently hardcode these throttles - hence this PR.

This might be a proper way to describe this to users, so they know to observe and tweak if needed. Please carry out this language to the associated docs issue.

Yes, this is absolutely an "observe and tweak if needed" configuration knob. I can make this clearer in the documentation.

When queued, will users get an ACK from the Dapr runtime immediately after queuing or will the client hang until its picked up?

Regardless of whether you're above or below the threshold, the client always gets an immediate ACK. Clients are never blocked on workflow operations. Similarly, workflows are never blocked on scheduling activity calls. These existing throttles are expected only to add latency between the time you schedule some work and the time when that work starts executing.

artursouza
artursouza previously approved these changes Nov 14, 2023
pkg/apis/configuration/v1alpha1/types.go Outdated Show resolved Hide resolved
pkg/apis/configuration/v1alpha1/types.go Outdated Show resolved Hide resolved
pkg/config/configuration.go Outdated Show resolved Hide resolved
pkg/config/configuration.go Outdated Show resolved Hide resolved
Signed-off-by: Chris Gillum <cgillum@microsoft.com>
Comment on lines 529 to 532
if !wf.cachingDisabled {
// update cached state
wf.states.Store(actorID, state)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need the same check at the start of the function, at line 514?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also in saveInternalState the cache is being updated even if the TransactionStateOperation fails lines 539 and 549?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not strictly necessary for 514 since the cache should be empty.

For 539, 549, yes this needs to be fixed. That issue is why some Cosmos DB compatibility problems weren’t more quickly detected. Could do it now or in another PR.

@mukundansundar
Copy link
Contributor

/ok-to-test

@mukundansundar
Copy link
Contributor

/test-version-skew

@dapr-bot
Copy link
Collaborator

dapr-bot commented Nov 15, 2023

Dapr E2E test

🔗 Link to Action run

Commit ref: e32647e

✅ Build succeeded for linux/amd64

  • Image tag: dapre2e5f2fa5fcd1l
  • Test image tag: dapre2e5f2fa5fcd1l

✅ Infrastructure deployed

Cluster Resource group name Azure region
Linux Dapr-E2E-dapre2e5f2fa5fcd1l westus3
Windows Dapr-E2E-dapre2e5f2fa5fcd1w westus3
Linux/arm64 Dapr-E2E-dapre2e5f2fa5fcd1la eastus

✅ Build succeeded for windows/amd64

  • Image tag: dapre2e5f2fa5fcd1w
  • Test image tag: dapre2e5f2fa5fcd1w

❌ Tests failed on windows/amd64

Please check the logs for details on the error.

❌ Tests failed on linux/amd64

Please check the logs for details on the error.

@dapr-bot
Copy link
Collaborator

dapr-bot commented Nov 15, 2023

Dapr Version Skew test (dapr-sidecar-master - 1.12.0)

🔗 Link to Action run

Commit ref: e32647e

❌ Version Skew tests failed

Please check the logs for details on the error.

@dapr-bot
Copy link
Collaborator

dapr-bot commented Nov 15, 2023

Dapr Version Skew test (control-plane-master - 1.12.0)

🔗 Link to Action run

Commit ref: e32647e

✅ Version Skew tests passed

@dapr-bot
Copy link
Collaborator

dapr-bot commented Nov 21, 2023

Dapr Version Skew test (dapr-sidecar-master - 1.12.2)

🔗 Link to Action run

Commit ref: e32647e

❌ Version Skew tests failed

Please check the logs for details on the error.

@mukundansundar
Copy link
Contributor

/ok-to-test

@dapr-bot
Copy link
Collaborator

dapr-bot commented Nov 21, 2023

Dapr E2E test

🔗 Link to Action run

Commit ref: bf86875

✅ Build succeeded for linux/amd64

  • Image tag: dapre2e26b9f10ea9l
  • Test image tag: dapre2e26b9f10ea9l

✅ Infrastructure deployed

Cluster Resource group name Azure region
Linux Dapr-E2E-dapre2e26b9f10ea9l westus3
Windows Dapr-E2E-dapre2e26b9f10ea9w westus3
Linux/arm64 Dapr-E2E-dapre2e26b9f10ea9la eastus

✅ Build succeeded for windows/amd64

  • Image tag: dapre2e26b9f10ea9w
  • Test image tag: dapre2e26b9f10ea9w

✅ Tests succeeded on windows/amd64

  • Image tag: dapre2e26b9f10ea9w
  • Test image tag: dapre2e26b9f10ea9w

✅ Tests succeeded on linux/amd64

  • Image tag: dapre2e26b9f10ea9l
  • Test image tag: dapre2e26b9f10ea9l

@mukundansundar mukundansundar added automerge Allows DaprBot to automerge and update PR if all approvals are in place autoupdate DaprBot will keep the Pull Request up to date with master branch labels Nov 21, 2023
@cgillum
Copy link
Contributor Author

cgillum commented Nov 21, 2023

Tests are looking good. Are we good to merge? @mukundansundar @artursouza @yaron2

I see that there's an automerge label added, but I'm not sure what the criteria for that is.

@artursouza artursouza merged commit bef0ca2 into dapr:master Nov 21, 2023
31 of 32 checks passed
@JoshVanL JoshVanL added this to the v1.13 milestone Feb 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
automerge Allows DaprBot to automerge and update PR if all approvals are in place autoupdate DaprBot will keep the Pull Request up to date with master branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants