Implement up --continue-on-error #15740

tgummerer · 2024-03-20T16:23:57Z

Similar to destroy --continue-on-error, this flag allows pulumi up
to continue if any errors are encountered.

Currently when we encounter an error while creating/updating a
resource, we cancel the context of the deployment executor, and thus
the deployment stops once the resources that are being processed in
parallel with the failed one finish being updated.

For --continue-on-error, we ignore these errors, and let the
deployment executor continue. In order for the deployment executor to
exit eventually we also have to mark these steps as done, as the
deployment executor will otherwise just hang, and callers with open
channels waiting for it to finish/report back will hang indefinitely.

The errors in the step will still be reported back to the user by the
OnResourceStepPost callback.
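The behaviour described above can be sketched roughly as follows. This is a minimal illustration and not the engine's code: the `step` type, its `done` channel, and the `execute` and `sampleSteps` functions are hypothetical names invented for this example, and steps run sequentially here to keep the sketch short.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// step is a hypothetical stand-in for a deployment step (not the engine's Step type).
type step struct {
	name string
	run  func() error
	done chan struct{} // closed once the step has been processed
}

func newStep(name string, run func() error) *step {
	return &step{name: name, run: run, done: make(chan struct{})}
}

// execute processes steps in order and returns the names of the steps that
// were run together with the errors they produced. Without continueOnError,
// the first failure cancels the context and no further steps are started.
func execute(steps []*step, continueOnError bool) (ran []string, errs []error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	for _, s := range steps {
		if ctx.Err() != nil {
			break // an earlier failure cancelled the deployment
		}
		ran = append(ran, s.name)
		if err := s.run(); err != nil {
			// The error is still recorded so it can be reported to the user.
			errs = append(errs, fmt.Errorf("%s: %w", s.name, err))
			if !continueOnError {
				cancel()
			}
		}
		// Mark the step as done whether or not it failed, so that anything
		// waiting on it is released instead of hanging.
		close(s.done)
	}
	return ran, errs
}

// sampleSteps builds three steps, the second of which always fails.
func sampleSteps() []*step {
	ok := func() error { return nil }
	fail := func() error { return errors.New("create failed") }
	return []*step{newStep("a", ok), newStep("b", fail), newStep("c", ok)}
}

func main() {
	for _, continueOnError := range []bool{false, true} {
		ran, errs := execute(sampleSteps(), continueOnError)
		fmt.Println(continueOnError, ran, errs)
	}
}
```

In this sketch step `c` only runs when `continueOnError` is set, while the failure of step `b` is collected in both modes.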

Fixes #14515

pulumi-bot · 2024-03-20T16:24:41Z

Changelog

[uncommitted] (2024-04-22)

Features

[engine] Add a --continue-on-error flag to pulumi up
#15740

pkg/engine/lifecycletest/pulumi_test.go

pkg/resource/deploy/step.go

Frassle · 2024-03-26T10:59:47Z

pkg/resource/deploy/source_eval.go

+			return &pulumirpc.RegisterResourceResponse{
+				Urn:                  string(result.State.URN),
+				Id:                   string(result.State.ID),
+				Object:               nil,


So the SDK gets a response that the resource did register, but just has no outputs, and might have an empty ID?
That seems suspect, I don't trust SDKs are going to handle this well vis-a-vis working out unknown vs undefined vs error'd output states.

proto/pulumi/resource.proto

Frassle · 2024-04-10T07:32:13Z

sdk/nodejs/output.ts

@@ -307,6 +307,7 @@ To manipulate the value of this Output, use '.apply' instead.`);
    public apply<U>(func: (t: T) => Input<U>, runWithUnknowns?: boolean): Output<U> {
        // we're inside the modern `output` code, so it's safe to call `.allResources!` here.

+        runWithUnknowns = true;


I think to minimise blast radius we could do a pass removing the ifDryRun apply logic from nodejs as its own PR. Nothing should really depend on that, and we can sanity check that just that change is safe.

cmd/pulumi-test-language/testdata/l2-failed-create-continue-on-error/main.pp

Similar to what we're doing to the other SDKs in pulumi/pulumi#15740, this enables dealing with the SkipReason field in the RegisterResource response.

cmd/pulumi-test-language/interface.go

pkg/resource/deploy/deployment_executor.go

Frassle · 2024-04-17T13:53:35Z

pkg/resource/deploy/source.go

+	Failed  bool            // true if the resource registration failed.
+	Skipped bool            // true if the resource registration was skipped.


Dislike that this is a tri-state expressed as two bools :( Maybe enum it instead as something like SUCCESS/FAILED/SKIPPED?

Yeah an enum would be much better. Done in 9ac2fde
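To illustrate the point of this thread, here is a minimal sketch of a tri-state enum in Go. The `Result` type, its constants, and `fromBools` are hypothetical names chosen for this example (they follow the SUCCESS/FAILED/SKIPPED suggestion above) and are not taken from the commit itself.

```go
package main

import (
	"errors"
	"fmt"
)

// Result is a hypothetical tri-state outcome of a resource registration.
type Result int

const (
	ResultSuccess Result = iota
	ResultFailed
	ResultSkipped
)

func (r Result) String() string {
	switch r {
	case ResultSuccess:
		return "SUCCESS"
	case ResultFailed:
		return "FAILED"
	case ResultSkipped:
		return "SKIPPED"
	default:
		return fmt.Sprintf("Result(%d)", int(r))
	}
}

// fromBools maps the two-bool representation onto the enum. Two bools allow
// four combinations, one of which (failed and skipped) has no meaning; the
// enum cannot express that state at all.
func fromBools(failed, skipped bool) (Result, error) {
	switch {
	case failed && skipped:
		return ResultSuccess, errors.New("failed and skipped are both set")
	case failed:
		return ResultFailed, nil
	case skipped:
		return ResultSkipped, nil
	default:
		return ResultSuccess, nil
	}
}

func main() {
	r, err := fromBools(false, true)
	fmt.Println(r, err)

	_, err = fromBools(true, true)
	fmt.Println(err)
}
```

The design benefit is that callers switch over one value instead of checking two flags whose combination may be contradictory.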

proto/pulumi/resource.proto

Frassle · 2024-04-19T07:14:11Z

pkg/engine/lifecycletest/pulumi_test.go

 	}

 	programF := deploytest.NewLanguageRuntimeF(func(_ plugin.RunInfo, monitor *deploytest.ResourceMonitor) error {
-		stackURN, _, _, _, err := monitor.RegisterResource(resource.RootStackType, "test", false)
+		failing, _, _, _, err := monitor.RegisterResource("pkgB:m:typB", "failing", true, deploytest.ResourceOptions{


Should we update RegisterResource here to return the Reason field? Maybe join all the fields into a struct at this point rather than adding another return value, could struct-ify in another PR to keep changes small then just rebase and add Reason to the struct in this PR. I don't like that there's a critical response being returned here that we're not even looking at.

Yeah that sounds like a good idea: done in #15988.

Done now. I've rebased this PR and temporarily based this PR on that branch to make reviewing easier until that can go through the merge queue.

That PR has merged now, so this is again based on master.
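The struct-ification discussed in this thread could look roughly like the sketch below. It is an assumption-laden illustration: `registerResponse`, its fields, and `registerResource` are hypothetical names and do not reproduce the actual deploytest API or the change made in #15988.

```go
package main

import "fmt"

// registerResponse is a hypothetical struct that groups everything a
// registration returns, so new fields can be added without changing every
// call site that destructures a long list of return values.
type registerResponse struct {
	URN     string
	ID      string
	Outputs map[string]any
	Reason  string // empty when the registration succeeded
}

// registerResource is a stand-in for a test helper. It fabricates a response
// and reports a failed registration through Reason rather than through err.
func registerResource(typ, name string, fail bool) (*registerResponse, error) {
	// Illustrative URN only; the stack and project names are made up.
	urn := fmt.Sprintf("urn:pulumi:stack::project::%s::%s", typ, name)
	if fail {
		return &registerResponse{URN: urn, Reason: "create failed"}, nil
	}
	return &registerResponse{
		URN:     urn,
		ID:      name + "-id",
		Outputs: map[string]any{"name": name},
	}, nil
}

func main() {
	resp, err := registerResource("pkgB:m:typB", "failing", true)
	if err != nil {
		panic(err)
	}
	// Callers read only the fields they need, including the reason.
	fmt.Println(resp.URN, resp.Reason)
}
```

With a struct, a test can assert on `Reason` directly instead of discarding it behind a blank identifier.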

Frassle · 2024-04-19T07:14:49Z

pkg/engine/lifecycletest/pulumi_test.go

-		err = monitor.RegisterResourceOutputs(stackURN, outputs)
+		independent1, _, _, _, err := monitor.RegisterResource(
+			"pkgA:m:typA", "independent1", true, deploytest.ResourceOptions{
+				SupportsResultReporting: true,


Should we have another test that does the same as this one but without SupportsResultReporting, and check the engine still handles that correctly (for old SDKs)?

Good idea, done for this and below ✔️

Frassle · 2024-04-19T07:15:32Z

pkg/engine/lifecycletest/pulumi_test.go

 	programF := deploytest.NewLanguageRuntimeF(func(_ plugin.RunInfo, monitor *deploytest.ResourceMonitor) error {
-		stackURN, _, _, _, err := monitor.RegisterResource(resource.RootStackType, "test", false)
+		_, _, _, _, err := monitor.RegisterResource("pkgB:m:typB", "failing", true, deploytest.ResourceOptions{
+			SupportsResultReporting: true,


Same as above, we should have a version of this without SupportsResultReporting for testing old SDKs

Frassle · 2024-04-19T07:18:26Z

pkg/resource/deploy/deployment_executor.go

@@ -445,8 +449,36 @@ func (ex *deploymentExecutor) handleSingleEvent(event SourceEvent) error {
 	if err != nil {
 		return err
 	}
+	// Exclude the steps that depend on errored steps if ContinueOnError is set.


Why is the deployment_executor responsible for this rather than the step generator, or even the source eval? We know the step generator/source eval will not see the new registration till the step_executor has finished it because it's dependent.

There are a few reasons I kept it here:

- It matches what destroy --continue-on-error does, and I think that needs to live in the deployment executor because the step generator generates steps for all deletes and doesn't know which ones failed.

- These steps do exist, they just don't get executed for deployment. So I think the deployment executor is best suited to make a decision on that. The step generator also has no access to the step executor's failed steps, so we'd need to inject them there somewhere.

Fair enough
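The exclusion of steps that depend on errored steps, as discussed in this thread, can be sketched as follows. This is a simplified illustration under stated assumptions: the `step` type and `partition` function are hypothetical, steps are assumed to arrive in dependency order, and dependencies are plain URN strings.

```go
package main

import "fmt"

// step is a hypothetical step with the URNs it depends on.
type step struct {
	urn  string
	deps []string
}

// partition splits steps into those that may run and those that are skipped
// because they depend, directly or transitively, on a failed step.
// It assumes steps are listed in dependency order.
func partition(steps []step, failed []string) (run, skipped []string) {
	blocked := make(map[string]bool, len(failed))
	for _, urn := range failed {
		blocked[urn] = true
	}
	for _, s := range steps {
		skip := false
		for _, d := range s.deps {
			if blocked[d] {
				skip = true
				break
			}
		}
		if skip {
			// A skipped step blocks its own dependents as well.
			blocked[s.urn] = true
			skipped = append(skipped, s.urn)
			continue
		}
		run = append(run, s.urn)
	}
	return run, skipped
}

func main() {
	steps := []step{
		{urn: "b", deps: []string{"a"}},
		{urn: "c", deps: []string{"b"}},
		{urn: "d"},
	}
	run, skipped := partition(steps, []string{"a"})
	fmt.Println("run:", run, "skipped:", skipped)
}
```

Here `b` is skipped because it depends on the failed `a`, `c` is skipped because it depends on `b`, and the independent `d` still runs.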

Co-authored-by: Fraser Waters <fraser@pulumi.com>

* make dotnet SDK support SkipReason

  Similar to what we're doing to the other SDKs in pulumi/pulumi#15740, this enables dealing with the SkipReason field in the RegisterResource response.

* add changelog
* import latest proto files and do fixups

Users are already able to use --continue-on-error for destroy in the automation API. Implement the same for `up` as well. This should only be merged after #15740 is. (Note that this still includes the commits from there, I will rebase once that PR is merged, hence this is only a draft PR for now)

Tentative changelog, but also planning to include #16057

### Features

- [auto/{go,nodejs,python}] Add support for the continue-on-error parameter of the up command to the Automation API [#15953](#15953)
- [engine] Add a --continue-on-error flag to pulumi up [#15740](#15740)

### Bug Fixes

- [sdk/nodejs] Fix a race condition that could cause the NodeJS runtime to terminate before finishing all work [#16005](#16005)
- [sdk/python] Fix an exception when setting providers resource option with a dict [#16022](#16022)
- [sdk/python] Fix event loop tracking in the python SDK when using remote transforms [#16039](#16039)
- [sdk/python] Workaround lazy module loading regression [#16038](#16038)

### Miscellaneous

- [cli/plugin] Move PluginKind type definition into apitype and re-export for backward compatibility

Will wait to merge this until after #16057 merges

### Features

- [auto/{go,nodejs,python}] Add support for the continue-on-error parameter of the up command to the Automation API [#15953](#15953)
- [engine] Add a --continue-on-error flag to pulumi up [#15740](#15740)

### Bug Fixes

- [sdk/nodejs] Fix a race condition that could cause the NodeJS runtime to terminate before finishing all work [#16005](#16005)
- [sdk/python] Fix an exception when setting providers resource option with a dict [#16022](#16022)
- [sdk/python] Fix event loop tracking in the python SDK when using remote transforms [#16039](#16039)
- [sdk/python] Workaround lazy module loading regression [#16038](#16038)

### Miscellaneous

- [cli/plugin] Move PluginKind type definition into apitype and re-export for backward compatibility [#15946](#15946)

tgummerer force-pushed the tg/up-continue-on-error-really-for-up-now branch 3 times, most recently from 6fb4216 to 7fadab1 Compare March 22, 2024 10:21

tgummerer changed the title ~~implement up --continue-on-error~~ Implement up --continue-on-error Mar 22, 2024

tgummerer marked this pull request as ready for review March 22, 2024 10:23

tgummerer requested a review from a team as a code owner March 22, 2024 10:23

Frassle reviewed Mar 22, 2024

Frassle reviewed Mar 26, 2024

lukehoban mentioned this pull request Mar 28, 2024

Add flag to allow as much as possible of a deployment to complete even after a failure #13306

Closed

tgummerer force-pushed the tg/up-continue-on-error-really-for-up-now branch from 4c87d7c to 841a964 Compare April 10, 2024 07:15

Frassle reviewed Apr 10, 2024

tgummerer force-pushed the tg/up-continue-on-error-really-for-up-now branch 3 times, most recently from 4c2cab2 to 575a9c5 Compare April 15, 2024 16:17

Frassle reviewed Apr 15, 2024

tgummerer force-pushed the tg/up-continue-on-error-really-for-up-now branch 2 times, most recently from 4f8cf84 to 7913740 Compare April 16, 2024 09:12

tgummerer mentioned this pull request Apr 16, 2024

Implement --continue-on-error for up in Automation API #15953

Merged

tgummerer force-pushed the tg/up-continue-on-error-really-for-up-now branch from 538a6ef to e7a7fb1 Compare April 16, 2024 10:59

tgummerer added a commit to pulumi/pulumi-dotnet that referenced this pull request Apr 16, 2024

make dotnet SDK support SkipReason

479e36c

Similar to what we're doing to the other SDKs in pulumi/pulumi#15740, this enables dealing with the SkipReason field in the RegisterResource response.

tgummerer added a commit to pulumi/pulumi-dotnet that referenced this pull request Apr 16, 2024

make dotnet SDK support SkipReason

6b8e909

Similar to what we're doing to the other SDKs in pulumi/pulumi#15740, this enables dealing with the SkipReason field in the RegisterResource response.

tgummerer mentioned this pull request Apr 16, 2024

Make dotnet SDK support result reporting pulumi/pulumi-dotnet#259

Merged

Frassle reviewed Apr 17, 2024

tgummerer force-pushed the tg/up-continue-on-error-really-for-up-now branch 3 times, most recently from 2b2ffb4 to 39f6e58 Compare April 18, 2024 06:53

Frassle reviewed Apr 19, 2024

tgummerer force-pushed the tg/up-continue-on-error-really-for-up-now branch from 39f6e58 to 93cc756 Compare April 19, 2024 10:47

tgummerer changed the base branch from master to tg/structify-deploytest-register-resource-response April 19, 2024 10:47

tgummerer and others added 13 commits April 19, 2024 14:19

support skipreason in Go sdk

358f042

make linter happy

14ba9bc

fail_on_create 2.0.0 -> 3.0.0

6414709

remove extra resources

d314aaf

fix conformance tests

8abca7b

Update pkg/resource/deploy/deployment_executor.go

ff9d916

Co-authored-by: Fraser Waters <fraser@pulumi.com>

make result a trystate

3ed2025

add conformance test test

c3e57cb

rename SkipReason -> result

31894da

fixup nodejs

dad8074

add SDK test and check Result in RPC response

a62401e

re-add test that got removed in a mis-merge

84a6148

lint fix

a519bc2

tgummerer force-pushed the tg/up-continue-on-error-really-for-up-now branch from c2f68e5 to a519bc2 Compare April 19, 2024 12:19

Frassle approved these changes Apr 22, 2024

tgummerer added this pull request to the merge queue Apr 22, 2024

pass correct type hopefully

73a7df4

tgummerer removed this pull request from the merge queue due to a manual request Apr 22, 2024

tgummerer enabled auto-merge April 22, 2024 10:53

tgummerer added this pull request to the merge queue Apr 22, 2024

Merged via the queue into master with commit 4169755 Apr 22, 2024
49 checks passed

tgummerer deleted the tg/up-continue-on-error-really-for-up-now branch April 22, 2024 13:29

justinvp mentioned this pull request Apr 22, 2024

[Epic] pulumi up/destroy --continue-on-error #16035

Closed

12 tasks

justinvp mentioned this pull request Apr 25, 2024

Prepare for v3.114.0 release #16061

Merged

justinvp mentioned this pull request Apr 25, 2024

Freeze v3.114.0 #16062

Merged
