Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement up --continue-on-error #15740

Merged
merged 22 commits into from
Apr 22, 2024

Conversation

tgummerer
Copy link
Collaborator

@tgummerer tgummerer commented Mar 20, 2024

Similar to destroy --continue-on-error, this flag allows pulumi up
to continue if any errors are encountered.

Currently when we encounter an error while creating/updating a
resource, we cancel the context of the deployment executor, and thus
the deployment stops once the resources that are being processed in
parallel with the failed one finish being updated.

For --continue-on-error, we ignore these errors, and let the
deployment executor continue. In order for the deployment executor to
exit eventually we also have to mark these steps as done, as the
deployment executor will otherwise just hang, and callers with open
channels waiting for it to finish/report back will hang indefinitely.

The errors in the step will still be reported back to the user by the
OnResourceStepPost callback.

Fixes #14515

@pulumi-bot
Copy link
Contributor

pulumi-bot commented Mar 20, 2024

Changelog

[uncommitted] (2024-04-22)

Features

  • [engine] Add a --continue-on-error flag to pulumi up
    #15740

@tgummerer tgummerer force-pushed the tg/up-continue-on-error-really-for-up-now branch 3 times, most recently from 6fb4216 to 7fadab1 Compare March 22, 2024 10:21
@tgummerer tgummerer changed the title implement up --continue-on-error Implement up --continue-on-error Mar 22, 2024
@tgummerer tgummerer marked this pull request as ready for review March 22, 2024 10:23
@tgummerer tgummerer requested a review from a team as a code owner March 22, 2024 10:23
pkg/engine/lifecycletest/pulumi_test.go Outdated Show resolved Hide resolved
pkg/engine/lifecycletest/pulumi_test.go Outdated Show resolved Hide resolved
pkg/resource/deploy/step.go Outdated Show resolved Hide resolved
return &pulumirpc.RegisterResourceResponse{
Urn: string(result.State.URN),
Id: string(result.State.ID),
Object: nil,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the SDK gets a response that the resource did register, but just has no outputs, and might have an empty ID?
That seems suspect, I don't trust SDKs are going to handle this well vis-a-vis working out unknown vs undefined vs error'd output states.

proto/pulumi/resource.proto Outdated Show resolved Hide resolved
@@ -307,6 +307,7 @@ To manipulate the value of this Output, use '.apply' instead.`);
public apply<U>(func: (t: T) => Input<U>, runWithUnknowns?: boolean): Output<U> {
// we're inside the modern `output` code, so it's safe to call `.allResources!` here.

runWithUnknowns = true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think to minimise blast radius we could do a pass removing the ifDryRun apply logic from nodejs as it's own PR. Nothing should really depend on that, and we can sanity check just that change is safe.

@tgummerer tgummerer force-pushed the tg/up-continue-on-error-really-for-up-now branch 3 times, most recently from 4c2cab2 to 575a9c5 Compare April 15, 2024 16:17
@tgummerer tgummerer force-pushed the tg/up-continue-on-error-really-for-up-now branch 2 times, most recently from 4f8cf84 to 7913740 Compare April 16, 2024 09:12
@tgummerer tgummerer force-pushed the tg/up-continue-on-error-really-for-up-now branch from 538a6ef to e7a7fb1 Compare April 16, 2024 10:59
tgummerer added a commit to pulumi/pulumi-dotnet that referenced this pull request Apr 16, 2024
Similar to what we're doing to the other SDKs in
pulumi/pulumi#15740, this enables dealing with
the SkipReason field in the RegisterResource response.
tgummerer added a commit to pulumi/pulumi-dotnet that referenced this pull request Apr 16, 2024
Similar to what we're doing to the other SDKs in
pulumi/pulumi#15740, this enables dealing with
the SkipReason field in the RegisterResource response.
cmd/pulumi-test-language/interface.go Show resolved Hide resolved
pkg/resource/deploy/deployment_executor.go Outdated Show resolved Hide resolved
Comment on lines 92 to 93
Failed bool // true if the resource registration failed.
Skipped bool // true if the resource registration was skippeg.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dislike that this is a tri-state express as two bools :( Maybe enum it instead as something like SUCCESS/FAILED/SKIPPED?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah an enum would be much better. Done in 9ac2fde

proto/pulumi/resource.proto Outdated Show resolved Hide resolved
@tgummerer tgummerer force-pushed the tg/up-continue-on-error-really-for-up-now branch 3 times, most recently from 2b2ffb4 to 39f6e58 Compare April 18, 2024 06:53
}

programF := deploytest.NewLanguageRuntimeF(func(_ plugin.RunInfo, monitor *deploytest.ResourceMonitor) error {
stackURN, _, _, _, err := monitor.RegisterResource(resource.RootStackType, "test", false)
failing, _, _, _, err := monitor.RegisterResource("pkgB:m:typB", "failing", true, deploytest.ResourceOptions{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we update RegisterResource here to return the Reason field? Maybe join all the fields into a struct at this point rather than adding another return value, could struct-ify in another PR to keep changes small then just rebase and add Reason to the struct in this PR. I don't like that there's a critical response being returned here that we're not even looking at.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that sounds like a good idea: done in #15988.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done now. I've rebased this PR and temporarily based this PR on that branch to make reviewing easier until that can go through the merge queue.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR merged now, so this is again based on master.

err = monitor.RegisterResourceOutputs(stackURN, outputs)
independent1, _, _, _, err := monitor.RegisterResource(
"pkgA:m:typA", "independent1", true, deploytest.ResourceOptions{
SupportsResultReporting: true,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have another test that does the same as this one but without SupportsResultReporting and check the engine still handles that correctly (for old sdks)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, done for this and below ✔️

programF := deploytest.NewLanguageRuntimeF(func(_ plugin.RunInfo, monitor *deploytest.ResourceMonitor) error {
stackURN, _, _, _, err := monitor.RegisterResource(resource.RootStackType, "test", false)
_, _, _, _, err := monitor.RegisterResource("pkgB:m:typB", "failing", true, deploytest.ResourceOptions{
SupportsResultReporting: true,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, we should have a version of this without SupportsResultReporting for testing old SDKs

@@ -445,8 +449,36 @@ func (ex *deploymentExecutor) handleSingleEvent(event SourceEvent) error {
if err != nil {
return err
}
// Exclude the steps that depend on errored steps if ContinueOnError is set.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the deployment_executor responsible for this rather than the step generator, or even the source eval? We know the step generator/source eval will not see the new registration till the step_executor has finished it because it's dependent.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a few reasons I kept it here:

  • It matches what destroy --continue-on-error does, and I think that needs to live in the deployment executor because the step generator generates steps for all deletes and doesn't know which ones failed.
  • These steps do exist, they just don't get executed for deployment. So I think the deployment executor is best suited to make a decision on that. The step generator also has no access to the step executors failed steps, so we'd need to inject them there somewhere.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough

@tgummerer tgummerer force-pushed the tg/up-continue-on-error-really-for-up-now branch from 39f6e58 to 93cc756 Compare April 19, 2024 10:47
@tgummerer tgummerer changed the base branch from master to tg/structify-deploytest-register-resource-response April 19, 2024 10:47
@tgummerer tgummerer force-pushed the tg/up-continue-on-error-really-for-up-now branch from c2f68e5 to a519bc2 Compare April 19, 2024 12:19
@tgummerer tgummerer added this pull request to the merge queue Apr 22, 2024
@tgummerer tgummerer removed this pull request from the merge queue due to a manual request Apr 22, 2024
@tgummerer tgummerer added this pull request to the merge queue Apr 22, 2024
Merged via the queue into master with commit 4169755 Apr 22, 2024
49 checks passed
@tgummerer tgummerer deleted the tg/up-continue-on-error-really-for-up-now branch April 22, 2024 13:29
github-merge-queue bot pushed a commit to pulumi/pulumi-dotnet that referenced this pull request Apr 22, 2024
* make dotnet SDK support SkipReason

Similar to what we're doing to the other SDKs in
pulumi/pulumi#15740, this enables dealing with
the SkipReason field in the RegisterResource response.

* add changelog

* import latest proto files and do fixups
github-merge-queue bot pushed a commit that referenced this pull request Apr 23, 2024
Users are already able to use --continue-on-error for destroy in the
automation API. Implement the same for `up` as well.

This should only be merged after
#15740 is. (Note that this still
includes the commits from there, I will rebase once that PR is merged,
hence this is only a draft PR for now)
github-merge-queue bot pushed a commit that referenced this pull request Apr 25, 2024
Tentative changelog, but also planning to include
#16057

### Features

- [auto/{go,nodejs,python}] Add support for the continue-on-error
parameter of the up command to the Automation API
  [#15953](#15953)

- [engine] Add a --continue-on-error flag to pulumi up
  [#15740](#15740)


### Bug Fixes

- [sdk/nodejs] Fix a race condition that could cause the NodeJS runtime
to terminate before finishing all work
  [#16005](#16005)

- [sdk/python] Fix an exception when setting providers resource option
with a dict
  [#16022](#16022)

- [sdk/python] Fix event loop tracking in the python SDK when using
remote transforms
  [#16039](#16039)

- [sdk/python] Workaround lazy module loading regression
  [#16038](#16038)


### Miscellaneous

- [cli/plugin] Move PluginKind type definition into apitype and
re-export for backward compatibility
@justinvp justinvp mentioned this pull request Apr 25, 2024
github-merge-queue bot pushed a commit that referenced this pull request Apr 26, 2024
Will wait to merge this until after
#16057 merges

### Features

- [auto/{go,nodejs,python}] Add support for the continue-on-error
parameter of the up command to the Automation API
  [#15953](#15953)

- [engine] Add a --continue-on-error flag to pulumi up
  [#15740](#15740)


### Bug Fixes

- [sdk/nodejs] Fix a race condition that could cause the NodeJS runtime
to terminate before finishing all work
  [#16005](#16005)

- [sdk/python] Fix an exception when setting providers resource option
with a dict
  [#16022](#16022)

- [sdk/python] Fix event loop tracking in the python SDK when using
remote transforms
  [#16039](#16039)

- [sdk/python] Workaround lazy module loading regression
  [#16038](#16038)


### Miscellaneous

- [cli/plugin] Move PluginKind type definition into apitype and
re-export for backward compatibility
  [#15946](#15946)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a "continue on error" flag
3 participants