Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Easier way to remove/clear pending operations #4265

Closed
mitchellmaler opened this issue Apr 1, 2020 · 28 comments · Fixed by #9293 or #10394
Closed

Easier way to remove/clear pending operations #4265

mitchellmaler opened this issue Apr 1, 2020 · 28 comments · Fixed by #9293 or #10394
Assignees
Labels
area/cli UX of using the CLI (args, output, logs) kind/enhancement Improvements or new features resolution/fixed This issue was fixed
Milestone

Comments

@mitchellmaler
Copy link

Today I am testing a bunch with Pulumi which includes some resources not coming up correctly or even me just canceling it to change some code to restart it.
Currently I have to export, delete the pending operations, then import the stack almost each time which is pretty tedious.
Could a stack command be added to just automatically clear pending operations that way it is more user friendly?

@lukehoban
Copy link
Member

We should also seriously reconsider whether we should block updates on "pending operations" at all. We could choose to warn instead of error - allowing updates to continue. By far the most common case for "pending operations" is a leaked resource creation - which does not actually put a stack in a "stuck" state - it just suggests there may be an out-of-band cleanup step needed.

@lukehoban
Copy link
Member

My concrete suggestion:

  1. Make the current error a warning - and clean up the text of the message
  2. Add a pulumi state clear-pending-operations that takes care of removing these pending operations (and mention it in the warning)

@pgavlin
Copy link
Member

pgavlin commented Nov 3, 2020

We could choose to warn instead of error - allowing updates to continue. By far the most common case for "pending operations" is a leaked resource creation

We could be a bit more precise here and only warn on pending creates. We could continue to error on pending deletes and updates.

We should also fix refresh s.t. it clears pending operations: it was always the intent that that command would resolve these operations.

@pgavlin
Copy link
Member

pgavlin commented Nov 3, 2020

For more context, refresh was the intended solution because the problem with pending operations is that we don't know the actual state of the resource being changed. In the case of updates and deletes, refresh can give us the actual state and resolve the pending operation.

@lukehoban
Copy link
Member

Good points - updated suggestion:

  1. Make the current error a warning
  2. Make pulumi refresh report a warning on pending creates, and clear out pending_operations.
  3. Simplify the warning in (1) to (a) point users to run pulumi refresh (b) be simpler and more specific.

@hugbubby
Copy link

hugbubby commented Feb 24, 2021

The comments in this thread almost speak as if this is an unresolved IaC problem. I'm confused as to why there doesn't seem to be a mirror issue in terraform, where I have literally never encountered issues interrupting an apply before. What do they do and why isn't it applicable here?

@pgavlin
Copy link
Member

pgavlin commented Feb 24, 2021

I'm confused as to why there doesn't seem to be a mirror issue in terraform, where I have literally never encountered issues interrupting an apply before. What do they do and why isn't it applicable here?

AFAIK, Terraform does not use a model its state as a journal. Instead, their approach relies on refreshing prior to each update in order to detect the current state of resources that might have been interrupted while updating. That approach leaves out the possibility of warning on interrupted creates, etc., and imposes the overhead of refreshing by default.

@hugbubby
Copy link

hugbubby commented Apr 9, 2021

@pgavlin

I'm confused as to why there doesn't seem to be a mirror issue in terraform, where I have literally never encountered issues interrupting an apply before. What do they do and why isn't it applicable here?

AFAIK, Terraform does not use a model its state as a journal. Instead, their approach relies on refreshing prior to each update in order to detect the current state of resources that might have been interrupted while updating. That approach leaves out the possibility of warning on interrupted creates, etc., and imposes the overhead of refreshing by default.

We ran an automated system for around a year that regularly interrupted terraform in the middle of resource creation several times a day, and there were no problems with interrupted creates/dangling resources. There was no need to warn in these cases, because terraform always made sure next time to determine what happened to the resource being created from the last apply. Part of the reason we went with terraform for creating this system was its robustness in this respect; no matter what bugs we introduced, we could always make sure that when the dust settled, what was in the terraform state accurately represented what was deployed. As far as I can tell, pulumi's chosen decision to make this event a warning instead of fixing via a refresh doesn't lead to any increase in safety; in fact it's increasing the room for error, because it's relying on (sometimes nonhuman) operators to remove potentially very expensive resources instead of forget about them.

We very much enjoy the speed, but since pulumi does have a mostly equivalent refresh function, it seems to me like it should be able to automatically get() and import schrodinger's resources like terraform does. It doesn't need to do so every time, but in the case of an interrupted apply, someone is going to perform this refresh - pulumi or me. Without this feature, the automation API becomes very tricky to run in any kind of real production environment where containers/processes are expected to be killable at all times. And that previously desirable property of terraform where we could always terraform destroy to clean up all of our resources is pretty much gone.

@farvour
Copy link

farvour commented Nov 16, 2021

This happens often enough to warrant commenting on this that one has to create an out-of-band job that basically does all the above described:

1.) Export project and stack file.
2.) Use jq or something to wipe pending_operations
3.) Re-import state back into stack.

As hinted above, a nice flag like --ignore-pending-operations could (and should) do all of this internally for us. Maybe have it depend on the --refresh flag even, since during an interrupted run we generally would want to trust the live state of the provider.

@farvour
Copy link

farvour commented Nov 16, 2021

I'm confused as to why there doesn't seem to be a mirror issue in terraform, where I have literally never encountered issues interrupting an apply before. What do they do and why isn't it applicable here?

AFAIK, Terraform does not use a model its state as a journal. Instead, their approach relies on refreshing prior to each update in order to detect the current state of resources that might have been interrupted while updating. That approach leaves out the possibility of warning on interrupted creates, etc., and imposes the overhead of refreshing by default.

I'm totally fine with that overhead of requiring --refresh if another flag to purge pending ops is included in the CLI. We do that anyway for production deployment pipelines because it's just a Good Idea anyway.. never solely trust the state file IMO.

@mitchellmaler
Copy link
Author

My current workaround is to run this each time I run into the issue. I created a bash alias/function to make it easy to run.
pulumi stack export | jq '.deployment.pending_operations = []' | pulumi stack import --

pgavlin added a commit that referenced this issue Nov 16, 2021
Just what it says on the tin. This is implemented by moving the check
for pending operations in the last statefile into the deployment
executor and making it conditional on whether or not a refresh is being
performed (either via `pulumi refresh` or `pulumi up -r`). Because
pending operations are not carried over from the base statefile, this
has the effect of clearing pending operations if a refresh is performed.

Fixes #4265.
@emiliza emiliza added this to the 0.65 milestone Nov 16, 2021
@emiliza emiliza added capacity/community-customer size/S Estimated effort to complete (1-2 days). labels Nov 16, 2021
pgavlin added a commit that referenced this issue Nov 18, 2021
Just what it says on the tin. This is implemented by moving the check
for pending operations in the last statefile into the deployment
executor and making it conditional on whether or not a refresh is being
performed (either via `pulumi refresh` or `pulumi up -r`). Because
pending operations are not carried over from the base statefile, this
has the effect of clearing pending operations if a refresh is performed.

Fixes #4265.
@emiliza emiliza modified the milestones: 0.65, 0.66 Dec 6, 2021
@leezen leezen modified the milestones: 0.66, 0.67 Jan 5, 2022
@nesl247
Copy link
Contributor

nesl247 commented Apr 6, 2022

Given that there is still manual work to be done for create operations, shouldn't this remain open then @Zaid-Ajaj ?

@Zaid-Ajaj
Copy link
Contributor

Given that there is still manual work to be done for create operations, shouldn't this remain open then?

Hi @nesl247, the reason manual work is still required for pending CREATE operations is because we need the users to explicitly decide what to do with them. If pulumi just clears those, you might end up with live cloud resources that are not tracked by pulumi (orphaned resources) and it's better if the users see this and manually check these live resources before clearing them from the pulumi checkpoint files

@nesl247
Copy link
Contributor

nesl247 commented Apr 6, 2022

Given that there is still manual work to be done for create operations, shouldn't this remain open then?

Hi @nesl247, the reason manual work is still required for pending CREATE operations is because we need the users to explicitly decide what to do with them. If pulumi just clears those, you might end up with live cloud resources that are not tracked by pulumi (orphaned resources) and it's better if the users see this and manually check these live resources before clearing them from the pulumi checkpoint files

Understood, but it still requires more manual work than should be required, therefore this issue should remain open IMO. I would expect Pulumi to try to refresh to see if the resource was actually created, if it can't get an answer, then to ask the user if they want to remove it from the stack. There should not be any requirement to manually edit the stack.

@Frassle
Copy link
Member

Frassle commented Apr 6, 2022

I would expect Pulumi to try to refresh to see if the resource was actually created

We can't. The big problem with pending Creates is that we don't have an ID to lookup because the Create call never returned to give us one, so we can't look up to see if it was created or not.

Having said that there is scope for us at least to provide easier to use commands to manually resolve pending creates as either "did not create" or "created with id ", and maybe we can come up with something better. I'll leave this open to track that.

@Frassle Frassle reopened this Apr 6, 2022
@hcharley
Copy link

hcharley commented Apr 8, 2022

Regarding pending creates, is it too much scope creep to use fuzzy name matching and create times when there is no ID?

I realize this probably involves listing and/or searching resources from their provider, but this functionality could also be useful for auto-importing during the Pulumi onboarding process.

A UI or CLI could allow users of Pulumi to more easily import their Cloud resources.

I looked in the docs and while I saw get function, I didn't see a list or similar function, so I assume this functionality doesn't exist and would be quite an undertaking given the number of resources available. So totally understand if this is scope creep, especially in regards to this ticket.

@Frassle
Copy link
Member

Frassle commented Apr 11, 2022

Regarding pending creates, is it too much scope creep to use fuzzy name matching and create times when there is no ID?

It's a really nice idea, but yes it's probably massive scope creep given we'd also need to go and update all the providers to support this as well.

I was thinking of adding commands that we would describe in the warning text, letting you do things like:
pulumi state resolve-create <urn> <id> to resolve the create as successful to the given ID
pulumi state clear-create <urn1> <urn2> to clear the pending creates as unsuccessful.
pulumi state clear-create --all to clear all pending creates.

Not 100% sure that's the right set of commands, but seems like a good starting point to come up with something to help with this last bit of pending-operations pain.

Eventually we might be able to do the list idea as well, but that is hard.

@mikhailshilkov mikhailshilkov removed this from the 0.69 milestone May 16, 2022
@mikhailshilkov mikhailshilkov added area/cli UX of using the CLI (args, output, logs) and removed resolution/fixed This issue was fixed labels May 31, 2022
@iwahbe iwahbe self-assigned this Jun 2, 2022
@mikhailshilkov mikhailshilkov added this to the 0.74 milestone Jun 7, 2022
@mikhailshilkov mikhailshilkov modified the milestones: 0.74, 0.75 Jul 1, 2022
@dixler
Copy link
Contributor

dixler commented Jul 7, 2022

pulumi state delete --pending might be a more idiomatic command

edit: just saw that there's a PR #9778 with a lot of feedback

@iwahbe
Copy link
Member

iwahbe commented Sep 1, 2022

The resolution we agreed upon is different than what is contained in this thread. For a full description of the resolution, see closing pr: #10394.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment