New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Easier way to remove/clear pending operations #4265
Comments
We should also seriously reconsider whether we should block updates on "pending operations" at all. We could choose to warn instead of error - allowing updates to continue. By far the most common case for "pending operations" is a leaked resource creation - which does not actually put a stack in a "stuck" state - it just suggests there may be an out-of-band cleanup step needed. |
My concrete suggestion:
|
We could be a bit more precise here and only warn on pending creates. We could continue to error on pending deletes and updates. We should also fix |
For more context, |
Good points - updated suggestion:
|
The comments in this thread almost speak as if this is an unresolved IaC problem. I'm confused as to why there doesn't seem to be a mirror issue in terraform, where I have literally never encountered issues interrupting an apply before. What do they do and why isn't it applicable here? |
AFAIK, Terraform does not use a model its state as a journal. Instead, their approach relies on refreshing prior to each update in order to detect the current state of resources that might have been interrupted while updating. That approach leaves out the possibility of warning on interrupted creates, etc., and imposes the overhead of refreshing by default. |
We ran an automated system for around a year that regularly interrupted terraform in the middle of resource creation several times a day, and there were no problems with interrupted creates/dangling resources. There was no need to warn in these cases, because terraform always made sure next time to determine what happened to the resource being created from the last apply. Part of the reason we went with terraform for creating this system was its robustness in this respect; no matter what bugs we introduced, we could always make sure that when the dust settled, what was in the terraform state accurately represented what was deployed. As far as I can tell, pulumi's chosen decision to make this event a warning instead of fixing via a refresh doesn't lead to any increase in safety; in fact it's increasing the room for error, because it's relying on (sometimes nonhuman) operators to remove potentially very expensive resources instead of forget about them. We very much enjoy the speed, but since pulumi does have a mostly equivalent refresh function, it seems to me like it should be able to automatically get() and import schrodinger's resources like terraform does. It doesn't need to do so every time, but in the case of an interrupted apply, someone is going to perform this refresh - pulumi or me. Without this feature, the automation API becomes very tricky to run in any kind of real production environment where containers/processes are expected to be killable at all times. And that previously desirable property of terraform where we could always |
This happens often enough to warrant commenting on this that one has to create an out-of-band job that basically does all the above described: 1.) Export project and stack file. As hinted above, a nice flag like --ignore-pending-operations could (and should) do all of this internally for us. Maybe have it depend on the --refresh flag even, since during an interrupted run we generally would want to trust the live state of the provider. |
I'm totally fine with that overhead of requiring |
My current workaround is to run this each time I run into the issue. I created a bash alias/function to make it easy to run. |
Just what it says on the tin. This is implemented by moving the check for pending operations in the last statefile into the deployment executor and making it conditional on whether or not a refresh is being performed (either via `pulumi refresh` or `pulumi up -r`). Because pending operations are not carried over from the base statefile, this has the effect of clearing pending operations if a refresh is performed. Fixes #4265.
Just what it says on the tin. This is implemented by moving the check for pending operations in the last statefile into the deployment executor and making it conditional on whether or not a refresh is being performed (either via `pulumi refresh` or `pulumi up -r`). Because pending operations are not carried over from the base statefile, this has the effect of clearing pending operations if a refresh is performed. Fixes #4265.
Given that there is still manual work to be done for create operations, shouldn't this remain open then @Zaid-Ajaj ? |
Hi @nesl247, the reason manual work is still required for pending CREATE operations is because we need the users to explicitly decide what to do with them. If pulumi just clears those, you might end up with live cloud resources that are not tracked by pulumi (orphaned resources) and it's better if the users see this and manually check these live resources before clearing them from the pulumi checkpoint files |
Understood, but it still requires more manual work than should be required, therefore this issue should remain open IMO. I would expect Pulumi to try to refresh to see if the resource was actually created, if it can't get an answer, then to ask the user if they want to remove it from the stack. There should not be any requirement to manually edit the stack. |
We can't. The big problem with pending Creates is that we don't have an ID to lookup because the Create call never returned to give us one, so we can't look up to see if it was created or not. Having said that there is scope for us at least to provide easier to use commands to manually resolve pending creates as either "did not create" or "created with id ", and maybe we can come up with something better. I'll leave this open to track that. |
Regarding pending creates, is it too much scope creep to use fuzzy name matching and create times when there is no ID? I realize this probably involves listing and/or searching resources from their provider, but this functionality could also be useful for auto-importing during the Pulumi onboarding process. A UI or CLI could allow users of Pulumi to more easily import their Cloud resources. I looked in the docs and while I saw |
It's a really nice idea, but yes it's probably massive scope creep given we'd also need to go and update all the providers to support this as well. I was thinking of adding commands that we would describe in the warning text, letting you do things like: Not 100% sure that's the right set of commands, but seems like a good starting point to come up with something to help with this last bit of pending-operations pain. Eventually we might be able to do the list idea as well, but that is hard. |
edit: just saw that there's a PR #9778 with a lot of feedback |
The resolution we agreed upon is different than what is contained in this thread. For a full description of the resolution, see closing pr: #10394. |
Today I am testing a bunch with Pulumi which includes some resources not coming up correctly or even me just canceling it to change some code to restart it.
Currently I have to export, delete the pending operations, then import the stack almost each time which is pretty tedious.
Could a stack command be added to just automatically clear pending operations that way it is more user friendly?
The text was updated successfully, but these errors were encountered: