
Apply optimizations for handling of condition checks #32123

Merged
merged 7 commits into main from jbardin/noop-apply-optimizations on Nov 1, 2022

Conversation

jbardin
Member

@jbardin jbardin commented Oct 31, 2022

The handling of conditions during apply can have adverse effects on very large configs, because all existing instances are processed even when there are no changes. With extremely large configurations, the execution time can go from a few minutes to many hours for only minor changes.

The first step is to ensure empty CheckResults in the state are normalized to prevent the state from being re-written when there are no semantic differences. Empty and nil slices are not handled consistently overall, but we will use an empty slice in the state zero value to match the other top-level structures.
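As a rough illustration of what "normalized" means here (this is only a sketch with made-up names, not the actual state code):

    // Sketch only: CheckResult stands in for the real check result type in
    // the state package.
    type CheckResult struct {
        Status string
    }

    // normalizeCheckResults makes the zero value an empty slice rather than
    // nil, so a state with no check results serializes the same way every
    // time and does not trigger a re-write when nothing has changed.
    func normalizeCheckResults(results []CheckResult) []CheckResult {
        if len(results) == 0 {
            return []CheckResult{}
        }
        return results
    }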

Next we can skip adding NoOp changes to the apply graph when there are no conditions at all to evaluate from the configuration. The nodes themselves can have significant processing overhead when assembling the graph, especially for steps like recalculating individual instance dependencies.
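Roughly, the guard keeps a NoOp change in the apply graph only when the resource's configuration declares at least one precondition or postcondition. A sketch of what that check looks like (the Config field and the lookup helpers here are illustrative assumptions, not necessarily the exact code):

    // Sketch only: report whether the configuration for this instance has any
    // condition blocks that would still need to be evaluated during apply.
    func (t *DiffTransformer) hasConfigConditions(addr addrs.AbsResourceInstance) bool {
        // assumes the transformer carries a reference to the root config
        cfg := t.Config.DescendentForInstance(addr.Module)
        if cfg == nil {
            return false
        }

        res := cfg.Module.ResourceByAddr(addr.Resource.Resource)
        if res == nil {
            return false
        }

        return len(res.Preconditions) > 0 || len(res.Postconditions) > 0
    }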

Finally we need to avoid re-planning each instance and re-writing the state for every NoOp change during apply. The whole process has gotten more convoluted with the addition of condition checks, as preconditions are done during the NodeAbstractResourceInstance.plan method, but postconditions are handled outside of the abstract instance in the NodeApplyableResourceInstance.managedResourceExecute method, and repetition data needs to be shared between all these. I pulled the repetition data out of apply to simplify the early return in the short-term, but there is definitely some more refactoring to be done in this area.
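Conceptually, the early return amounts to something like this (a simplified sketch, not the exact code):

    // Sketch only: once any conditions have been evaluated, a NoOp change
    // returns the existing object as-is instead of re-encoding it and
    // re-writing the entire state.
    if change.Action == plans.NoOp {
        return state, diags
    }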

Fixes #32060
Fixes #32071

@jbardin jbardin added the 1.3-backport label (If you add this label to a PR before merging, backport-assistant will open a new PR once merged) on Oct 31, 2022
@jbardin jbardin requested a review from a team October 31, 2022 15:05
Member

@apparentlymart apparentlymart left a comment


This makes sense to me as a tactical fix for the v1.3 series. Thanks for digging into this!

I left some notes inline about things that are more architectural, so I expect we'll ignore them for the purposes of merging this PR, but I'd like to consider them for follow-up improvements in the v1.4 branch, to hopefully leave a couple fewer potential hazards for future maintenance.

    func (n *NodeAbstractResourceInstance) apply(
        ctx EvalContext,
        state *states.ResourceInstanceObject,
        change *plans.ResourceInstanceChange,
        applyConfig *configs.Resource,
    -   createBeforeDestroy bool) (*states.ResourceInstanceObject, instances.RepetitionData, tfdiags.Diagnostics) {
    +   keyData instances.RepetitionData,
Member


Technically anything which has access to the EvalContext can get the "repetition data" like this:

    exp := ctx.InstanceExpander()
    keyData := exp.GetResourceInstanceRepetitionData(n.Addr())

In the original design for this "instance expander" thing it was intended to act as the source of record for all of this expansion-related information specifically so that we wouldn't need to explicitly pass all of this contextual data around.

I do feel in two minds about it because it feels like a violation of Law of Demeter, but we already widely use ctx as a big bag of shared "global" context, including in this very function where it uses ctx to do various other things that aren't strictly this function's business to do.

Member Author


Yes, I left this as close to the current code as possible. I did forget that the rep data was directly available, though, because the existing apply code already re-evaluated it anyway. Switching out that code, however, runs into a "no expansion has been registered" panic, so I think there's something else going on here too.

Member


Hmm yes I suppose the fact that we've been passing it around through arguments meant that there was not previously any need to actually register it at the source of record. Unfortunate but not unexpected.

Hopefully in a future change we can find a suitable central point to evaluate this just once and register it in the expander, so it's clearer exactly whose responsibility it is to evaluate the for_each or count expressions.

    @@ -69,7 +93,7 @@ func (t *DiffTransformer) Transform(g *Graph) error {
         // run any condition checks associated with the object, to
         // make sure that they still hold when considering the
         // results of other changes.
    -    update = true
    +    update = t.hasConfigConditions(addr)
Member


I feel weird about making the DiffTransformer specifically aware of the preconditions in the config; this feels like some action at a distance that's likely to trip us up under future maintenance, just as how not having no-op actions in the graph caused us to ship a subtly broken postconditions feature in the first place. 😖

This very direct approach does seem reasonable as a small, surgical fix for backporting into the v1.3 branch, but I wonder if there's a different way we could arrange this in the v1.4 branch (soon after this is merged) so that this will hopefully be less of a maintenance hazard.


One starting idea, though I've not thought deeply about it yet:

  • Add yet another optional interface that graph nodes can implement:

    type GraphNodeDuringApply interface {
        HasApplyTimeActions() bool
    }
  • In a new transformer, after we've done all the ones that insert new nodes, prune from the graph any node which implements this interface and whose HasApplyTimeActions() returns false.

  • If even inserting them is problematic, then we could optionally optimize this by having DiffTransformer pre-check for the interface itself and not insert the node in the first place. But my hope is that this wouldn't be necessary, because we could do the pruning early enough that it's done by the time we get into all of the expensive transformers that do edge analysis.

Then this "are there any conditions?" logic could live in GraphNodeApplyableResourceInstance (and similar), closer to the Execute function that it's acting as a gate for. It would presumably return false unless the action is something other than NoOp or there's at least one condition to check.

Since I'm explicitly not suggesting we do this refactoring as a blocker for this PR, happy to discuss this more in a more appropriate place after this is merged. 😀

Member Author


I tried a few different approaches here, and settled on this config reference only as a stop-gap. I think that since we do store the plan check results, the results of those could be included in the "diff" that we're handling here. I was running into problems with inconsistent handling of the results, however, because those have separate codepaths from normal changes, and this turned out to be a safer and more reliable failsafe for now. Since unknown checks from planning would also indicate "HasApplyTimeActions", it may be worth seeing if we can incorporate the checks more closely with the plan object so they are always in sync.

If there are no changes, then there is no reason to create an apply
graph since all objects are known. We do, however, need the walk to match
the expected state structure. This is probably only cleanup of empty
nested modules and outputs, but some investigation is needed before
making the full change.

For now we can store the checks from the plan directly into the new
state, since the apply walk overwrote the results we had already.

Ensure that empty check results are normalized in state serialization to
prevent unexpected state changes from being written.

Because there is no consistent empty, null and omit_empty usage for
state structs, there's no good way to create a test which will fail
for future additions.

Only add NoOp changes to the apply graph if they have conditions which
need to be evaluated.

We need to avoid re-writing the state for every NoOp apply. We may
still be evaluating the instance to account for any side-effects in the
condition checks, however the state of the instance has not changed.
Re-writing the state is a non-concurrent operation, which may require
encoding a fairly large instance state and re-serializing the entire
state blob, so it is best avoided if possible.
@jbardin jbardin force-pushed the jbardin/noop-apply-optimizations branch from 3bbf3d7 to efd7715 Compare November 1, 2022 20:18
@jbardin jbardin requested a review from a team as a code owner November 1, 2022 20:18
@jbardin jbardin changed the base branch from jbardin/apply-refresh-plan to main November 1, 2022 20:19
@jbardin jbardin removed the request for review from a team November 1, 2022 20:19
@jbardin
Member Author

jbardin commented Nov 1, 2022

The earlier branches were no longer needed for this patch, so I rebased this on main in order to merge it for the release.

@jbardin jbardin merged commit bb68075 into main Nov 1, 2022
@jbardin jbardin deleted the jbardin/noop-apply-optimizations branch November 1, 2022 20:32
@github-actions

github-actions bot commented Nov 1, 2022

Reminder for the merging maintainer: if this is a user-visible change, please update the changelog on the appropriate release branch.

@github-actions

github-actions bot commented Dec 2, 2022

I'm going to lock this pull request because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 2, 2022