
✨ block move with annotation #8690

Merged: 1 commit merged into kubernetes-sigs:main on Aug 17, 2023

Conversation

@nojnhuh (Contributor) commented May 17, 2023:

What this PR does / why we need it:

This PR adds a new well-known clusterctl.cluster.x-k8s.io/block-move annotation that, when defined on any resource to be moved, blocks clusterctl move from recreating any resources in the target cluster. Infrastructure providers can set this annotation ahead of time to give themselves time to do any extra work in reaction to a cluster being paused.

More details on CAPZ's specific use case are here: #8473.

The expected flow is as follows:

  1. The infra controller sets the block-move annotation on any resource(s) during the first reconciliation.
  2. clusterctl move starts.
  3. clusterctl sets the Cluster's spec.paused to true and waits for the block-move annotation to no longer be defined on any resource to be moved.
  4. InfraCluster reconciliation is triggered by that change and the controller starts its work.
  5. The InfraCluster controller finishes its work and removes block-move.
  6. clusterctl move continues.

A POC implementation, which I've verified with one of the self-hosted e2e tests, can be found here: main...nojnhuh:cluster-api:wait-pause-docker
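For illustration, here is a minimal sketch of the provider-side half of this flow (steps 1 and 5 above), assuming a controller-runtime based infra provider. The helper names and the annotation value are illustrative, not the CAPZ implementation; only the annotation key comes from this PR.

```go
package example

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"
)

// BlockMoveAnnotation is the well-known annotation introduced by this PR.
const BlockMoveAnnotation = "clusterctl.cluster.x-k8s.io/block-move"

// setBlockMove marks an object so that a later `clusterctl move` waits for the
// provider before moving it. Only the annotation's presence matters; the value
// used here is arbitrary.
func setBlockMove(ctx context.Context, c client.Client, obj client.Object) error {
	annotations := obj.GetAnnotations()
	if annotations == nil {
		annotations = map[string]string{}
	}
	if _, ok := annotations[BlockMoveAnnotation]; ok {
		return nil // already blocking, nothing to do
	}
	annotations[BlockMoveAnnotation] = "true"
	obj.SetAnnotations(annotations)
	return c.Update(ctx, obj)
}

// clearBlockMove removes the annotation once the provider has finished its
// pause-related work, letting `clusterctl move` proceed.
func clearBlockMove(ctx context.Context, c client.Client, obj client.Object) error {
	annotations := obj.GetAnnotations()
	if _, ok := annotations[BlockMoveAnnotation]; !ok {
		return nil // not blocking, nothing to do
	}
	delete(annotations, BlockMoveAnnotation)
	obj.SetAnnotations(annotations)
	return c.Update(ctx, obj)
}
```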

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #8473

@k8s-ci-robot added the cncf-cla: yes, area/clusterctl, and size/L labels on May 17, 2023
@ykakarap (Contributor):
I will take a look at this this week :)

@fabriziopandini (Member):

I'm not sure the solution proposed here addresses #8473; I have summarized my understanding of the issue in #8473 (comment). If possible, let's continue the discussion on the issue first.

@nojnhuh (Contributor, Author) commented Jun 6, 2023:

Just updated this with something closer to what I think @fabriziopandini had in mind.

/retitle ✨ block move with annotation

@k8s-ci-robot k8s-ci-robot changed the title ✨ Add is-paused annotation ✨ block move with annotation Jun 6, 2023
@nojnhuh (Contributor, Author) commented Jun 12, 2023:

@ykakarap PTAL if you're still interested in going through this.

@fabriziopandini I added some more thoughts to the issue and updated this PR accordingly.

@nojnhuh (Contributor, Author) commented Jun 14, 2023:

I updated the PR description to match how the current changes work.

@nojnhuh (Contributor, Author) commented Jun 19, 2023:

(reminder to squash)
/hold

@k8s-ci-robot added the do-not-merge/hold label on Jun 19, 2023
@nojnhuh (Contributor, Author) commented Jun 19, 2023:

/retest

@fabriziopandini (Member) left a comment:

Sorry for taking so long to provide feedback.
Overall I think we are really near the finish line; a couple of suggestions from my side, but nothing really blocking.

Also, could you add the new annotation in https://cluster-api.sigs.k8s.io/reference/labels_and_annotations.html and a note in https://cluster-api.sigs.k8s.io/clusterctl/provider-contract.html#move? It might also be worth surfacing the new feature to other providers in https://main.cluster-api.sigs.k8s.io/developer/providers/migrations/v1.4-to-v1.5.html.

cmd/clusterctl/api/v1alpha3/annotations.go (outdated)
cmd/clusterctl/client/cluster/mover.go
}
key := client.ObjectKeyFromObject(obj)

if err := retryWithExponentialBackoff(backoff, func() error {
Member:
As implemented, the backoff applies to every CR, so assuming we are moving 100 objects this could result in 100 * 5s, which is a lot. Is that the expected behavior?

Contributor Author:

That's what I had in mind for the time being. I generally wouldn't expect this to ever wait for N resources * the timeout for one resource, since while clusterctl is waiting for one resource to be paused, the provider will continue doing work out-of-band to pause that and other resources in parallel. Overall, I think the total time this waits will trend toward the longest wait for an individual resource in that case.

That same parallelism could be applied to the waits here as well, but that didn't seem worth the additional complexity under the assumption that resources are going to be paused in parallel.

Member:

I got your point, but it is not ideal because ultimately the time users have to wait can vary significantly. However, I can be OK with this approach for the first iteration if we can get a better UX as suggested in other comments.

TL;DR: if (and only if) there is a block on an object, we should log "Waiting for the block to be removed on ..." (or something similar) before starting the wait loop, so that users are aware of where the move process is blocked.

Contributor Author:

Based on my read of the code, it looks like the "resource is not ready to move" error returned here will get logged here, which happens before any time.Sleep:

log.V(5).Info("Retrying with backoff", "Cause", err.Error())

Does that cover what you describe?

Member:

Not really. That is the debug log for every retry and it is very noisy; it will also be printed for every object, not only the objects with an actual block-move.

So it does not comply with:

TL;DR: if (and only if) there is a block on an object, we should log "Waiting for the block to be removed on ..." (or something similar) before starting the wait loop, so that users are aware of where the move process is blocked.

It just provides debug/trace details on every step of the wait loop once you are already in it.

Contributor Author:

So you'd like to see a single log message indicating that at least one object is blocking the move?

Member:

- obj1, no block-move annotation --> no message at all (and possibly skip the wait loop entirely)
- obj2, block-move annotation --> before starting the wait loop: "Waiting for the block to be removed on ..."
- obj3-50, no block-move annotation --> no message at all (and possibly skip the wait loop entirely)
- obj51, block-move annotation --> before starting the wait loop: "Waiting for the block to be removed on ..."
- and so on

NOTE: this is consistent with most of the logs in clusterctl, where we inform the user about the next operation before starting it. That is even more important in this case, because otherwise the system will seem to be stuck waiting. The caveat is that, according to the spec, we should wait only for objects with the annotation, not for all the objects (this will be much easier if you collect "wait for move" during discovery, as I have suggested in another comment on the same topic).

Contributor Author:

That makes sense, I've updated this to do that. FWIW we need to enter the wait loop no matter what to determine if the annotation exists, but the loop is exited as soon as the annotation doesn't exist (which may be in the first iteration). You already suggested gathering whether the resource is blocked during object discovery which could make this short-circuit earlier, but I'd like to do that as a follow-up if that's still acceptable.
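To make the shape of that loop concrete, here is a minimal sketch of the behavior described above. This is not the actual mover code; it assumes client-go's retry.OnError helper, and the function name is illustrative.

```go
package example

import (
	"context"
	"fmt"
	"log"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/retry"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

const blockMoveAnnotation = "clusterctl.cluster.x-k8s.io/block-move"

// waitForBlockMoveRemoved enters the wait loop for every object (the annotation
// has to be read at least once), exits on the first iteration when the
// annotation is absent, and logs once, and only for objects that actually
// block the move.
func waitForBlockMoveRemoved(ctx context.Context, c client.Client, obj client.Object, backoff wait.Backoff) error {
	key := client.ObjectKeyFromObject(obj)
	logged := false
	return retry.OnError(backoff, func(error) bool { return true }, func() error {
		if err := c.Get(ctx, key, obj); err != nil {
			return err
		}
		if _, blocked := obj.GetAnnotations()[blockMoveAnnotation]; !blocked {
			return nil // no block-move annotation: nothing to wait for
		}
		if !logged {
			log.Printf("Waiting for block-move to be removed on %s", key)
			logged = true
		}
		return fmt.Errorf("resource %s is not ready to move", key)
	})
}
```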

Contributor:

I'd like to do that as a follow-up if that's still acceptable

this would also be good to track in an issue

Contributor Author:

Opened #9196

cmd/clusterctl/client/cluster/mover.go (outdated)
@@ -595,6 +610,49 @@ func setClusterClassPause(proxy Proxy, clusterclasses []*node, pause bool, dryRu
return nil
}

func waitReadyForMove(proxy Proxy, nodes []*node, dryRun bool, backoff wait.Backoff) error {
Member:

It would be great if we could collect "wait for move" during discovery; then we can make this a no-op for most of the objects and implement a better UX for objects that are actually blocking the move, by printing "Waiting for ... to unblock move" before starting to wait.

Contributor Author:

Yeah that seems better. I can take a look at that as a follow-up.

@k8s-ci-robot added and then removed the needs-rebase label on Jul 3, 2023
@nojnhuh (Contributor, Author) commented Jul 7, 2023:

@fabriziopandini Do you have any other thoughts on this?

@fabriziopandini (Member) commented Jul 10, 2023:

@nojnhuh if I'm not wrong this point is still open

I got your point, but it is not ideal because ultimately the time users have to wait can vary significantly. However, I can be OK with this approach for the first iteration if we can get a better UX as suggested in other comments.

TL;DR: if (and only if) there is a block on an object, we should log "Waiting for the block to be removed on ..." (or something similar) before starting the wait loop, so that users are aware of where the move process is blocked.

What I see in the code is instead the opposite: the operation starts to wait without giving any info to users (so the user is blind to what is happening), and only when the wait is finished does it inform the user that it has finished waiting for an object.

@k8s-ci-robot added and then removed the needs-rebase label on Aug 1, 2023
@nojnhuh (Contributor, Author) commented Aug 2, 2023:

@nojnhuh if I'm not wrong this point is still open

Responded in the thread. I think that was the last piece of outstanding feedback.

@CecileRobertMichon (Contributor):
/cc @ykakarap @Jont828

Are you both able to review this to help move it forward?

@k8s-ci-robot removed the lgtm label on Aug 15, 2023
@nojnhuh (Contributor, Author) commented Aug 15, 2023:

Need to squash again
/hold

@k8s-ci-robot added the do-not-merge/hold label on Aug 15, 2023
@CecileRobertMichon (Contributor) left a comment:

/lgtm
/assign @fabriziopandini

@k8s-ci-robot added the lgtm label on Aug 15, 2023
@k8s-ci-robot (Contributor):
LGTM label has been added.

Git tree hash: 1c1addfcd6f634bd5dd89787829658463bc4da08

@Jont828 (Contributor) commented Aug 16, 2023:

/lgtm

@@ -333,6 +335,19 @@ func (o *objectMover) move(graph *objectGraph, toProxy Proxy, mutators ...Resour
return errors.Wrap(err, "error pausing ClusterClasses")
}

log.Info("Waiting for all resources to be ready to move")
// exponential backoff configuration which returns durations for a total time of ~2m.
// Example: 0, 5s, 8s, 11s, 17s, 26s, 38s, 57s, 86s, 128s
Contributor:

Just curious, are these the actual numbers it would use or did you pull this from somewhere?

Contributor Author:

The jitter makes this inherently random to some degree, but I threw together something to simulate what these numbers could be and this is one example.
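For reference, here is a small standalone sketch of how such a sequence can be simulated with apimachinery's wait.Backoff. The parameter values below are assumptions for illustration, not necessarily the ones used in this PR.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	backoff := wait.Backoff{
		Duration: 5 * time.Second, // first non-zero wait
		Factor:   1.5,             // multiplier applied after each step
		Jitter:   0.1,             // jitter makes every run slightly different
		Steps:    10,
	}
	total := time.Duration(0)
	steps := backoff.Steps // Step() mutates backoff.Steps, so snapshot the count
	for i := 0; i < steps; i++ {
		d := backoff.Step()
		total += d
		fmt.Printf("step %d: wait %v (elapsed %v)\n", i, d.Round(time.Second), total.Round(time.Second))
	}
}
```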

docs/book/src/reference/labels_and_annotations.md (outdated)
@k8s-ci-robot removed the lgtm label on Aug 16, 2023
@fabriziopandini (Member):
/test pull-cluster-api-e2e-full-main

@fabriziopandini (Member):
/lgtm
/approve

@nojnhuh it would be great to get #9196 implemented before 1.6.0 is cut, because the current implementation is adding an unnecessary extra get for all the objects + the user can wait for x times the timeout for waitMove.

@k8s-ci-robot added the lgtm label on Aug 17, 2023
@k8s-ci-robot (Contributor):
LGTM label has been added.

Git tree hash: ef774513a8b0bd56b6b039d4ad1163fd56f6abf5

@k8s-ci-robot (Contributor):
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fabriziopandini

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label on Aug 17, 2023
@nojnhuh (Contributor, Author) commented Aug 17, 2023:

@nojnhuh it would be great to get #9196 implemented before 1.6.0 is cut, because the current implementation is adding an unnecessary extra get for all the objects + the user can wait for x times the timeout for waitMove.

Sounds good, I'll get going on that.

Also squashed:
/hold cancel

@k8s-ci-robot removed the do-not-merge/hold label on Aug 17, 2023
@k8s-ci-robot merged commit 390549a into kubernetes-sigs:main on Aug 17, 2023 (18 of 20 checks passed)
@k8s-ci-robot added this to the v1.6 milestone on Aug 17, 2023
@nojnhuh deleted the wait-pause branch on August 17, 2023