
🌱 Source should retry to get informers until timeout expires #1678

Merged
merged 1 commit into kubernetes-sigs:master on Oct 4, 2021

Conversation

vincepri
Member

This changeset adds the ability for a Manager to not fail immediately if
a wait.Backoff parameter is given as RunnableRetryBackoff in Options.

Currently, if a runnable fails to run, the Start operation is never
retried, which can cause the manager and all of its webhooks to stop and
the deployment to go into CrashLoopBackOff. Given the eventual
consistency of controllers and managers cooperating with other
controllers or the api-server, allow some backoff by trying to start
runnables a number of times before giving up.

Signed-off-by: Vince Prignano vincepri@vmware.com
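
As a rough illustration of the behavior described above, a manager-side retry driven by a wait.Backoff could look like the sketch below. RunnableRetryBackoff is the option name this description proposes; the helper name and backoff values are hypothetical, not the final API.

```go
// Illustrative sketch only: retry a runnable's Start with the wait.Backoff
// that the description proposes to expose as Options.RunnableRetryBackoff.
// The helper name and backoff values are hypothetical.
package retryexample

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// Example backoff: five attempts starting at 100ms, doubling each time.
var exampleBackoff = wait.Backoff{Duration: 100 * time.Millisecond, Factor: 2.0, Steps: 5}

// startWithRetry keeps calling Start until it succeeds or the backoff is exhausted.
func startWithRetry(ctx context.Context, r manager.Runnable, backoff wait.Backoff) error {
	return wait.ExponentialBackoff(backoff, func() (bool, error) {
		if err := r.Start(ctx); err != nil {
			// Treat the failure as transient (e.g. caches not ready yet) and retry.
			return false, nil
		}
		return true, nil
	})
}
```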

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Sep 29, 2021
@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 29, 2021
@vincepri
Member Author

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 29, 2021
@alvaroaleman
Member

This seems like a bit of a niche use case. What is the advantage of this over implementing the retrying inside your runnable's Start()?
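
For context, the alternative suggested here, retrying inside the runnable's own Start(), could look roughly like the sketch below; the wrapper type and retry interval are illustrative, not part of controller-runtime.

```go
// Illustrative sketch: a custom Runnable whose Start retries its inner start
// logic itself, instead of relying on the manager to retry it.
package retryexample

import (
	"context"
	"time"
)

type selfRetryingRunnable struct {
	start    func(context.Context) error // the real start logic
	interval time.Duration
}

func (r *selfRetryingRunnable) Start(ctx context.Context) error {
	for {
		err := r.start(ctx)
		if err == nil {
			return nil
		}
		// Transient failure (e.g. informers not available yet): wait and retry,
		// unless the manager is shutting down.
		select {
		case <-ctx.Done():
			return err
		case <-time.After(r.interval):
		}
	}
}
```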

@vincepri
Member Author

@alvaroaleman This issue actually comes from the fact that when you have a manager that bundles caches, webhooks, and controllers, you might end up in a situation where the caches can't be populated yet because the api-server is still reading the new CRDs (new API versions, for example). The controller would then crash and take the webhooks down with it, which can cause reading or operating on objects to fail in a chain reaction.

The overall system is eventually consistent and this changeset introduces a way to optionally retry a runnable in case there are errors. If those errors persist, the manager should definitely crash at that point.

@alvaroaleman
Member

So the scenario you are describing is if the manager has a conversion webhook for a version the cache wants to read (because some controller uses it)? Shouldn't we simply start conversion webhooks first to fix that?

@vincepri
Member Author

vincepri commented Sep 29, 2021

So the scenario you are describing is if the manager has a conversion webhook for a version the cache wants to read (because some controller uses it)? Shouldn't we simply start conversion webhooks first to fix that?

Not just conversion, this applies to all webhooks. Yes, we do start webhooks first today. Consider this scenario:

  • There are two Managers, both carrying webhooks (any kind is fine).
  • Webhooks are registered as required.
  • Manager A watches and needs to know about some CRDs in Manager B.
  • Manager A, upon startup, creates a new cache against CRDs from Manager B.
  • Manager B starts up, but for some reason it can't yet read or discover its own CRDs (just installed, maybe the api-server is behind, etc).
  • Manager B crashes, along with its webhooks.
  • The API server, to populate its own caches, won't be able to reach Manager B's webhooks.
  • Any client asking for Manager B's CRDs fails.
  • Manager A won't be able to start its caches and it'll crash.

The problem becomes even more complicated if the two controllers watch each other's CRDs, in which case there is never enough time for the two controllers to stay up and running long enough to allow a proper cache start.

We've seen this problem in Cluster API during a version upgrade, which led us here. Happy to go over the use case a bit more if that helps.

@fabriziopandini
Member

lgtm for me
thanks for addressing this Vince!

@alvaroaleman
Member

Not just conversion, this applies to all webhooks

I don't think so, because mutating and validating webhooks only cover mutating calls, hence they cannot be in the path of a read-only request.

Manager A watches and needs to know about some CRDs in Manager B.
Manager B starts up, but for some reason it can't yet read or discover its own CRDs (just installed, maybe the api-server is behind, etc).

This should result in only a minimal delay, no? Establishing a CRD takes what, two seconds?

The problem becomes even more complicated if the two controllers watch each other's CRDs, in which case there is never enough time for the two controllers to stay up and running long enough to allow a proper cache start.

By "each others CRD" you mean a crd version that is provided through a conversion webhook in the other controller? Don't we give the cache two minutes to get up and ready?

@vincepri
Member Author

I don't think so, because mutating and validating webhooks only cover mutating calls, hence they cannot be in the path of a read-only request.

Fair, the error we've seen was mostly related to conversion: kubernetes-sigs/cluster-api#5327

This should result in only a minimal delay, no? Establishing a CRD takes what, two seconds?

It really depends; api-server load is definitely something that can affect it, but from what I've been able to test, changing the webhook service, for example, can take a while.

By "each others CRD" you mean a crd version that is provided through a conversion webhook in the other controller? Don't we give the cache two minutes to get up and ready?

No, in this case I mean two managers controlling two independent sets of CRDs but also watching each other's CRDs to act on changes.

@vincepri vincepri changed the title ✨ Manager should support retrying to start runnables with backoff 🌱 Source should retry to get informers until timeout expires Oct 4, 2021
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 4, 2021
Signed-off-by: Vince Prignano <vincepri@vmware.com>
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 4, 2021
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 4, 2021
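
The retitle above reflects the approach that was ultimately merged: rather than a generic runnable retry in the manager, the source retries fetching its informer from the cache until the start context (and with it the timeout) expires. Below is a minimal sketch of that pattern, with an assumed poll interval and a generic lookup function rather than the exact merged code.

```go
// Illustrative sketch of the merged idea: when a source starts, keep trying to
// get the informer from the cache until it succeeds or the context expires,
// instead of failing on the first attempt.
package retryexample

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	crcache "sigs.k8s.io/controller-runtime/pkg/cache"
)

// getInformerWithRetry polls the provided lookup function until it returns an
// informer or ctx is cancelled (e.g. by the manager's start timeout).
func getInformerWithRetry(ctx context.Context, get func(context.Context) (crcache.Informer, error)) (crcache.Informer, error) {
	var informer crcache.Informer
	err := wait.PollImmediateUntil(10*time.Millisecond, func() (bool, error) {
		i, err := get(ctx)
		if err != nil {
			// The kind may not be served yet (e.g. a CRD still being established):
			// retry on the next tick instead of failing the whole manager.
			return false, nil
		}
		informer = i
		return true, nil
	}, ctx.Done())
	return informer, err
}
```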
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alvaroaleman, vincepri

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [alvaroaleman,vincepri]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit b1efff6 into kubernetes-sigs:master Oct 4, 2021
@k8s-ci-robot k8s-ci-robot added this to the v0.10.x milestone Oct 4, 2021
@vincepri
Member Author

vincepri commented Oct 5, 2021

/cherrypick release-0.10

@k8s-infra-cherrypick-robot

@vincepri: new pull request created: #1682

In response to this:

/cherrypick release-0.10

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
