
🌱 Source should retry to get informers until timeout expires #1678

Merged
merged 1 commit into kubernetes-sigs:master on Oct 4, 2021

Conversation

vincepri
Member

This changeset adds the ability for a Manager to not fail immediately if
a wait.Backoff parameter is given as RunnableRetryBackoff in Options.

Currently, if a runnable fails to run, the Start operation is never
retried, which can cause the manager and all of its webhooks to stop and
the deployment to go into CrashLoopBackOff. Given the eventual
consistency of controllers and managers cooperating with other
controllers or the api-server, allow some backoff by trying to start
runnables a number of times before giving up.

Signed-off-by: Vince Prignano vincepri@vmware.com
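
As a rough illustration of the behavior described above, a manager-side retry driven by a wait.Backoff could look like the sketch below. RunnableRetryBackoff is the option name this description proposes; the helper name and backoff values are hypothetical, not the final API.

```go
// Illustrative sketch only: retry a runnable's Start with the wait.Backoff
// that the description proposes to expose as Options.RunnableRetryBackoff.
// The helper name and backoff values are hypothetical.
package retryexample

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// Example backoff: five attempts starting at 100ms, doubling each time.
var exampleBackoff = wait.Backoff{Duration: 100 * time.Millisecond, Factor: 2.0, Steps: 5}

// startWithRetry keeps calling Start until it succeeds or the backoff is exhausted.
func startWithRetry(ctx context.Context, r manager.Runnable, backoff wait.Backoff) error {
	return wait.ExponentialBackoff(backoff, func() (bool, error) {
		if err := r.Start(ctx); err != nil {
			// Treat the failure as transient (e.g. caches not ready yet) and retry.
			return false, nil
		}
		return true, nil
	})
}
```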

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Sep 29, 2021
@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 29, 2021
@vincepri
Member Author

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 29, 2021
@alvaroaleman
Member

This seems like a bit of a niche use case. What is the advantage of this over implementing the retrying inside your runnable's Start()?
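
For context, the alternative suggested here, retrying inside the runnable's own Start(), could look roughly like the sketch below; the wrapper type and retry interval are illustrative, not part of controller-runtime.

```go
// Illustrative sketch: a custom Runnable whose Start retries its inner start
// logic itself, instead of relying on the manager to retry it.
package retryexample

import (
	"context"
	"time"
)

type selfRetryingRunnable struct {
	start    func(context.Context) error // the real start logic
	interval time.Duration
}

func (r *selfRetryingRunnable) Start(ctx context.Context) error {
	for {
		err := r.start(ctx)
		if err == nil {
			return nil
		}
		// Transient failure (e.g. informers not available yet): wait and retry,
		// unless the manager is shutting down.
		select {
		case <-ctx.Done():
			return err
		case <-time.After(r.interval):
		}
	}
}
```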

@vincepri
Member Author

@alvaroaleman This issue actually comes from the fact that when you have a manager that bundles caches, webhooks, and controllers, you might end up in a situation where the caches can't be populated yet because the api-server is still reading the new CRDs (new API versions, for example). The controller would then crash and take the webhooks down with it, which can cause reading or operating on objects to fail in a chain reaction.

The overall system is eventually consistent and this changeset introduces a way to optionally retry a runnable in case there are errors. If those errors persist, the manager should definitely crash at that point.

@alvaroaleman
Member

So the scenario you are describing is if the manager has a conversion webhook for a version the cache wants to read (because some controller uses it)? Shouldn't we simply start conversion webhooks first to fix that?

@vincepri
Member Author

vincepri commented Sep 29, 2021

So the scenario you are describing is if the manager has a conversion webhook for a version the cache wants to read (because some controller uses it)? Shouldn't we simply start conversion webhooks first to fix that?

Not just conversion, this applies to all webhooks. Yes, we do start webhooks first today. Consider this scenario:

  • There are two Managers, both carrying webhooks (any kind is fine).
  • Webhooks are registered as required.
  • Manager A watches and needs to know about some CRDs in Manager B.
  • Manager A, upon startup, creates a new cache against CRDs from Manager B.
  • Manager B starts up, but for some reason it can't yet read or discover its own CRDs (just installed, maybe the api-server is behind, etc).
  • Manager B crashes, along with its webhooks.
  • The API server, to populate its own caches, won't be able to reach Manager B's webhooks.
  • Any client asking for Manager B's CRDs fails.
  • Manager A won't be able to start its caches and it'll crash.

The problem becomes even more complicated if the two controllers watch each other's CRDs, in which case there is never enough time for the two controllers to stay up and running long enough to allow a proper cache start.

We've seen this problem in Cluster API during a version upgrade, which led us here. Happy to go over the use case a bit more if that helps.

@fabriziopandini
Member

lgtm for me
thanks for addressing this Vince!

@alvaroaleman
Member

Not just conversion, this applies to all webhooks

I don't think so, because mutating and validating webhooks only cover mutating calls, hence they cannot be in the path of a read-only request.

Manager A watches and needs to know about some CRDs in Manager B.
Manager B starts up, but for some reason it can't yet read or discover its own CRDs (just installed, maybe the api-server is behind, etc).

This should result in only a minimal delay, no? Establishing a CRD takes what, two seconds?

The problem becomes even more complicated if the two controllers watch each other's CRDs, in which case there is never enough time for the two controllers to stay up and running long enough to allow a proper cache start.

By "each others CRD" you mean a crd version that is provided through a conversion webhook in the other controller? Don't we give the cache two minutes to get up and ready?

@vincepri
Member Author

I don't think so, because mutating and validating webhooks only cover mutating calls, hence they cannot be in the path of a read-only request.

Fair, the error we've seen was mostly related to conversion: kubernetes-sigs/cluster-api#5327

This should result in only a minimal delay, no? Establishing a CRD takes what, two seconds?

It really depends; api-server load is definitely something that can affect it, but from what I've been able to test, changing the webhook service, for example, can take a while.

By "each others CRD" you mean a crd version that is provided through a conversion webhook in the other controller? Don't we give the cache two minutes to get up and ready?

No, in this case I mean two managers controlling two independent sets of CRDs but also watching each other's CRDs to act on changes.

@vincepri vincepri changed the title ✨ Manager should support retrying to start runnables with backoff 🌱 Source should retry to get informers until timeout expires Oct 4, 2021
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 4, 2021
Signed-off-by: Vince Prignano <vincepri@vmware.com>
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 4, 2021
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 4, 2021
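
The retitle above reflects the approach that was ultimately merged: rather than a generic runnable retry in the manager, the source retries fetching its informer from the cache until the start context (and with it the timeout) expires. Below is a minimal sketch of that pattern, with an assumed poll interval and a generic lookup function rather than the exact merged code.

```go
// Illustrative sketch of the merged idea: when a source starts, keep trying to
// get the informer from the cache until it succeeds or the context expires,
// instead of failing on the first attempt.
package retryexample

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	crcache "sigs.k8s.io/controller-runtime/pkg/cache"
)

// getInformerWithRetry polls the provided lookup function until it returns an
// informer or ctx is cancelled (e.g. by the manager's start timeout).
func getInformerWithRetry(ctx context.Context, get func(context.Context) (crcache.Informer, error)) (crcache.Informer, error) {
	var informer crcache.Informer
	err := wait.PollImmediateUntil(10*time.Millisecond, func() (bool, error) {
		i, err := get(ctx)
		if err != nil {
			// The kind may not be served yet (e.g. a CRD still being established):
			// retry on the next tick instead of failing the whole manager.
			return false, nil
		}
		informer = i
		return true, nil
	}, ctx.Done())
	return informer, err
}
```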
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alvaroaleman, vincepri

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [alvaroaleman,vincepri]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit b1efff6 into kubernetes-sigs:master Oct 4, 2021
@k8s-ci-robot k8s-ci-robot added this to the v0.10.x milestone Oct 4, 2021
@vincepri
Member Author

vincepri commented Oct 5, 2021

/cherrypick release-0.10

@k8s-infra-cherrypick-robot

@vincepri: new pull request created: #1682

In response to this:

/cherrypick release-0.10

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
