Don't cancel timeout context after failure and first loop #32

Merged
merged 1 commit into rancher:master on Nov 23, 2022

Conversation

ibuildthecloud
Contributor

By canceling the timeoutCtx here, any cache later in the toWait list will be
canceled if it hasn't finished syncing yet. The side effect is that we only
really wait for the first cache, and any other cache that isn't done gets canceled.

This is the source of errors like

failed to sync schemas: failed to sync cache for /v1, Kind=Secret

You may see these errors in the Rancher cluster agent when Steve is
starting, but you will definitely see these errors when running wtfk8s.

Signed-off-by: Darren Shepherd <darren@acorn.io>
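
For illustration, a minimal sketch of the pattern being described (not the repository's exact code; the function name, cache type, and timeout value are assumed, while toWait and timeoutCtx come from the description above):

```go
package example

import (
	"context"
	"fmt"
	"time"

	"k8s.io/client-go/tools/cache"
)

// waitForCaches is an illustrative sketch, not the actual code under review.
// A single timeout context guards waiting for every cache in toWait.
func waitForCaches(ctx context.Context, toWait []cache.SharedIndexInformer) error {
	timeoutCtx, cancel := context.WithTimeout(ctx, 15*time.Minute) // timeout value is illustrative
	defer cancel()                                                 // the context is always released here

	for i, c := range toWait {
		if !cache.WaitForCacheSync(timeoutCtx.Done(), c.HasSynced) {
			return fmt.Errorf("failed to sync cache %d", i)
		}
		// BUG (the behavior this PR removes): calling cancel() here after the
		// first cache syncs (or fails) cancels timeoutCtx for every cache later
		// in toWait that hasn't finished yet, so only the first cache is truly
		// waited on. The fix is to drop the in-loop cancel and rely on the
		// deferred cancel above.
		// cancel()
	}
	return nil
}
```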

@ibuildthecloud
Contributor Author

There's no need to cancel this context as it's canceled in the defer at the top.

@ibuildthecloud
Contributor Author

@kinarashah Could you please review? I can't assign reviewers.

@ibuildthecloud
Contributor Author

Can someone please, please, please merge this?

@superseb requested review from a team on August 18, 2022 07:15
Contributor

@KevinJoiner left a comment


Which problem are we trying to solve here?

  1. When we have multiple watchers we are not waiting for all caches to sync.
  2. When we have multiple watchers, we incorrectly error every time.

If we are only trying to solve problem 2, would this not cause us to wait for a sync that we do not care about?
I ask because I am not sure what the original intent was for canceling after a successful sync.

@a-blender requested a review from kinarashah on September 8, 2022 16:13
@a-blender removed the request for review from a team on October 11, 2022 15:37
@ibuildthecloud
Contributor Author

The issue is closer to "When we have multiple watchers we incorrectly error every time." With multiple watchers, if one fails we currently fail them all, which is not good. I'm the original author of this code, and the original assumption was that caches rarely fail to sync and that any such error would be transient. That assumption is not true: one persistent and easy way to cause a cache to fail to sync is for a custom apiserver to be down, and when that happens a lot fails.
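
For illustration only, a sketch of the kind of behavior this argues for, assuming client-go informers (this is not necessarily the merged change): keep waiting on the remaining caches when one fails to sync, and report the failures together instead of canceling everything.

```go
package example

import (
	"context"
	"errors"
	"fmt"

	"k8s.io/client-go/tools/cache"
)

// waitForAllCaches is an illustrative sketch, not the merged change: wait on
// every cache even when one fails (for example because its custom apiserver is
// down) and report the failures together instead of failing them all at once.
func waitForAllCaches(ctx context.Context, toWait []cache.SharedIndexInformer) error {
	var errs []error
	for i, c := range toWait {
		if !cache.WaitForCacheSync(ctx.Done(), c.HasSynced) {
			// Record the failure but keep waiting on the remaining caches.
			errs = append(errs, fmt.Errorf("failed to sync cache %d", i))
		}
	}
	return errors.Join(errs...) // nil when every cache synced
}
```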

@kinarashah merged commit d33a7d8 into rancher:master on Nov 23, 2022