Don't cancel timeout context after failure and first loop #32

Merged
merged 1 commit into rancher:master on Nov 23, 2022

Conversation

ibuildthecloud
Contributor

By canceling the timeoutCtx here, any cache later in the toWait list will be
canceled if it hasn't finished syncing yet. The side effect is that we only
really wait for the first cache, and any other cache that isn't done gets canceled.

This is the source of errors like

failed to sync schemas: failed to sync cache for /v1, Kind=Secret

You may see these errors in the Rancher cluster agent when Steve is
starting, but you will definitely see these errors when running wtfk8s.

Signed-off-by: Darren Shepherd <darren@acorn.io>
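
For illustration, a minimal sketch of the pattern being described (not the repository's exact code; the function name, cache type, and timeout value are assumed, while toWait and timeoutCtx come from the description above):

```go
package example

import (
	"context"
	"fmt"
	"time"

	"k8s.io/client-go/tools/cache"
)

// waitForCaches is an illustrative sketch, not the actual code under review.
// A single timeout context guards waiting for every cache in toWait.
func waitForCaches(ctx context.Context, toWait []cache.SharedIndexInformer) error {
	timeoutCtx, cancel := context.WithTimeout(ctx, 15*time.Minute) // timeout value is illustrative
	defer cancel()                                                 // the context is always released here

	for i, c := range toWait {
		if !cache.WaitForCacheSync(timeoutCtx.Done(), c.HasSynced) {
			return fmt.Errorf("failed to sync cache %d", i)
		}
		// BUG (the behavior this PR removes): calling cancel() here after the
		// first cache syncs (or fails) cancels timeoutCtx for every cache later
		// in toWait that hasn't finished yet, so only the first cache is truly
		// waited on. The fix is to drop the in-loop cancel and rely on the
		// deferred cancel above.
		// cancel()
	}
	return nil
}
```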

@ibuildthecloud
Contributor Author

There's no need to cancel this context as it's canceled in the defer at the top.

@ibuildthecloud
Contributor Author

@kinarashah Could you please review? I can't assign reviewers.

@ibuildthecloud
Contributor Author

Can someone please, please, please merge this?

@superseb requested review from a team on August 18, 2022 07:15
Contributor

@KevinJoiner left a comment


Which problem are we trying to solve here?

  1. When we have multiple watchers we are not waiting for all caches to sync.
  2. When we have multiple watchers, we incorrectly error every time.

If we are only trying to solve problem 2, would this not cause us to wait for a sync that we do not care about?
I ask because I am not sure what the original intent was for canceling after a successful sync.

@a-blender requested a review from kinarashah on September 8, 2022 16:13
@a-blender removed the request for review from a team on October 11, 2022 15:37
@ibuildthecloud
Contributor Author

The issue is closer to "When we have multiple watchers we incorrectly error every time." With multiple watchers, if one fails we currently fail them all, which is not good. I'm the original author of this code, and the original assumption was that caches rarely fail to sync and that any such error would be transient. That assumption is not true: one persistent and easy way to cause a cache to fail to sync is for a custom apiserver to be down, and when that happens a lot fails.
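
For illustration only, a sketch of the kind of behavior this argues for, assuming client-go informers (this is not necessarily the merged change): keep waiting on the remaining caches when one fails to sync, and report the failures together instead of canceling everything.

```go
package example

import (
	"context"
	"errors"
	"fmt"

	"k8s.io/client-go/tools/cache"
)

// waitForAllCaches is an illustrative sketch, not the merged change: wait on
// every cache even when one fails (for example because its custom apiserver is
// down) and report the failures together instead of failing them all at once.
func waitForAllCaches(ctx context.Context, toWait []cache.SharedIndexInformer) error {
	var errs []error
	for i, c := range toWait {
		if !cache.WaitForCacheSync(ctx.Done(), c.HasSynced) {
			// Record the failure but keep waiting on the remaining caches.
			errs = append(errs, fmt.Errorf("failed to sync cache %d", i))
		}
	}
	return errors.Join(errs...) // nil when every cache synced
}
```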

@kinarashah merged commit d33a7d8 into rancher:master on Nov 23, 2022