

don't stop informer delivery on error #58394

Merged
merged 1 commit into from Jan 23, 2018

Conversation

deads2k
Contributor

@deads2k deads2k commented Jan 17, 2018

If an informer delivery fails today, we stop delivering to it entirely. This pull updates the code to skip that particular notification, delay briefly, and then continue delivery with the next one.

/assign derekwaynecarr
/assign ncdc
/assign ash2k

@derekwaynecarr This would change "the controller isn't doing anything?!" to "the controller missed my (individual) resource!"

NONE

@k8s-ci-robot k8s-ci-robot added do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 17, 2018
case deleteNotification:
	p.handler.OnDelete(notification.oldObj)
default:
	utilruntime.HandleError(fmt.Errorf("unrecognized notification: %#v", next))
}
}
Member


This loop runs until p.nextCh is closed, which means normal graceful shutdown. So, instead of adding stopCh as a parameter to the func, it is possible to make a channel right before wait.Until() is invoked and close() it right after the for loop. Does that make sense?
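A minimal sketch of that idea (hypothetical names; the real code lives in client-go's processorListener, and `until` below is only a local stand-in for wait.Until): the stop channel is created just before the retry loop and closed right after the for loop over nextCh, so closing nextCh doubles as the graceful-shutdown signal.

```go
package main

import (
	"fmt"
	"time"
)

// until mimics the shape of wait.Until: run f, then repeat every
// period, until stopCh is closed.
func until(f func(), period time.Duration, stopCh <-chan struct{}) {
	for {
		f()
		select {
		case <-stopCh:
			return
		case <-time.After(period):
		}
	}
}

// run delivers notifications from nextCh; instead of taking stopCh as
// a parameter, it creates the channel itself and closes it after the
// range loop, i.e. once nextCh has been closed and drained.
func run(nextCh <-chan string) []string {
	var delivered []string
	stopCh := make(chan struct{})
	until(func() {
		for next := range nextCh {
			delivered = append(delivered, next)
		}
		close(stopCh) // nextCh closed: normal shutdown, stop retrying
	}, time.Second, stopCh)
	return delivered
}

func main() {
	nextCh := make(chan string, 2)
	nextCh <- "add"
	nextCh <- "delete"
	close(nextCh)
	fmt.Println(run(nextCh)) // [add delete]
}
```

If the function passed to `until` panics and is restarted, it re-enters the range loop and only closes stopCh once nextCh is really closed, which is the behavior the comment above describes.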

@ash2k
Member

ash2k commented Jan 17, 2018

@wryun, I think we saw a case where a panic in an informer handler did not terminate the program but left it in a zombie state? Will this help (I don't remember the details)?

@wryun

wryun commented Jan 18, 2018

@ash2k yes, it will improve matters (we were panicking in the handler - this will at least stop it stalling after that). Without properly understanding the implications, though, I think I'd still prefer the whole process to shut down in our particular case... since the cache is probably now a mess, and subsequent processing might be incorrect (I think?).

@ash2k
Member

ash2k commented Jan 18, 2018

@wryun A panicking handler does not spoil the cache. Also, other handlers for the same event are executed (concurrently) regardless of this.

func (p *processorListener) run(stopCh <-chan struct{}) {
// this call blocks until the channel is closed. When a panic happens during the notification
// we will catch it, **the offending item will be skipped!**, and after a short delay (one second)
// the next notification will be attempted. This is usually better than the alternative of never
// delivering again.
Contributor


Hmmm... if there's a more permanent problem with delivering, this will keep failing every second. Why not use some backoff here as well?

@k8s-ci-robot
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 18, 2018
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jan 18, 2018
@deads2k
Contributor Author

deads2k commented Jan 18, 2018

Comments addressed. Take a look at the new impl

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jan 18, 2018
// we will catch it, **the offending item will be skipped!**, and after a short delay (one second)
// the next notification will be attempted. This is usually better than the alternative of never
// delivering again.
wait.ExponentialBackoff(retry.DefaultRetry, func() (bool, error) {
Member


// ExponentialBackoff repeats a condition check with exponential backoff.
//
// It checks the condition up to Steps times, increasing the wait by multiplying
// the previous duration by Factor.
//
// If Jitter is greater than zero, a random amount of each duration is added
// (between duration and duration*(1+jitter)).
//
// If the condition never returns true, ErrWaitTimeout is returned. All other
// errors terminate immediately.

So the current code will skip 5 errors (the value from retry.DefaultRetry) with exponential delay and then stop processing forever. Not what we want, right?
Maybe an event should be emitted in case of an error? Is there a class of events to notify the cluster administrator that something is not right?
I think ideally the backoff should be capped exponential but should reduce with time back to the initial delay if there are no errors. I personally would be happy with a fixed delay of a few seconds. Maybe a bigger fixed delay is ok, because the cluster slowdown will let the operator know that something is not working properly?

Contributor Author


That'll teach me to skip reading the godoc on the function. I could wrap it in a second loop that does a fairly long pause (a minute?) and then starts this over again?

Member


👍

@deads2k
Contributor Author

deads2k commented Jan 22, 2018

now with more retries.

@deads2k
Contributor Author

deads2k commented Jan 22, 2018

/retest

1 similar comment
@deads2k
Contributor Author

deads2k commented Jan 22, 2018

/retest

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 22, 2018
Member

@ash2k ash2k left a comment


/lgtm

p.s. I've recently learned a github trick - if you append ?w=1 to the url, it ignores whitespace changes in the diff. Much easier to review PRs like this one.

@deads2k deads2k added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 22, 2018
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

Approval requirements bypassed by manually added approval.

This pull-request has been approved by: ash2k, deads2k

No associated issue. Requirement bypassed by manually added approval.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@deads2k
Contributor Author

deads2k commented Jan 22, 2018

just a generated godeps.json file. Applying approval manually.

@sttts in case he gets in tomorrow in time for paperwork.

@k8s-github-robot

/test all [submit-queue is verifying that this PR is safe to merge]

@deads2k
Contributor Author

deads2k commented Jan 22, 2018

/retest

@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to @fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

@k8s-github-robot

Automatic merge from submit-queue (batch tested with PRs 58412, 56132, 58506, 58542, 58394). If you want to cherry-pick this change to another branch, please follow the instructions here.
