

don't stop informer delivery on error #58394

Merged
merged 1 commit into from Jan 23, 2018

Conversation

deads2k
Contributor

@deads2k deads2k commented Jan 17, 2018

If an informer delivery fails today, we stop delivering to it entirely. This pull updates the code to skip that particular notification, delay briefly, and then continue delivery with the next one.

/assign derekwaynecarr
/assign ncdc
/assign ash2k

@derekwaynecarr This would change "the controller isn't doing anything?!" to "the controller missed my (individual) resource!"

NONE

@k8s-ci-robot k8s-ci-robot added do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 17, 2018
case deleteNotification:
	p.handler.OnDelete(notification.oldObj)
default:
	utilruntime.HandleError(fmt.Errorf("unrecognized notification: %#v", next))
}
}
Member


This loop runs until p.nextCh is closed, which means normal graceful shutdown. So, instead of adding stopCh as a parameter to the func, it is possible to make a channel right before wait.Until() is invoked and close() it right after the for loop. Does that make sense?
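A minimal sketch of that idea (hypothetical names; the real code lives in client-go's processorListener, and `until` below is only a local stand-in for wait.Until): the stop channel is created just before the retry loop and closed right after the for loop over nextCh, so closing nextCh doubles as the graceful-shutdown signal.

```go
package main

import (
	"fmt"
	"time"
)

// until mimics the shape of wait.Until: run f, then repeat every
// period, until stopCh is closed.
func until(f func(), period time.Duration, stopCh <-chan struct{}) {
	for {
		f()
		select {
		case <-stopCh:
			return
		case <-time.After(period):
		}
	}
}

// run delivers notifications from nextCh; instead of taking stopCh as
// a parameter, it creates the channel itself and closes it after the
// range loop, i.e. once nextCh has been closed and drained.
func run(nextCh <-chan string) []string {
	var delivered []string
	stopCh := make(chan struct{})
	until(func() {
		for next := range nextCh {
			delivered = append(delivered, next)
		}
		close(stopCh) // nextCh closed: normal shutdown, stop retrying
	}, time.Second, stopCh)
	return delivered
}

func main() {
	nextCh := make(chan string, 2)
	nextCh <- "add"
	nextCh <- "delete"
	close(nextCh)
	fmt.Println(run(nextCh)) // [add delete]
}
```

If the function passed to `until` panics and is restarted, it re-enters the range loop and only closes stopCh once nextCh is really closed, which is the behavior the comment above describes.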

@ash2k
Member

ash2k commented Jan 17, 2018

@wryun, I think we saw a case where a panic in an informer handler did not terminate the program but left it in a zombie state? Will this help (I don't remember the details)?

@wryun

wryun commented Jan 18, 2018

@ash2k yes, it will improve matters (we were panicking in the handler - this will at least stop it stalling after that). Without properly understanding the implications, though, I think I'd still prefer the whole process to shut down in our particular case... since the cache is probably now a mess, and subsequent processing might be incorrect (I think?).

@ash2k
Member

ash2k commented Jan 18, 2018

@wryun A panicking handler does not spoil the cache. Also, other handlers for the same event are executed (concurrently) regardless of this.

func (p *processorListener) run(stopCh <-chan struct{}) {
// this call blocks until the channel is closed. When a panic happens during the notification
// we will catch it, **the offending item will be skipped!**, and after a short delay (one second)
// the next notification will be attempted. This is usually better than the alternative of never
// delivering again.
Contributor


Hmmm... if there's a more permanent problem with delivering, this will keep failing every second. Why not use some backoff here as well?

@k8s-ci-robot
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 18, 2018
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jan 18, 2018
@deads2k
Contributor Author

deads2k commented Jan 18, 2018

Comments addressed. Take a look at the new impl

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jan 18, 2018
// we will catch it, **the offending item will be skipped!**, and after a short delay (one second)
// the next notification will be attempted. This is usually better than the alternative of never
// delivering again.
wait.ExponentialBackoff(retry.DefaultRetry, func() (bool, error) {
Member


// ExponentialBackoff repeats a condition check with exponential backoff.
//
// It checks the condition up to Steps times, increasing the wait by multiplying
// the previous duration by Factor.
//
// If Jitter is greater than zero, a random amount of each duration is added
// (between duration and duration*(1+jitter)).
//
// If the condition never returns true, ErrWaitTimeout is returned. All other
// errors terminate immediately.

So the current code will skip 5 errors (the value from retry.DefaultRetry) with exponential delay and then stop processing forever. Not what we want, right?
Maybe an event should be emitted in case of an error? Is there a class of events to notify the cluster administrator that something is not right?
I think ideally the backoff should be capped exponential but should reduce with time back to the initial delay if there are no errors. I personally would be happy with a fixed delay of a few seconds. Maybe a bigger fixed delay is ok, because the cluster slowdown will let the operator know that something is not working properly?

Contributor Author


That'll teach me to skip reading the godoc on the function. I could wrap it in a second loop that does a fairly long pause (a minute?) and then starts this over again?

Member


👍

@deads2k
Contributor Author

deads2k commented Jan 22, 2018

now with more retries.

@deads2k
Contributor Author

deads2k commented Jan 22, 2018

/retest

1 similar comment
@deads2k
Contributor Author

deads2k commented Jan 22, 2018

/retest

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 22, 2018
Member

@ash2k ash2k left a comment


/lgtm

p.s. I've recently learned a github trick - if you append ?w=1 to the url, it ignores whitespace changes in the diff. Much easier to review PRs like this one.

@deads2k deads2k added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 22, 2018
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

Approval requirements bypassed by manually added approval.

This pull-request has been approved by: ash2k, deads2k

No associated issue. Requirement bypassed by manually added approval.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@deads2k
Contributor Author

deads2k commented Jan 22, 2018

just a generated godeps.json file. Applying approval manually.

@sttts in case he gets in tomorrow in time for paperwork.

@k8s-github-robot

/test all [submit-queue is verifying that this PR is safe to merge]

@deads2k
Contributor Author

deads2k commented Jan 22, 2018

/retest

@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to @fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

@k8s-github-robot

Automatic merge from submit-queue (batch tested with PRs 58412, 56132, 58506, 58542, 58394). If you want to cherry-pick this change to another branch, please follow the instructions here.
