Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 Quiet context.Canceled errors during shutdown #2745

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

cbandy
Copy link
Contributor

@cbandy cbandy commented Mar 30, 2024

Runnable implementations that return ctx.Err() cause a spurious "error received after stop" log message.

Fixes #1927

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Mar 30, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: cbandy
Once this PR has been reviewed and has the lgtm label, please assign fillzpp for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Mar 30, 2024
@k8s-ci-robot
Copy link
Contributor

Hi @cbandy. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Mar 30, 2024
Copy link
Member

@alvaroaleman alvaroaleman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a test, please?
/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 30, 2024
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Mar 30, 2024
@cbandy
Copy link
Contributor Author

cbandy commented Mar 30, 2024

@alvaroaleman Thanks for taking a look. Test added in an amended commit.

@alvaroaleman
Copy link
Member

/retest


logs := []string{}
options.Logger = funcr.NewJSON(func(object string) {
logs = append(logs, object)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a datarace which is why the tests are failing. You need to protect logs with a mutex or some such

Copy link
Contributor Author

@cbandy cbandy Apr 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A mutex here solved a race among append calls, but it revealed a race between this line and the assertion on L1091 during the LeaderElection suite. My guess is that this goroutine is to blame:

// Start the leader election and all required runnables.
{
ctx, cancel := context.WithCancel(context.Background())
cm.leaderElectionCancel = cancel
go func() {

😞 I added a time.Sleep for that suite for now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the idea is that currently with leaderelection engageStopProcedure is still writing the the logger while the test is asserting on the logs?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My guess would be because the RunnableFunc is not a LeaderElectionRunnable. So it won't wait for leaderelection to close start

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay reproduced locally. ^^ doesn't make sense as everything that doesn't implement LeaderElectionRunnable is treated like it needs leader election

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the reason why it only occurs with leader election is because in that case we get one more log message:
image
(the other log messages are all produced "synchronously" within engageStopProcedure)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cbandy @alvaroaleman If that makes sense to you both, let's add a godoc comment before the sleep and I think we're good

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 The fact that Start aggregates its err with stopErr makes me think it expects engageStopProcedure to return the outcome of stopping. Why does it inject a channel at all?

Should engageStopProcedure wait for the errChan-draining goroutine to finish?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 The fact that Start aggregates its err with stopErr makes me think it expects engageStopProcedure to return the outcome of stopping. Why does it inject a channel at all?

Not sure. I think the goal of this case case err := <-cm.errChan: in engageStopProcedure is mostly to avoid running into deadlocks because the errChan blocks.

Should engageStopProcedure wait for the errChan-draining goroutine to finish?

With the way it currently works it can't because this defer (defer close(stopComplete)) in Start is called too late.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we could wait in the Start func that the errChan-draining goroutine also finishes.

It would be a bit nicer, but I'm not sure if I want to introduce even more complexity into all of this. It's already pretty had to reason about it.

Runnable implementations that return ctx.Err() cause a spurious
"error received after stop" log message.
Comment on lines -488 to +489
case err, ok := <-cm.errChan:
if ok {
case err := <-cm.errChan:
if !errors.Is(err, context.Canceled) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not keep checking the ok value here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any calls to close errChan, so ok is always true (today).

Also, in my experience, reading from a closed channel in select is for some kind of control flow (break or return or so), or the channel is assigned nil so that it is not selected again.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any calls to close errChan, so ok is always true (today).

Sounds okay to me. I guess if we ever start closing the channel we have to check all usages of the channel anyway.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug: "error received after stop ..." is logged for context cancellation errors
5 participants