Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix operator shutdown hanging when kvstore is enabled #24979

Merged
merged 3 commits into from
May 1, 2023

Conversation

giorio94
Copy link
Member

Reporting here the message of the last commit, which addresses the core issue, for convenience:

f487647 ("operator: Wait for informers to shut down when stopping") introduced a WaitGroup to wait for the shutdown of all goroutines started by legacy.onStart. When running in kvstore mode, the operator spawns one goroutine to synchronize global services to the kvstore, continuously reading from a channel. Since that channel is never closed, the goroutine does not terminate, preventing the correct shutdown of the operator. This commit fixes it propagating the context, and stopping the goroutine when it is canceled.

The first two commits instead, perform the following modifications to the controller implementation:

  1. Access the controller context while holding the mutex, to avoid reading it while being overwritten on restart.
  2. Do not rerun the DoFunc once the context has been canceled. This caused spurious errors during operator shutdown, since the controllers handling the renew of etcd leases started failing repeatedly due to the context being closed.
Fix operator shutdown hanging when kvstore is enabled

@giorio94 giorio94 added kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. area/operator Impacts the cilium-operator component needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch labels Apr 19, 2023
@giorio94 giorio94 requested review from a team as code owners April 19, 2023 13:20
@giorio94
Copy link
Member Author

/test

@giorio94
Copy link
Member Author

ci-multicluster hit known failure #24990

@giorio94
Copy link
Member Author

/ci-multicluster

@giorio94
Copy link
Member Author

test-1.27-net-next egressgw tests failed due to #24835 having been merged, as this one is not rebased.

Copy link
Contributor

@thorn3r thorn3r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great! appreciate the detailed commit messages ➕

86aa873 ("controller: add new ControllerParam CancelDoFuncOnUpdate")
introduced the possibility of canceling the context of the controller's
DoFunc when the controller is updated, which is then reset. There's an
extremely rare chance that this context is accessed while being
overwritten. Hence, let's grab a local copy while holding the mutex.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
The Controller implementation currently supports the possibility of
specifying a context among the parameters, which is propagated to the
DoFunc. Though, once the context is canceled, the DoFunc is likely to
never succeed again (unless it ignores it). Hence, let's pause the
controller until it is either updated (refreshing the context) or
properly stopped.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
f487647 ("operator: Wait for informers to shut down when stopping")
introduced a WaitGroup to wait for the shutdown of all goroutines
started by `legacy.onStart`. When running in kvstore mode, the operator
spawns one goroutine to synchronize global services to the kvstore,
continuously reading from a channel. Since that channel is never closed,
the goroutine does not terminate, preventing the correct shutdown of the
operator. This commit fixes it propagating the context, and stopping the
goroutine when it is canceled.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
@giorio94 giorio94 force-pushed the mio/operator-shutdown-kvstore branch from 8751419 to b81e664 Compare April 28, 2023 06:20
@giorio94
Copy link
Member Author

giorio94 commented Apr 28, 2023

/test

Rerunning all tests after rebasing onto main, to fix the test-1.27-net-next failures.

@giorio94
Copy link
Member Author

giorio94 commented Apr 28, 2023

/ci-datapath

A few matrix entries failed while creating the LVH VMs. Rerunning

@giorio94
Copy link
Member Author

giorio94 commented Apr 28, 2023

/test-runtime

Hit know flake #22373

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Apr 28, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.13.3 Apr 28, 2023
@pchaigno pchaigno merged commit 1ca6185 into cilium:main May 1, 2023
56 checks passed
@joamaki joamaki mentioned this pull request May 2, 2023
8 tasks
@joamaki joamaki added backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. and removed needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch labels May 2, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from main to Backport pending to v1.13 in 1.13.3 May 2, 2023
@julianwiedmann julianwiedmann added backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. and removed backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. labels May 9, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.13 to Backport done to v1.13 in 1.13.3 May 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/operator Impacts the cilium-operator component backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. kind/bug This is a bug in the Cilium logic. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
No open projects
1.13.3
Backport done to v1.13
Development

Successfully merging this pull request may close these issues.

None yet

5 participants