-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix operator shutdown hanging when kvstore is enabled #24979
Conversation
/test |
ci-multicluster hit known failure #24990 |
/ci-multicluster |
test-1.27-net-next egressgw tests failed due to #24835 having been merged, as this one is not rebased. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks great! appreciate the detailed commit messages ➕
86aa873 ("controller: add new ControllerParam CancelDoFuncOnUpdate") introduced the possibility of canceling the context of the controller's DoFunc when the controller is updated, which is then reset. There's an extremely rare chance that this context is accessed while being overwritten. Hence, let's grab a local copy while holding the mutex. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
The Controller implementation currently supports the possibility of specifying a context among the parameters, which is propagated to the DoFunc. Though, once the context is canceled, the DoFunc is likely to never succeed again (unless it ignores it). Hence, let's pause the controller until it is either updated (refreshing the context) or properly stopped. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
f487647 ("operator: Wait for informers to shut down when stopping") introduced a WaitGroup to wait for the shutdown of all goroutines started by `legacy.onStart`. When running in kvstore mode, the operator spawns one goroutine to synchronize global services to the kvstore, continuously reading from a channel. Since that channel is never closed, the goroutine does not terminate, preventing the correct shutdown of the operator. This commit fixes it propagating the context, and stopping the goroutine when it is canceled. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
8751419
to
b81e664
Compare
/test Rerunning all tests after rebasing onto main, to fix the test-1.27-net-next failures. |
/ci-datapath A few matrix entries failed while creating the LVH VMs. Rerunning |
/test-runtime Hit know flake #22373 |
Reporting here the message of the last commit, which addresses the core issue, for convenience:
The first two commits instead, perform the following modifications to the controller implementation: