-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
map: fix reconciliation failure caused by out of sync errors number #26742
Conversation
/test |
@giorio94 Thanks for the patch! Would you be able to craft a regression test for this? This code may undergo significant changes throughout 1.15/1.16. |
Cached maps come with a controller that retries Update/Delete operations in case an error occurred while synchronizing the given entry with the kernel. Currently, it relies on the `outstandingErrors` counter to determine whether reconciliation is necessary, as well as if any entry still needs to be processed in a subsequent operation. Yet, it is possible that this counter gets out of sync, in particular in case the given error is resolved automatically when a subsequent operation acting on the same key succeed. If this happens, the reconciliation function will never complete successfully, as there will continue to be an error that cannot be resolved (as it no longer exists). The given controller will this continue failing forever until the agent gets restarted. Let's fix this inferring the number of outstanding errors from the cache itself. The `outstandingErrors` variable is preserved (although converted to a boolean) to avoid iterating over the full cache in case it is known that no error occurred. Still, if this flag gets out of sync, the only consequence will be that the cache is iterated once to determine that there's actually no failure, ensuring that the reconciliation logic still converges properly. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
8fe6a4c
to
ed7ce3a
Compare
I've added a unit test to assert that the error resolver works properly (and succeeds) both in case an element needs to be synchronized, and in case the issue already resolved automatically. The second test consistently fails without the patch introduced in this commit. @ti-mo PTAL. |
/test |
1 similar comment
/test |
Hey @ti-mo, would you have time to take another look at this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not super familiar with this code in particular, but the approach and code seem reasonable to me. I don't want to hold this up any longer, thanks for adding the test!
Cached maps come with a controller that retries Update/Delete operations in case an error occurred while synchronizing the given entry with the kernel. Currently, it relies on the
outstandingErrors
counter to determine whether reconciliation is necessary, as well as if any entry still needs to be processed in a subsequent operation.Yet, it is possible that this counter gets out of sync, in particular in case the given error is resolved automatically when a subsequent operation acting on the same key succeed. If this happens, the reconciliation function will never complete successfully, as there will continue to be an error that cannot be resolved (as it no longer exists). The given controller will this continue failing forever until the agent gets restarted.
Let's fix this inferring the number of outstanding errors from the cache itself. The
outstandingErrors
variable is preserved (although converted to a boolean) to avoid iterating over the full cache in case it is known that no error occurred. Still, if this flag gets out of sync, the only consequence will be that the cache is iterated once to determine that there's actually no failure, ensuring that the reconciliation logic still converges properly.