Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

map: fix reconciliation failure caused by out of sync errors number #26742

Merged
merged 1 commit into from
Aug 16, 2023

Conversation

giorio94
Copy link
Member

Cached maps come with a controller that retries Update/Delete operations in case an error occurred while synchronizing the given entry with the kernel. Currently, it relies on the outstandingErrors counter to determine whether reconciliation is necessary, as well as if any entry still needs to be processed in a subsequent operation.

Yet, it is possible that this counter gets out of sync, in particular in case the given error is resolved automatically when a subsequent operation acting on the same key succeed. If this happens, the reconciliation function will never complete successfully, as there will continue to be an error that cannot be resolved (as it no longer exists). The given controller will this continue failing forever until the agent gets restarted.

Let's fix this inferring the number of outstanding errors from the cache itself. The outstandingErrors variable is preserved (although converted to a boolean) to avoid iterating over the full cache in case it is known that no error occurred. Still, if this flag gets out of sync, the only consequence will be that the cache is iterated once to determine that there's actually no failure, ensuring that the reconciliation logic still converges properly.

Fix issue which caused the map reconciliation process to never complete successfully if the error resolved automatically

@giorio94 giorio94 added kind/bug This is a bug in the Cilium logic. sig/loader Impacts the loading of BPF programs into the kernel. release-note/bug This PR fixes an issue in a previous release of Cilium. backport/1.14 This PR represents a backport for Cilium 1.14.x of a PR that was merged to main. labels Jul 10, 2023
@giorio94 giorio94 requested a review from a team as a code owner July 10, 2023 16:29
@giorio94 giorio94 requested a review from ti-mo July 10, 2023 16:29
@giorio94
Copy link
Member Author

/test

@giorio94 giorio94 added needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch and removed backport/1.14 This PR represents a backport for Cilium 1.14.x of a PR that was merged to main. labels Jul 13, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.14.0 Jul 13, 2023
@ti-mo
Copy link
Contributor

ti-mo commented Jul 18, 2023

@giorio94 Thanks for the patch! Would you be able to craft a regression test for this? This code may undergo significant changes throughout 1.15/1.16.

Cached maps come with a controller that retries Update/Delete
operations in case an error occurred while synchronizing the given
entry with the kernel. Currently, it relies on the `outstandingErrors`
counter to determine whether reconciliation is necessary, as well as
if any entry still needs to be processed in a subsequent operation.

Yet, it is possible that this counter gets out of sync, in particular
in case the given error is resolved automatically when a subsequent
operation acting on the same key succeed. If this happens, the
reconciliation function will never complete successfully, as there
will continue to be an error that cannot be resolved (as it no
longer exists). The given controller will this continue failing
forever until the agent gets restarted.

Let's fix this inferring the number of outstanding errors from the
cache itself. The `outstandingErrors` variable is preserved (although
converted to a boolean) to avoid iterating over the full cache in case
it is known that no error occurred. Still, if this flag gets out of
sync, the only consequence will be that the cache is iterated once
to determine that there's actually no failure, ensuring that the
reconciliation logic still converges properly.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
@giorio94 giorio94 force-pushed the mio/bpf-map-sync-resolve-errors branch from 8fe6a4c to ed7ce3a Compare July 24, 2023 12:03
@giorio94
Copy link
Member Author

@giorio94 Thanks for the patch! Would you be able to craft a regression test for this? This code may undergo significant changes throughout 1.15/1.16.

I've added a unit test to assert that the error resolver works properly (and succeeds) both in case an element needs to be synchronized, and in case the issue already resolved automatically. The second test consistently fails without the patch introduced in this commit. @ti-mo PTAL.

@giorio94
Copy link
Member Author

/test

@aanm aanm added this to Needs backport from main in 1.14.1 Jul 26, 2023
@aanm aanm removed this from Needs backport from main in 1.14.0 Jul 26, 2023
@giorio94
Copy link
Member Author

/test

@giorio94
Copy link
Member Author

giorio94 commented Aug 8, 2023

Hey @ti-mo, would you have time to take another look at this?

@nebril nebril added this to Needs backport from main in 1.14.2 Aug 10, 2023
@nebril nebril removed this from Needs backport from main in 1.14.1 Aug 10, 2023
Copy link
Contributor

@ti-mo ti-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not super familiar with this code in particular, but the approach and code seem reasonable to me. I don't want to hold this up any longer, thanks for adding the test!

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Aug 16, 2023
@ti-mo ti-mo merged commit a331acd into cilium:main Aug 16, 2023
69 checks passed
@tklauser tklauser mentioned this pull request Aug 22, 2023
22 tasks
@tklauser tklauser added backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. and removed needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch labels Aug 22, 2023
@joestringer joestringer added backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. and removed backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. labels Aug 25, 2023
@joestringer joestringer moved this from Needs backport from main to Backport done to v1.14 in 1.14.2 Aug 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. kind/bug This is a bug in the Cilium logic. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium. sig/loader Impacts the loading of BPF programs into the kernel.
Projects
No open projects
1.14.2
Backport done to v1.14
Development

Successfully merging this pull request may close these issues.

None yet

4 participants