map: fix reconciliation failure caused by out of sync errors number #26742

giorio94 · 2023-07-10T16:29:06Z

Cached maps come with a controller that retries Update/Delete operations in case an error occurred while synchronizing the given entry with the kernel. Currently, it relies on the outstandingErrors counter to determine whether reconciliation is necessary, and whether any entry still needs to be processed in a subsequent operation.

Yet, it is possible that this counter gets out of sync, in particular when the given error is resolved automatically because a subsequent operation acting on the same key succeeds. If this happens, the reconciliation function will never complete successfully, as the counter will keep reporting an error that cannot be resolved (because it no longer exists). The given controller will thus continue failing forever, until the agent gets restarted.
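To make the failure mode concrete, here is a minimal, self-contained sketch of how a counter can drift. It is a simplified model written for illustration only, not the actual Cilium code: `counterCache`, `entry`, `update`, and the `kernelOK` parameter are hypothetical names, and the kernel write is simulated.

```go
package main

import "errors"

// entry is a hypothetical cache entry. syncErr is non-nil while the last
// attempt to write the entry to the kernel failed.
type entry struct {
	value   string
	syncErr error
}

// counterCache is a simplified stand-in for a cached map that tracks
// failures with a counter. It is an assumption-based model, not Cilium's type.
type counterCache struct {
	entries           map[string]*entry
	outstandingErrors int
}

var errKernel = errors.New("simulated kernel failure")

func newCounterCache() *counterCache {
	return &counterCache{entries: make(map[string]*entry)}
}

// update stores the value and records whether the simulated kernel write
// failed. The counter is incremented on failure, but it is not decremented
// when a later update of the same key succeeds: this is the drift.
func (c *counterCache) update(key, value string, kernelOK bool) {
	e := &entry{value: value}
	if !kernelOK {
		e.syncErr = errKernel
		c.outstandingErrors++
	}
	c.entries[key] = e
}

// resolveErrors repairs every entry that still carries an error and
// decrements the counter once per repaired entry. It fails as long as the
// counter is non-zero, even when no entry is in error anymore.
func (c *counterCache) resolveErrors() error {
	if c.outstandingErrors == 0 {
		return nil
	}
	for _, e := range c.entries {
		if e.syncErr != nil {
			e.syncErr = nil // pretend the retry succeeded
			c.outstandingErrors--
		}
	}
	if c.outstandingErrors > 0 {
		return errors.New("map sync errors remain")
	}
	return nil
}
```

In this model, a failed update followed by a successful update of the same key leaves the counter at 1 with no entry in error, so every later call to `resolveErrors` returns an error.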

Let's fix this by inferring the number of outstanding errors from the cache itself. The outstandingErrors variable is preserved (although converted to a boolean) to avoid iterating over the full cache when it is known that no error occurred. Still, if this flag gets out of sync, the only consequence is that the cache is iterated once to determine that there is actually no failure, ensuring that the reconciliation logic still converges properly.
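The approach described above can be sketched roughly as follows. This is again a simplified model rather than the code in this PR: `flagCache`, `entry`, and the `retry` callback are hypothetical names introduced for illustration, and the kernel write is simulated.

```go
package main

import (
	"errors"
	"fmt"
)

// entry is a hypothetical cache entry. syncErr is non-nil while the last
// attempt to write the entry to the kernel failed.
type entry struct {
	value   string
	syncErr error
}

// flagCache is an assumption-based model of a cached map where
// outstandingErrors is only a hint that the cache may contain failures.
type flagCache struct {
	entries           map[string]*entry
	outstandingErrors bool
}

var errKernel = errors.New("simulated kernel failure")

func newFlagCache() *flagCache {
	return &flagCache{entries: make(map[string]*entry)}
}

// update stores the value and raises the flag if the simulated kernel
// write failed. A later successful update does not clear the flag.
func (c *flagCache) update(key, value string, kernelOK bool) {
	e := &entry{value: value}
	if !kernelOK {
		e.syncErr = errKernel
		c.outstandingErrors = true
	}
	c.entries[key] = e
}

// resolveErrors skips the scan when the flag is unset. Otherwise it walks
// the cache, retries only the entries that are actually in error, and
// derives the remaining error count from the cache itself.
func (c *flagCache) resolveErrors(retry func(key string, e *entry) error) error {
	if !c.outstandingErrors {
		return nil
	}
	remaining := 0
	for k, e := range c.entries {
		if e.syncErr == nil {
			continue
		}
		if err := retry(k, e); err != nil {
			e.syncErr = err
			remaining++
			continue
		}
		e.syncErr = nil
	}
	// The flag is recomputed from what the scan found, so a stale flag
	// costs one iteration and then clears.
	c.outstandingErrors = remaining > 0
	if remaining > 0 {
		return fmt.Errorf("%d map sync errors remain", remaining)
	}
	return nil
}
```

With this shape, a failed update followed by a successful update of the same key leaves the flag set, but the next `resolveErrors` call finds no entry in error, clears the flag, and returns nil.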

Fix an issue that caused the map reconciliation process to never complete successfully if the error was resolved automatically.

giorio94 · 2023-07-11T06:57:01Z

/test

ti-mo · 2023-07-18T11:00:54Z

@giorio94 Thanks for the patch! Would you be able to craft a regression test for this? This code may undergo significant changes throughout 1.15/1.16.

Cached maps come with a controller that retries Update/Delete operations in case an error occurred while synchronizing the given entry with the kernel. Currently, it relies on the `outstandingErrors` counter to determine whether reconciliation is necessary, and whether any entry still needs to be processed in a subsequent operation.

Yet, it is possible that this counter gets out of sync, in particular when the given error is resolved automatically because a subsequent operation acting on the same key succeeds. If this happens, the reconciliation function will never complete successfully, as the counter will keep reporting an error that cannot be resolved (because it no longer exists). The given controller will thus continue failing forever, until the agent gets restarted.

Let's fix this by inferring the number of outstanding errors from the cache itself. The `outstandingErrors` variable is preserved (although converted to a boolean) to avoid iterating over the full cache when it is known that no error occurred. Still, if this flag gets out of sync, the only consequence is that the cache is iterated once to determine that there is actually no failure, ensuring that the reconciliation logic still converges properly.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>

giorio94 · 2023-07-24T12:06:31Z

> @giorio94 Thanks for the patch! Would you be able to craft a regression test for this? This code may undergo significant changes throughout 1.15/1.16.

I've added a unit test to assert that the error resolver works properly (and succeeds) both when an element needs to be synchronized and when the issue has already been resolved automatically. The second test consistently fails without the patch introduced in this commit. @ti-mo PTAL.

giorio94 · 2023-07-24T12:07:11Z

/test

giorio94 · 2023-07-27T08:39:26Z

/test

giorio94 · 2023-08-08T13:26:36Z

Hey @ti-mo, would you have time to take another look at this?

ti-mo

Not super familiar with this code in particular, but the approach and code seem reasonable to me. I don't want to hold this up any longer, thanks for adding the test!

giorio94 added kind/bug sig/loader release-note/bug backport/1.14 labels Jul 10, 2023

giorio94 requested a review from a team as a code owner July 10, 2023 16:29

giorio94 requested a review from ti-mo July 10, 2023 16:29

giorio94 added needs-backport/1.14 and removed backport/1.14 labels Jul 13, 2023

giorio94 force-pushed the mio/bpf-map-sync-resolve-errors branch from 8fe6a4c to ed7ce3a Compare July 24, 2023 12:03

ti-mo approved these changes Aug 16, 2023

maintainer-s-little-helper bot added the ready-to-merge label Aug 16, 2023

ti-mo merged commit a331acd into cilium:main Aug 16, 2023

tklauser mentioned this pull request Aug 22, 2023

v1.14 Backports 2023-08-22 #27629

Merged

22 tasks

tklauser added backport-pending/1.14 and removed needs-backport/1.14 labels Aug 22, 2023

joestringer added backport-done/1.14 and removed backport-pending/1.14 labels Aug 25, 2023

michi-covalent mentioned this pull request Sep 9, 2023

Prepare for release v1.14.2 #28052

Merged

tamilmani1989 mentioned this pull request Nov 2, 2023

Unable to update element for" cilium_lb4_backends_v2 map with file descriptor 17: the map is full, please consider resizing it. argument list too long" #28726

Closed

2 tasks
