pkg/service: Handle leaked backends #24681
Conversation
Force-pushed 67e706e to 545f4ec
/test
Force-pushed 545f4ec to c332c19
Force-pushed c332c19 to ac7bbed
return err
// Restore service cache from BPF maps
if err := s.restoreServicesLocked(backendsById); err != nil {
	errs = multierr.Append(errs,
nit: I don't think this ever actually returns a non-nil error value.
Sorry, I don't understand. Can you elaborate? The zero value of `errs` is nil.
Sorry, I should've specified the actual line: `restoreServicesLocked` doesn't ever appear to return anything but nil for its error.
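For context on the pattern under discussion: the diff collects errors with multierr rather than returning on the first failure, and the aggregated error stays nil if nothing fails. A minimal, self-contained sketch of that behavior, assuming go.uber.org/multierr (illustrative only, not the actual Cilium code):

```go
package main

import (
	"errors"
	"fmt"

	"go.uber.org/multierr"
)

func main() {
	var errs error // zero value is nil

	// Appending nil leaves errs unchanged, so if every restore step
	// succeeds, the aggregated error is still nil.
	errs = multierr.Append(errs, nil)
	fmt.Println(errs == nil) // true

	// Real failures accumulate instead of aborting the restore early.
	errs = multierr.Append(errs, errors.New("restoring services failed"))
	errs = multierr.Append(errs, errors.New("restoring backends failed"))
	fmt.Println(errs) // restoring services failed; restoring backends failed
}
```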
Force-pushed 77b3f9f to 9748f25
/test
/test-1.26-net-next
All tests passed except 5.4 (the sole failure is the same as https://github.com/cilium/cilium/issues?q=is%3Aissue+is%3Aopen+K8sDatapathConfig+Host+firewall+With+VXLAN+and+endpoint+routes) and bpf-next. The net-next failures are unrelated to the PR changes, and the PR needs to be rebased.
The restore code attempts to reconcile datapath state with the userspace state post agent restart. Bailing out early on failures prevents any remediation from happening, so log any errors. Follow-up commits will try to handle leaked backends in the cluster if any. Signed-off-by: Aditi Ghag <aditi@cilium.io>
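To make the commit message concrete, here is a rough sketch of the intended restore flow: each step's failure is logged and aggregated rather than aborting early, so later remediation (such as cleaning up leaked backends) still runs. All function names here are hypothetical placeholders, not the actual Cilium APIs:

```go
package main

import (
	"log"

	"go.uber.org/multierr"
)

// Placeholders standing in for the real restore steps.
func restoreServices() error       { return nil }
func restoreBackends() error       { return nil }
func deleteLeakedBackends() error  { return nil }

func restore() error {
	var errs error

	// Services first: this establishes which backend IDs are still
	// referenced, so the backend pass can spot leaked or duplicate entries.
	if err := restoreServices(); err != nil {
		errs = multierr.Append(errs, err)
	}

	// Backends second; a failure here is collected, not fatal, so the
	// cleanup step below still gets a chance to run.
	if err := restoreBackends(); err != nil {
		errs = multierr.Append(errs, err)
	}

	// Finally, remove any backends left over from a previous agent run.
	if err := deleteLeakedBackends(); err != nil {
		errs = multierr.Append(errs, err)
	}
	return errs
}

func main() {
	if err := restore(); err != nil {
		log.Printf("datapath restore completed with errors: %v", err)
	}
}
```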
Nominated for backports to older branches: if any backends were leaked previously, they need to be cleaned up during the restore path after an agent restart.
@@ -1369,6 +1374,35 @@ func (s *Service) restoreBackendsLocked() error {
	logfields.BackendState:     b.State,
	logfields.BackendPreferred: b.Preferred,
}).Debug("Restoring backend")
if _, ok := svcBackendsById[b.ID]; !ok && s.backendRefCount[b.L3n4Addr.Hash()] != 0 {
@aditighag Too late to the party, but a few questions:

- Isn't the `s.backendRefCount[b.L3n4Addr.Hash()] != 0` check redundant, since the refcount of such a backend can be incremented only if it belongs to a service (thus, `!ok` would evaluate to false)?
- Do I read it correctly that backends in the `Terminating` state would be removed before the grace period elapses? I think it's not a big deal, as cilium-agent restarts are not expected to happen very often. But what happens if we receive an EndpointSlice delete event and cilium-agent doesn't have any reference to previously `Terminating` backends?
Another thing it would be helpful if you could elaborate on (e.g., in a commit message): under what circumstances did the backend leaks happen?
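One possible reading of the condition in the diff above, sketched as a toy example: a restored backend whose ID no service references, but whose address still has a non-zero refcount, looks like an obsolete duplicate (the same address is owned under a different ID). All names below are stand-ins for the real Cilium state, and the data is invented for illustration:

```go
package main

import "fmt"

// backend stands in for a restored BPF map entry; addr plays the role
// of b.L3n4Addr.Hash().
type backend struct {
	id   int
	addr string
}

func main() {
	// Backends restored from the BPF map; backend 3 duplicates the
	// address of backend 1 under a stale ID.
	restored := []backend{
		{1, "10.0.0.1:80"},
		{2, "10.0.0.2:80"},
		{3, "10.0.0.1:80"},
	}

	// Backend IDs referenced by at least one restored service.
	svcBackendsByID := map[int]struct{}{1: {}, 2: {}}

	// Refcounts keyed by backend address. A service-owned backend always
	// has a non-zero refcount, which is the basis of the redundancy
	// question above: for a unique address, !ok implies a zero refcount.
	backendRefCount := map[string]int{"10.0.0.1:80": 1, "10.0.0.2:80": 1}

	for _, b := range restored {
		if _, ok := svcBackendsByID[b.id]; !ok && backendRefCount[b.addr] != 0 {
			// No service references this ID, yet the address is in use:
			// the entry is a duplicate under a stale ID and can be deleted.
			fmt.Printf("backend %d (%s) looks like a duplicate; deleting\n", b.id, b.addr)
		}
	}
}
```

On this toy data only backend 3 is flagged; truly leaked backends (address refcount zero) would be handled by a separate cleanup pass.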
The current logic to restore backends is brittle, and doesn't account for failure scenarios effectively.
pkg/service: Handle duplicate backends
pkg/service: Restore services prior to backends
pkg/service: Don't bail out on failures
Relates: #23551
Signed-off-by: Aditi Ghag <aditi@cilium.io>