
pkg/service: Handle leaked backends #24681

Merged · 3 commits · Apr 5, 2023

Conversation

@aditighag (Member) commented Apr 1, 2023

The current logic to restore backends is brittle, and doesn't account for failure scenarios effectively.

pkg/service: Handle duplicate backends

In certain error scenarios, backends can be leaked: they are deleted from
the userspace state but left in the datapath backends map. To reconcile
the datapath with userspace, identify backends that were created with
different IDs but the same L3n4Addr hash.
This commit builds on previous commits that no longer bail out on such
error conditions (e.g., backend ID mismatches during restore), and tracks
backends that are currently referenced in service entries restored from
the lb4_services map in order to restore backend entries.
Furthermore, it uses the tracked state to delete any duplicate backends
that were previously leaked.

Fixes: b79a4a53 (pkg/service: Gracefully terminate service backends)
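To make the duplicate-detection idea concrete, here is a minimal, self-contained sketch in Go. The types and the `leakedBackendIDs` helper are illustrative stand-ins, not the actual Cilium structures or API: the sketch flags datapath backends whose address hash is already owned by a different backend ID referenced from restored services.

```go
package main

import "fmt"

// Stand-in types for illustration; the real code uses backend IDs and
// L3n4Addr hashes from pkg/loadbalancer.
type BackendID uint32

type Backend struct {
	ID       BackendID
	AddrHash string // stand-in for L3n4Addr.Hash()
}

// leakedBackendIDs (hypothetical helper) returns the IDs of datapath backends
// whose address is already owned by a different backend ID referenced from
// restored services, i.e. previously leaked duplicates that can be deleted.
func leakedBackendIDs(datapathBackends []Backend, svcBackendIDByAddr map[string]BackendID) []BackendID {
	var leaked []BackendID
	for _, b := range datapathBackends {
		if ownerID, referenced := svcBackendIDByAddr[b.AddrHash]; referenced && ownerID != b.ID {
			leaked = append(leaked, b.ID)
		}
	}
	return leaked
}

func main() {
	backends := []Backend{
		{ID: 1, AddrHash: "10.0.1.1:80"}, // referenced by a restored service
		{ID: 7, AddrHash: "10.0.1.1:80"}, // same address, different ID: leaked
	}
	refs := map[string]BackendID{"10.0.1.1:80": 1}
	fmt.Println(leakedBackendIDs(backends, refs)) // prints [7]
}
```

The real code in pkg/service works over the full restored service and backend state; this sketch only captures the ID-versus-address comparison.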

pkg/service: Restore services prior to backends

The restore logic attempts to reconcile datapath state
with userspace state after agent restart.
Previously, it restored backends from the `lb4_backends`
map before restoring service entries from the `lb4_services`
map. If error scenarios occurred prior to the agent restart (for
example, the backend map filling up because of leaked backends), the logic
would fail to restore backends currently referenced in the services
map (and, as a result, selected for load-balancing traffic).

This commit prioritizes restoring service entries, followed by
backend entries. A follow-up commit handles error cases such as leaked
backends by keeping track of the backends referenced during restoration of
service entries, and then using that state to subsequently restore backends.
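A rough sketch of that ordering, with simplified stand-in types rather than the real BPF map dump structures (the `splitByReference` function is hypothetical): services are walked first to collect the referenced backend IDs, which then determine which backend entries are restored directly and which are left for the leak handling added in the follow-up commit.

```go
package main

import "fmt"

// Simplified stand-ins for entries dumped from the lb4_services and
// lb4_backends BPF maps; the real structs carry much more state.
type serviceEntry struct {
	Frontend   string
	BackendIDs []uint32
}

type backendEntry struct {
	ID   uint32
	Addr string
}

// splitByReference (hypothetical) mirrors the new ordering: collect the
// backend IDs referenced by restored services first, then partition the
// datapath backends into referenced entries (restored directly) and the
// rest, which a later pass can inspect for leaks.
func splitByReference(services []serviceEntry, backends []backendEntry) (referenced, orphaned []backendEntry) {
	refs := make(map[uint32]struct{})
	for _, svc := range services {
		for _, id := range svc.BackendIDs {
			refs[id] = struct{}{}
		}
	}
	for _, b := range backends {
		if _, ok := refs[b.ID]; ok {
			referenced = append(referenced, b)
		} else {
			orphaned = append(orphaned, b)
		}
	}
	return referenced, orphaned
}

func main() {
	svcs := []serviceEntry{{Frontend: "10.0.0.10:80", BackendIDs: []uint32{1, 2}}}
	bes := []backendEntry{
		{ID: 1, Addr: "10.0.1.1:80"},
		{ID: 2, Addr: "10.0.1.2:80"},
		{ID: 9, Addr: "10.0.1.1:80"}, // not referenced by any restored service
	}
	ref, orphan := splitByReference(svcs, bes)
	fmt.Println(len(ref), len(orphan)) // prints 2 1
}
```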

pkg/service: Don't bail out on failures

The restore code attempts to reconcile datapath state with
the userspace state after agent restart. Bailing out early
on failures prevents any remediation from happening, so
log any errors instead. Follow-up commits will handle leaked
backends in the cluster, if any.

Handle leaked service backends that may fill up the `lb4_backends` map and thereby cause connectivity issues.
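A hedged sketch of the "accumulate and continue" pattern, using go.uber.org/multierr as the PR's diff does; the restore functions here are placeholders for the real restore phases, not the actual Cilium methods.

```go
package main

import (
	"errors"
	"fmt"

	"go.uber.org/multierr"
)

// restoreAll runs every restore phase even when an earlier phase fails, so a
// single failure no longer prevents remediation (such as leaked-backend
// cleanup) in the remaining phases; the combined error is reported at the end.
func restoreAll() error {
	var errs error

	if err := restoreServices(); err != nil {
		errs = multierr.Append(errs, fmt.Errorf("restoring services: %w", err))
	}
	if err := restoreBackends(); err != nil {
		errs = multierr.Append(errs, fmt.Errorf("restoring backends: %w", err))
	}
	return errs
}

// Placeholder phases standing in for the real restore steps.
func restoreServices() error { return nil }
func restoreBackends() error { return errors.New("lb4_backends map full") }

func main() {
	fmt.Println(restoreAll())
}
```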

Relates: #23551

Signed-off-by: Aditi Ghag <aditi@cilium.io>

@aditighag aditighag requested a review from a team as a code owner April 1, 2023 03:03
@aditighag aditighag requested a review from aspsk April 1, 2023 03:03
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label label Apr 1, 2023
@aditighag aditighag marked this pull request as draft April 1, 2023 03:04
@aditighag aditighag force-pushed the pr/aditighag/handle-leaked-backends branch from 67e706e to 545f4ec Compare April 1, 2023 03:20
@aditighag aditighag requested review from joamaki and removed request for aspsk April 1, 2023 04:02
@aditighag (Member Author) commented:

/test

@tommyp1ckles tommyp1ckles self-requested a review April 3, 2023 17:04
Review thread on pkg/service/service.go (outdated, resolved)
@aditighag aditighag force-pushed the pr/aditighag/handle-leaked-backends branch from 545f4ec to c332c19 Compare April 3, 2023 21:52
@aditighag aditighag added the release-note/bug label Apr 3, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label label Apr 3, 2023
@aditighag aditighag force-pushed the pr/aditighag/handle-leaked-backends branch from c332c19 to ac7bbed Compare April 3, 2023 22:00
Review thread on pkg/service/service.go (outdated):
return err
// Restore service cache from BPF maps
if err := s.restoreServicesLocked(backendsById); err != nil {
errs = multierr.Append(errs,
Contributor:

nit: I don't think this ever actually returns a non-nil error value.

Member Author:

Sorry, I don't understand. Can you elaborate? The zero value of errs is nil.

Contributor:

Sorry, I should've specified the actual line: restoreServicesLocked doesn't ever appear to return anything but nil for its error.

Review thread on pkg/service/service.go (resolved)
@aditighag aditighag force-pushed the pr/aditighag/handle-leaked-backends branch 2 times, most recently from 77b3f9f to 9748f25 Compare April 3, 2023 23:23
@aditighag (Member Author) commented:

/test

@christarazi christarazi added the sig/datapath label Apr 4, 2023
@aditighag (Member Author) commented:

/test-1.26-net-next

@aditighag aditighag added needs-backport/1.11 and needs-backport/1.13 labels Apr 4, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.13.2 Apr 4, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.12.9 Apr 4, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.11.16 Apr 4, 2023
@aditighag (Member Author) commented:

All tests passed except 5.4 (the sole failure is the same as https://github.com/cilium/cilium/issues?q=is%3Aissue+is%3Aopen+K8sDatapathConfig+Host+firewall+With+VXLAN+and+endpoint+routes) and bpf-next.

[Screenshot: test results, 2023-04-04 10:31 AM]

The net-next failures are unrelated to the PR changes, and the PR needs to be rebased:

subsys=daemon
2023-04-04T15:24:45.955102385Z level=error msg="unable to initialize kube-proxy replacement options" error="Invalid value for --bpf-lb-dsr-dispatch: geneve" subsys=daemon
2023-04-04T15:24:45.955116481Z level=error msg="Start hook failed" error="daemon creation failed: unable to initialize kube-proxy replacement options: Invalid value for --bpf-lb-dsr-dispatch: geneve" function="cmd.newDaemonPromise.func1 (daemon_main.go:1623)" subsys=hive
2023-04-04T15:24:45.955198739Z level=info msg=Stopping subsys=hive
2023-04-04T15:24:45.955218839Z level=debug msg="Executing stop hook" function="egressgateway.NewEgressGatewayManager.func2 (manager.go:144)" subsys=hive
2023-04-04T15:24:45.955221537Z level=info msg="Stop hook executed" duration=801ns function="egressgateway.NewEgressGatewayManager.func2 (manager.go:144)" subsys=hive
2023-04-04T15:24:45.955223357Z level=debug msg="Executing stop hook" function="monitor.(*dropMonitor).OnStop" subsys=hive
2023-04-04T15:24:45.955248018Z level=info msg="Stop hook executed" duration=312ns function="monitor.(*dropMonitor).OnStop" subsys=hive
2023-04-04T15:24:45.955256824Z level=debug msg="Executing stop hook" function="*manager.manager.Stop" subsys=hive
2023-04-04T15:24:45.955271453Z level=info msg="Stop hook executed" duration="49.637µs" function="*manager.manager.Stop" subsys=hive
2023-04-04T15:24:45.955273869Z level=debug msg="Executing stop hook" function="cmd.newPolicyTrifecta.func2 (policy.go:132)" subsys=hive
2023-04-04T15:24:45.955314739Z level=panic msg="Close() called without calling InitIdentityAllocator() first" subsys=identity-cache
2023-04-04T15:24:45.957591934Z panic: (*logrus.Entry) 0xc000395b90
2023-04-04T15:24:45.957598935Z 
2023-04-04T15:24:45.957601435Z goroutine 1 [running]:
2023-04-04T15:24:45.957603648Z github.com/sirupsen/logrus.(*Entry).log(0xc00018e690, 0x0, {0xc000f5ee80, 0x3c})
2023-04-04T15:24:45.957605367Z 	/go/src/github.com/cilium/cilium/vendor/github.com/sirupsen/logrus/entry.go:260 +0x4d6
2023-04-04T15:24:45.957606946Z github.com/sirupsen/logrus.(*Entry).Log(0xc00018e690, 0x0, {0xc00106b5d8?, 0xc000efa000?, 0x3?})
2023-04-04T15:24:45.957608554Z 	/go/src/github.com/cilium/cilium/vendor/github.com/sirupsen/logrus/entry.go:304 +0x4f
2023-04-04T15:24:45.957610311Z github.com/sirupsen/logrus.(*Entry).Panic(...)
2023-04-04T15:24:45.957611885Z 	/go/src/github.com/cilium/cilium/vendor/github.com/sirupsen/logrus/entry.go:342
2023-04-04T15:24:45.957616152Z github.com/cilium/cilium/pkg/identity/cache.(*CachingIdentityAllocator).Close(0xc000cfc9c0)
2023-04-04T15:24:45.957617886Z 	/go/src/github.com/cilium/cilium/pkg/identity/cache/allocator.go:260 +0xdc
2023-04-04T15:24:45.957621003Z github.com/cilium/cilium/daemon/cmd.newPolicyTrifecta.func2({0x309f5a0?, 0x493e00?})

@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.13 in 1.13.2 Apr 5, 2023
@jibi jibi mentioned this pull request Apr 5, 2023
6 tasks
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.12 in 1.12.9 Apr 5, 2023
@jibi jibi mentioned this pull request Apr 5, 2023
5 tasks
@aditighag aditighag added affects/v1.12, affects/v1.13, and affects/v1.11 labels Apr 5, 2023
@aditighag (Member Author) commented:

Nominated for backports to older branches: if any backends were leaked previously, they need to be cleaned up during the restore path after agent restart.

@jibi jibi added backport-done/1.13 and removed backport-pending/1.13 labels Apr 7, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.13 to Backport done to v1.13 in 1.13.2 Apr 7, 2023
@jibi jibi added backport-done/1.12 and removed backport-pending/1.12 labels Apr 7, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.12 to Backport done to v1.12 in 1.12.9 Apr 7, 2023
@pchaigno pchaigno mentioned this pull request Apr 11, 2023
6 tasks
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.11 in 1.11.16 Apr 12, 2023
@michi-covalent michi-covalent added backport-done/1.11 and removed backport-pending/1.11 labels Apr 14, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.11 to Backport done to v1.11 in 1.11.16 Apr 14, 2023
@@ -1369,6 +1374,35 @@ func (s *Service) restoreBackendsLocked() error {
logfields.BackendState: b.State,
logfields.BackendPreferred: b.Preferred,
}).Debug("Restoring backend")
if _, ok := svcBackendsById[b.ID]; !ok && s.backendRefCount[b.L3n4Addr.Hash()] != 0 {
Member:

@aditighag Too late to the party, but a few questions:

  • Isn't the s.backendRefCount[b.L3n4Addr.Hash()] != 0 check redundant, since the refcount of such a backend can be incremented only if it belongs to a service (thus, !ok would evaluate to false)?
  • Do I read it correctly that backends in the Terminating state would be removed before the grace period elapses? I think it's not a big deal, as cilium-agent restarts are not expected to happen very often. But what happens if we receive an EndpointSlice delete event and cilium-agent doesn't have any reference to the previously Terminating backends?

Member:

Another thing it would be helpful to elaborate on (e.g., in a commit msg): under what circumstances did the backend leaks happen?

Labels
  • affects/v1.11, affects/v1.12, affects/v1.13: This issue affects the v1.11 / v1.12 / v1.13 branch.
  • backport-done/1.11, backport-done/1.12, backport-done/1.13: The backport for Cilium 1.11.x / 1.12.x / 1.13.x for this PR is done.
  • ready-to-merge: This PR has passed all tests and received consensus from code owners to merge.
  • release-blocker/1.12, release-blocker/1.13: This issue will prevent the release of the next version of Cilium.
  • release-note/bug: This PR fixes an issue in a previous release of Cilium.
  • sig/datapath: Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
Projects (no open projects)

  • 1.11.16: Backport done to v1.11
  • 1.12.9: Backport done to v1.12
  • 1.13.2: Backport done to v1.13

9 participants