
pkg/service: Handle leaked backends #24681

Merged · 3 commits · Apr 5, 2023

Conversation

@aditighag (Member) commented Apr 1, 2023

The current logic to restore backends is brittle, and doesn't account for failure scenarios effectively.

pkg/service: Handle duplicate backends

In certain error scenarios, backends can be leaked: they are deleted from
the userspace state but left in the datapath backends map. To reconcile
the datapath with userspace, identify backends that were created with
different IDs but the same L3n4Addr hash.
This commit builds on previous commits that no longer bail out on such
error conditions (e.g., backend ID mismatches during restore), and tracks
backends that are currently referenced in service entries restored from
the lb4_services map in order to restore backend entries.
Furthermore, it uses the tracked state to delete any duplicate backends
that were previously leaked.

Fixes: b79a4a53 (pkg/service: Gracefully terminate service backends)
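To make the duplicate-detection idea concrete, here is a minimal, self-contained sketch in Go. The types and the `leakedBackendIDs` helper are illustrative stand-ins, not the actual Cilium structures or API: the sketch flags datapath backends whose address hash is already owned by a different backend ID referenced from restored services.

```go
package main

import "fmt"

// Stand-in types for illustration; the real code uses backend IDs and
// L3n4Addr hashes from pkg/loadbalancer.
type BackendID uint32

type Backend struct {
	ID       BackendID
	AddrHash string // stand-in for L3n4Addr.Hash()
}

// leakedBackendIDs (hypothetical helper) returns the IDs of datapath backends
// whose address is already owned by a different backend ID referenced from
// restored services, i.e. previously leaked duplicates that can be deleted.
func leakedBackendIDs(datapathBackends []Backend, svcBackendIDByAddr map[string]BackendID) []BackendID {
	var leaked []BackendID
	for _, b := range datapathBackends {
		if ownerID, referenced := svcBackendIDByAddr[b.AddrHash]; referenced && ownerID != b.ID {
			leaked = append(leaked, b.ID)
		}
	}
	return leaked
}

func main() {
	backends := []Backend{
		{ID: 1, AddrHash: "10.0.1.1:80"}, // referenced by a restored service
		{ID: 7, AddrHash: "10.0.1.1:80"}, // same address, different ID: leaked
	}
	refs := map[string]BackendID{"10.0.1.1:80": 1}
	fmt.Println(leakedBackendIDs(backends, refs)) // prints [7]
}
```

The real code in pkg/service works over the full restored service and backend state; this sketch only captures the ID-versus-address comparison.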

pkg/service: Restore services prior to backends

The restore logic attempts to reconcile datapath state
with userspace state after agent restart.
Previously, it restored backends from the `lb4_backends`
map before restoring service entries from the `lb4_services`
map. If error scenarios occurred prior to the agent restart (for
example, the backend map filling up because of leaked backends), the logic
would fail to restore backends currently referenced in the services
map (and, as a result, selected for load-balancing traffic).

This commit prioritizes restoring service entries, followed by
backend entries. A follow-up commit handles error cases such as leaked
backends by keeping track of the backends referenced during restoration of
service entries, and then using that state to subsequently restore backends.
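A rough sketch of that ordering, with simplified stand-in types rather than the real BPF map dump structures (the `splitByReference` function is hypothetical): services are walked first to collect the referenced backend IDs, which then determine which backend entries are restored directly and which are left for the leak handling added in the follow-up commit.

```go
package main

import "fmt"

// Simplified stand-ins for entries dumped from the lb4_services and
// lb4_backends BPF maps; the real structs carry much more state.
type serviceEntry struct {
	Frontend   string
	BackendIDs []uint32
}

type backendEntry struct {
	ID   uint32
	Addr string
}

// splitByReference (hypothetical) mirrors the new ordering: collect the
// backend IDs referenced by restored services first, then partition the
// datapath backends into referenced entries (restored directly) and the
// rest, which a later pass can inspect for leaks.
func splitByReference(services []serviceEntry, backends []backendEntry) (referenced, orphaned []backendEntry) {
	refs := make(map[uint32]struct{})
	for _, svc := range services {
		for _, id := range svc.BackendIDs {
			refs[id] = struct{}{}
		}
	}
	for _, b := range backends {
		if _, ok := refs[b.ID]; ok {
			referenced = append(referenced, b)
		} else {
			orphaned = append(orphaned, b)
		}
	}
	return referenced, orphaned
}

func main() {
	svcs := []serviceEntry{{Frontend: "10.0.0.10:80", BackendIDs: []uint32{1, 2}}}
	bes := []backendEntry{
		{ID: 1, Addr: "10.0.1.1:80"},
		{ID: 2, Addr: "10.0.1.2:80"},
		{ID: 9, Addr: "10.0.1.1:80"}, // not referenced by any restored service
	}
	ref, orphan := splitByReference(svcs, bes)
	fmt.Println(len(ref), len(orphan)) // prints 2 1
}
```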

pkg/service: Don't bail out on failures

The restore code attempts to reconcile datapath state with
the userspace state after agent restart. Bailing out early
on failures prevents any remediation from happening, so
log any errors instead. Follow-up commits will handle leaked
backends in the cluster, if any.

Handle leaked service backends that may fill up the `lb4_backends` map and thereby cause connectivity issues.
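A hedged sketch of the "accumulate and continue" pattern, using go.uber.org/multierr as the PR's diff does; the restore functions here are placeholders for the real restore phases, not the actual Cilium methods.

```go
package main

import (
	"errors"
	"fmt"

	"go.uber.org/multierr"
)

// restoreAll runs every restore phase even when an earlier phase fails, so a
// single failure no longer prevents remediation (such as leaked-backend
// cleanup) in the remaining phases; the combined error is reported at the end.
func restoreAll() error {
	var errs error

	if err := restoreServices(); err != nil {
		errs = multierr.Append(errs, fmt.Errorf("restoring services: %w", err))
	}
	if err := restoreBackends(); err != nil {
		errs = multierr.Append(errs, fmt.Errorf("restoring backends: %w", err))
	}
	return errs
}

// Placeholder phases standing in for the real restore steps.
func restoreServices() error { return nil }
func restoreBackends() error { return errors.New("lb4_backends map full") }

func main() {
	fmt.Println(restoreAll())
}
```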

Relates: #23551

Signed-off-by: Aditi Ghag <aditi@cilium.io>

@aditighag aditighag requested a review from a team as a code owner April 1, 2023 03:03
@aditighag aditighag requested a review from aspsk April 1, 2023 03:03
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label label Apr 1, 2023
@aditighag aditighag marked this pull request as draft April 1, 2023 03:04
@aditighag aditighag force-pushed the pr/aditighag/handle-leaked-backends branch from 67e706e to 545f4ec Compare April 1, 2023 03:20
@aditighag aditighag requested review from joamaki and removed request for aspsk April 1, 2023 04:02
@aditighag (Member Author) commented:

/test

@tommyp1ckles tommyp1ckles self-requested a review April 3, 2023 17:04
Review thread on pkg/service/service.go (outdated, resolved)
@aditighag aditighag force-pushed the pr/aditighag/handle-leaked-backends branch from 545f4ec to c332c19 Compare April 3, 2023 21:52
@aditighag aditighag added the release-note/bug label Apr 3, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label label Apr 3, 2023
@aditighag aditighag force-pushed the pr/aditighag/handle-leaked-backends branch from c332c19 to ac7bbed Compare April 3, 2023 22:00
Review thread on pkg/service/service.go (outdated):
return err
// Restore service cache from BPF maps
if err := s.restoreServicesLocked(backendsById); err != nil {
errs = multierr.Append(errs,
Contributor:

nit: I don't think this ever actually returns a non-nil error value.

Member Author:

Sorry, I don't understand. Can you elaborate? The zero value of errs is nil.

Contributor:

Sorry, I should've specified the actual line: restoreServicesLocked doesn't ever appear to return anything but nil for its error.

Review thread on pkg/service/service.go (resolved)
@aditighag aditighag force-pushed the pr/aditighag/handle-leaked-backends branch 2 times, most recently from 77b3f9f to 9748f25 Compare April 3, 2023 23:23
@aditighag (Member Author) commented:

/test

@christarazi christarazi added the sig/datapath label Apr 4, 2023
@aditighag (Member Author) commented:

/test-1.26-net-next

@aditighag aditighag added needs-backport/1.11 and needs-backport/1.13 labels Apr 4, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.13.2 Apr 4, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.12.9 Apr 4, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.11.16 Apr 4, 2023
@aditighag (Member Author) commented:

All tests passed except 5.4 (the sole failure is the same as https://github.com/cilium/cilium/issues?q=is%3Aissue+is%3Aopen+K8sDatapathConfig+Host+firewall+With+VXLAN+and+endpoint+routes) and bpf-next.

[Screenshot: test results, 2023-04-04 10:31 AM]

The net-next failures are unrelated to the PR changes, and the PR needs to be rebased:

subsys=daemon
2023-04-04T15:24:45.955102385Z level=error msg="unable to initialize kube-proxy replacement options" error="Invalid value for --bpf-lb-dsr-dispatch: geneve" subsys=daemon
2023-04-04T15:24:45.955116481Z level=error msg="Start hook failed" error="daemon creation failed: unable to initialize kube-proxy replacement options: Invalid value for --bpf-lb-dsr-dispatch: geneve" function="cmd.newDaemonPromise.func1 (daemon_main.go:1623)" subsys=hive
2023-04-04T15:24:45.955198739Z level=info msg=Stopping subsys=hive
2023-04-04T15:24:45.955218839Z level=debug msg="Executing stop hook" function="egressgateway.NewEgressGatewayManager.func2 (manager.go:144)" subsys=hive
2023-04-04T15:24:45.955221537Z level=info msg="Stop hook executed" duration=801ns function="egressgateway.NewEgressGatewayManager.func2 (manager.go:144)" subsys=hive
2023-04-04T15:24:45.955223357Z level=debug msg="Executing stop hook" function="monitor.(*dropMonitor).OnStop" subsys=hive
2023-04-04T15:24:45.955248018Z level=info msg="Stop hook executed" duration=312ns function="monitor.(*dropMonitor).OnStop" subsys=hive
2023-04-04T15:24:45.955256824Z level=debug msg="Executing stop hook" function="*manager.manager.Stop" subsys=hive
2023-04-04T15:24:45.955271453Z level=info msg="Stop hook executed" duration="49.637µs" function="*manager.manager.Stop" subsys=hive
2023-04-04T15:24:45.955273869Z level=debug msg="Executing stop hook" function="cmd.newPolicyTrifecta.func2 (policy.go:132)" subsys=hive
2023-04-04T15:24:45.955314739Z level=panic msg="Close() called without calling InitIdentityAllocator() first" subsys=identity-cache
2023-04-04T15:24:45.957591934Z panic: (*logrus.Entry) 0xc000395b90
2023-04-04T15:24:45.957598935Z 
2023-04-04T15:24:45.957601435Z goroutine 1 [running]:
2023-04-04T15:24:45.957603648Z github.com/sirupsen/logrus.(*Entry).log(0xc00018e690, 0x0, {0xc000f5ee80, 0x3c})
2023-04-04T15:24:45.957605367Z 	/go/src/github.com/cilium/cilium/vendor/github.com/sirupsen/logrus/entry.go:260 +0x4d6
2023-04-04T15:24:45.957606946Z github.com/sirupsen/logrus.(*Entry).Log(0xc00018e690, 0x0, {0xc00106b5d8?, 0xc000efa000?, 0x3?})
2023-04-04T15:24:45.957608554Z 	/go/src/github.com/cilium/cilium/vendor/github.com/sirupsen/logrus/entry.go:304 +0x4f
2023-04-04T15:24:45.957610311Z github.com/sirupsen/logrus.(*Entry).Panic(...)
2023-04-04T15:24:45.957611885Z 	/go/src/github.com/cilium/cilium/vendor/github.com/sirupsen/logrus/entry.go:342
2023-04-04T15:24:45.957616152Z github.com/cilium/cilium/pkg/identity/cache.(*CachingIdentityAllocator).Close(0xc000cfc9c0)
2023-04-04T15:24:45.957617886Z 	/go/src/github.com/cilium/cilium/pkg/identity/cache/allocator.go:260 +0xdc
2023-04-04T15:24:45.957621003Z github.com/cilium/cilium/daemon/cmd.newPolicyTrifecta.func2({0x309f5a0?, 0x493e00?})

@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.13 in 1.13.2 Apr 5, 2023
@jibi jibi mentioned this pull request Apr 5, 2023
6 tasks
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.12 in 1.12.9 Apr 5, 2023
@jibi jibi mentioned this pull request Apr 5, 2023
5 tasks
@aditighag aditighag added affects/v1.12, affects/v1.13, and affects/v1.11 labels Apr 5, 2023
@aditighag (Member Author) commented:

Nominated for backports to older branches: if any backends were leaked previously, they need to be cleaned up during the restore path after agent restart.

@jibi jibi added backport-done/1.13 and removed backport-pending/1.13 labels Apr 7, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.13 to Backport done to v1.13 in 1.13.2 Apr 7, 2023
@jibi jibi added backport-done/1.12 and removed backport-pending/1.12 labels Apr 7, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.12 to Backport done to v1.12 in 1.12.9 Apr 7, 2023
@pchaigno pchaigno mentioned this pull request Apr 11, 2023
6 tasks
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.11 in 1.11.16 Apr 12, 2023
@michi-covalent michi-covalent added backport-done/1.11 and removed backport-pending/1.11 labels Apr 14, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.11 to Backport done to v1.11 in 1.11.16 Apr 14, 2023
@@ -1369,6 +1374,35 @@ func (s *Service) restoreBackendsLocked() error {
logfields.BackendState: b.State,
logfields.BackendPreferred: b.Preferred,
}).Debug("Restoring backend")
if _, ok := svcBackendsById[b.ID]; !ok && s.backendRefCount[b.L3n4Addr.Hash()] != 0 {
Member:

@aditighag Too late to the party, but a few questions:

  • Isn't the s.backendRefCount[b.L3n4Addr.Hash()] != 0 check redundant, since the refcount of such a backend can be incremented only if it belongs to a service (thus, !ok would evaluate to false)?
  • Do I read it correctly that backends in the Terminating state would be removed before the grace period elapses? I think it's not a big deal, as cilium-agent restarts are not expected to happen very often. But what happens if we receive an EndpointSlice delete event and cilium-agent doesn't have any reference to the previously Terminating backends?

Member:

Another thing it would be helpful to elaborate on (e.g., in a commit msg): under what circumstances did the backend leaks happen?

Labels
  • affects/v1.11, affects/v1.12, affects/v1.13: This issue affects the v1.11 / v1.12 / v1.13 branch.
  • backport-done/1.11, backport-done/1.12, backport-done/1.13: The backport for Cilium 1.11.x / 1.12.x / 1.13.x for this PR is done.
  • ready-to-merge: This PR has passed all tests and received consensus from code owners to merge.
  • release-blocker/1.12, release-blocker/1.13: This issue will prevent the release of the next version of Cilium.
  • release-note/bug: This PR fixes an issue in a previous release of Cilium.
  • sig/datapath: Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
Projects (no open projects)

  • 1.11.16: Backport done to v1.11
  • 1.12.9: Backport done to v1.12
  • 1.13.2: Backport done to v1.13

9 participants