
services: prevent temporary connectivity loss on agent restart #26912

Merged
aanm merged 1 commit into cilium:main from mio/service-obsolete-backends on Jul 24, 2023

Conversation

giorio94
Member

@giorio94 giorio94 commented Jul 19, 2023

Cilium already implements a restore path to prevent dropping existing connections on agent restart. Yet, there is currently an issue which causes the removal of valid backends from a service when receiving incomplete service updates, either because the backends are spread across multiple endpointslices or because some belong to remote clusters. Indeed, all previously known backends get replaced with the ones we just heard about (i.e., those present in the service cache event), possibly causing connectivity disruptions.

Let's prevent this behavior by keeping a list of restored backends for each service, and continuing to merge them with the ones we receive updates for, until the bootstrap phase completes. After synchronization, an update is triggered for each service still associated with stale backends, so that they can be removed.
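As an illustration only, here is a minimal sketch of the merging idea with hypothetical types and names (not the actual Cilium code): backends restored from the BPF maps are kept per service and re-added to any partial update received before synchronization completes.

```go
package example

// Backend is a simplified, hypothetical stand-in for Cilium's backend type.
type Backend struct {
	Addr string // e.g. "10.0.0.1:80"
}

// mergeWithRestored keeps every backend restored from the BPF maps that a
// (possibly partial) update did not mention, so that connections towards it
// are not dropped before the bootstrap phase completes. After synchronization,
// any leftover restored backends can finally be removed.
func mergeWithRestored(restored, updated []Backend) []Backend {
	seen := make(map[string]struct{}, len(updated))
	merged := append([]Backend(nil), updated...)
	for _, b := range updated {
		seen[b.Addr] = struct{}{}
	}
	for _, b := range restored {
		if _, ok := seen[b.Addr]; !ok {
			merged = append(merged, b)
		}
	}
	return merged
}
```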

Fixes: #23823
Fixes: #26944

Fix possible connection drops on agent restart when a service is associated with multiple endpointslices or has backends spread across multiple clusters

@giorio94 giorio94 added kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. area/clustermesh Relates to multi-cluster routing functionality in Cilium. needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch area/loadbalancing Impacts load-balancing and Kubernetes service implementations needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch labels Jul 19, 2023
@giorio94 giorio94 requested review from a team as code owners July 19, 2023 07:29
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.14.0 Jul 19, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.13.5 Jul 19, 2023
@giorio94
Member Author

/test

@giorio94 giorio94 force-pushed the mio/service-obsolete-backends branch from f4e0c20 to 3c7f68b on July 20, 2023 07:30
@giorio94
Member Author

/test

@aanm aanm added affects/v1.13 This issue affects v1.13 branch affects/v1.14 This issue affects v1.14 branch labels Jul 20, 2023
@brb
Member

brb commented Jul 20, 2023

This PR resolves #26944.

@joestringer joestringer added the release-blocker/1.14 This issue will prevent the release of the next version of Cilium. label Jul 20, 2023
Member

@christarazi christarazi left a comment

LGTM, good find!

Minor comments below.

pkg/service/service.go (outdated review thread, resolved)
@@ -1072,7 +1077,27 @@ func (s *Service) restoreAndDeleteOrphanSourceRanges() error {
//
// The removal is based on an assumption that during the sync period
// UpsertService() is going to be called for each alive service.
func (s *Service) SyncWithK8sFinished() error {
func (s *Service) SyncWithK8sFinished(ensurer func(k8s.ServiceID, *lock.StoppableWaitGroup) bool) error {
Member

Did you consider making this function return the set of services and then calling ensurer on the result of SyncWithK8sFinished instead? That might make it easier to understand what ensurer is for. Anonymous functions can make the code a bit more convenient to write, but they hinder readability. This is a minor comment though, because there's only one call site of SyncWithK8sFinished, so I'm mostly just wondering what you've considered.

Member Author

I personally don't have a strong opinion here. IMO the current approach has the advantage that the entire synchronization logic is self-contained in a single function, which is then easier to test. Another possibility might be to propagate the entire service cache (rather than the single function) as a parameter (but that would make testing harder), or to replace the anonymous function with an interface (I initially discarded this option as being more verbose, but it might improve clarity). WDYT?

Member

Given that, in an offline conversation about this function, we talked about it being executed in a controller although it's one-off, maybe we just defer the refactoring of this function anyway (into a hive job or something). I'm fine with it as is.
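For illustration, a hedged sketch of the two shapes discussed above, with hypothetical simplified types (the real signature takes a k8s.ServiceID and a *lock.StoppableWaitGroup, and Service here is only a stand-in for Cilium's service manager):

```go
package example

// Hypothetical, simplified stand-ins for the real types.
type ServiceID string

type Service struct {
	// Services whose restored backends were never confirmed by an update.
	staleServices map[ServiceID]struct{}
}

// Shape used in this PR: the caller injects an "ensurer" callback, which
// re-triggers an update for every service still associated with stale
// restored backends, so that they can finally be dropped.
func (s *Service) SyncWithK8sFinished(ensurer func(ServiceID) bool) error {
	for id := range s.staleServices {
		ensurer(id)
	}
	return nil
}

// Alternative discussed in the review: return the set of stale services and
// let the caller trigger the refresh itself.
func (s *Service) syncWithK8sFinishedAlt() ([]ServiceID, error) {
	stale := make([]ServiceID, 0, len(s.staleServices))
	for id := range s.staleServices {
		stale = append(stale, id)
	}
	return stale, nil
}
```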

Member

@aditighag aditighag left a comment

Nice find! Fix looks good to me overall with some requested changes.

Do we want this logic to execute in case of standalone LB?

removal of valid backends from a service when receiving incomplete service updates, either because backends are spread across multiple endpointslices or some belong to remote clusters.

I wonder how this issue went unnoticed until now. Is there a scale aspect to it, or did something change on the k8s side?
I'm not too familiar with clustermesh, so this might be a newbie question. But does clustermesh run multiple instances of a k8s api server? If not, why would it matter that backends belong to remote clusters?

pkg/service/service.go (outdated review thread, resolved)
pkg/service/service.go (outdated review thread, resolved)
@@ -611,7 +615,7 @@ func (m *ManagerTestSuite) TestSyncWithK8sFinished(c *C) {

// cilium-agent finished the initialization, and thus SyncWithK8sFinished
// is called
err = m.svc.SyncWithK8sFinished()
err = m.svc.SyncWithK8sFinished(func(k8s.ServiceID, *lock.StoppableWaitGroup) bool { return true })
Member

SyncWithK8sFinished is called conditionally (see initRestore). Can we add one more test with the restore flag disabled, just to make sure we don't have any oops moments?

Member Author

I've made a few attempts here, but I couldn't find any good way to test this condition without a large refactoring. The main issue is that the synchronization logic is triggered asynchronously through a controller, which makes it very difficult to wait for its termination. Given that the initRestore function is already untested (hence there's no check that the other operations are also skipped when RestoreState is unset) and the condition is pretty trivial, I'd personally skip the implementation of this test for the moment.

Member

It isn't so much about testing initRestore, as about making sure there are no deadlocks, and that we don't assume syncWithK8sFinished is always invoked (e.g., standalone LB).

Member Author

Yeah, that logic is already triggered only when RestoreState is set and the clientset is enabled (which excludes the standalone LB case). It is currently tricky to further test these aspects, because the syncWithK8sFinished function (which by itself is covered by unit tests) is run asynchronously.

I'd propose to defer the introduction of additional tests to a subsequent refactoring which extracts this logic to be executed by a hive job, as mentioned by Chris. I'd not include the refactoring in this PR though, to reduce churn in older versions and because it would complicate backporting, given that support for hive jobs has only been introduced recently.

Member

@ysksuzuki ysksuzuki left a comment

The scenario in which the problem occurs is as follows, and this PR fixes it by delaying the deletion of the restored backends until the synchronization with kube-apiserver is complete. Is my understanding correct?

  • Restore services and backends from the BPF map
  • Sync with kube-apiserver and reflect the actual state
    • If the k8sServiceHandler receives an incomplete event like the following (event1 is incomplete), a valid backend is removed prematurely
      • event1{service1: backend1} <- a valid backend2 is removed (the informer cache sync is delayed, perhaps?)
      • event2{service1: backend1, backend2} <- the removed backend2 is back

Is this scenario, in which the handler receives an incomplete event, likely to happen when the backends are spread across multiple endpointslices or some belong to remote clusters?

pkg/service/service.go (outdated review thread, resolved)
@brb
Member

brb commented Jul 21, 2023

Do we want this logic to execute in case of standalone LB?

Today, the standalone LB mode doesn't assume any connectivity to kube-apiserver. Services are programmed through Cilium API.

@giorio94
Member Author

I wonder how this issue went unnoticed until now. Is there a scale aspect to it, or did something change on the k8s side?

Yes, in the sense that this issue doesn't occur when all the backends are contained in a single endpointslice (given that we then see only a single atomic update). By default, an endpointslice contains at most 100 endpoints, hence Cilium may currently remove valid backends only when a service is associated with more than 100 pods (and, depending on event ordering, this might still not occur).

I'm not too familiar with clustermesh, so this might be a newbie question. But does clustermesh run multiple instances of a k8s api server? If not, why would it matter that backends belong to remote clusters?

Essentially, each cluster exposes an etcd cluster and synchronizes into it a subset of the information available as Kubernetes resources, including the list of services which are marked as shared, together with the associated backends. Remote agents connect to these etcd clusters to pull the various objects and merge them with the local state. In the services case, the remote backends get merged into the service cache as if they were just a different endpointslice. I personally discovered the issue in this scenario, as a single backend is enough to trigger the connectivity drop (as opposed to the local case).

@brb
Member

brb commented Jul 21, 2023

I wonder how this issue went unnoticed until now. Is there a scale aspect to it, or did something change on the k8s side?

How I hit this: add DualStack services, and then do a Cilium upgrade.

@giorio94
Member Author

The scenario in which the problem occurs is as follows, and this PR fixes it by delaying the deletion of the restored backends until the synchronization with kube-apiserver is complete. Is my understanding correct?

* Restore services and backends from the BPF map

* Sync with kube-apiserver and reflect the actual state

  * If the k8sServiceHandler receives an incomplete event like the following (event1 is incomplete), a valid backend is removed prematurely

    * event1{service1: backend1} <- a valid backend2 is removed (the informer cache sync is delayed, perhaps?)
    * event2{service1: backend1, backend2} <- the removed backend2 is back

Exactly. Essentially, consider the case in which a given service foo is associated with the foo-1 and foo-2 epslices, each containing a set of backends.

As you mentioned, upon restart Cilium restores services and backends from the BPF map. Then it starts the service and epslice informers, both of which propagate events to the service cache. Let's say that we first receive the event about the service: at this point the service is considered not ready (as we have not yet seen any epslice), and nothing gets propagated. Then we receive the event for the foo-1 epslice; the service cache processes it and, given that the service now has backends, propagates an event down to the service subsystem for the service foo, including all the backends that are part of foo-1. At this point, all previously known backends get replaced by the new ones in the BPF maps, dropping the connections targeting the backends that were part of the foo-2 epslice. Once an event for that epslice is also seen, the backends are merged and restored.

The only case in which this issue doesn't happen is when we first receive both the foo-1 and foo-2 events from Kubernetes, and only afterwards the one for the service itself.

The modification implemented in this PR ensures that, in the service subsystem, we keep preserving the backends restored from the BPF maps even once we receive the first service update, given that at that point some backends might still be missing.

Is this scenario, in which the handler receives an incomplete event, likely to happen when the backends are spread across multiple endpointslices or some belong to remote clusters?

Yes. It doesn't happen in the case of a single endpointslice, because all backends would be part of a single event atomically received from the k8s informer. And the clustermesh case essentially corresponds to having another epslice for each remote cluster.
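To make the foo-1 / foo-2 example concrete, here is a self-contained, hypothetical simulation of the ordering described above (not Cilium code):

```go
package main

import "fmt"

// union returns a deduplicated union of two backend lists.
func union(a, b []string) []string {
	seen := map[string]struct{}{}
	var out []string
	for _, s := range append(append([]string{}, a...), b...) {
		if _, ok := seen[s]; !ok {
			seen[s] = struct{}{}
			out = append(out, s)
		}
	}
	return out
}

func main() {
	// Backends of service foo restored from the BPF maps on agent restart:
	// one from the foo-1 endpointslice, one from foo-2.
	restored := []string{"10.0.0.1:80", "10.0.0.2:80"}

	// First (partial) update after restart: only the foo-1 epslice has been seen.
	fromFoo1 := []string{"10.0.0.1:80"}

	// Without the fix, the datapath would be reprogrammed with fromFoo1 only,
	// dropping connections towards 10.0.0.2:80 until the foo-2 event arrives.
	// With the fix, restored backends keep being merged in until sync completes.
	fmt.Println(union(fromFoo1, restored)) // [10.0.0.1:80 10.0.0.2:80]
}
```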

@giorio94
Member Author

giorio94 commented Jul 21, 2023

How I hit this: add DualStack services, and then do a Cilium upgrade.

TIL that endpointslices are created per address family. Hence, if the service is dual-stack, we'll have two separate endpointslices (regardless of whether the pods are dual-stack or not), causing the same issue.

@giorio94 giorio94 force-pushed the mio/service-obsolete-backends branch from 3c7f68b to 3ccc5b6 on July 21, 2023 09:18
@giorio94
Member Author

/test

@giorio94 giorio94 requested a review from ysksuzuki July 21, 2023 09:20
@giorio94
Member Author

/test

@giorio94 giorio94 added affects/v1.11 This issue affects v1.11 branch affects/v1.12 This issue affects v1.12 branch labels Jul 21, 2023
@giorio94
Member Author

I've double-checked, and this issue additionally affects v1.11 (hence also v1.12).

Member

@aditighag aditighag left a comment

Do we want this logic to execute in case of standalone LB?

Today, the standalone LB mode doesn't assume any connectivity to kube-apiserver. Services are programmed through Cilium API.

Right, so should this logic be skipped for the standalone LB case?

Do you also want to mention the ipv{4,6} case that Martynas linked above in the commit description?

Approving the change, as I'll be on PTO next week. Please check on unit testing the standalone case where syncWithK8sFinished is a no-op. Thanks!

@maintainer-s-little-helper maintainer-s-little-helper bot added ready-to-merge This PR has passed all tests and received consensus from code owners to merge. labels Jul 21, 2023
@aditighag aditighag removed the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jul 21, 2023
@giorio94
Member Author

Do we want this logic to execute in case of standalone LB?

Today, the standalone LB mode doesn't assume any connectivity to kube-apiserver. Services are programmed through Cilium API.

Right, so should this logic be skipped for the standalone LB case?

AFAIU, this logic is skipped when running in standalone LB mode, because the k8s client is not enabled in that case, which is a requirement for the k8s-to-LB-maps synchronization. More in general, the synchronization logic was already present, and the changes in this PR only extend it with an additional cleanup step; they do not modify how it is triggered.
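As a rough, hypothetical sketch of that gating (parameter names invented for illustration; the actual wiring lives in the agent bootstrap code):

```go
package example

// maybeScheduleSync sketches the condition described above: the synchronization
// (and hence the new stale-backend cleanup) only runs when state restoration is
// enabled and a Kubernetes client is available, so the standalone LB case, where
// services are programmed through the Cilium API, never reaches it.
func maybeScheduleSync(restoreState, clientsetEnabled bool, sync func() error) {
	if !restoreState || !clientsetEnabled {
		return // standalone LB or restore disabled: nothing to clean up
	}
	// In the agent this runs asynchronously through a controller.
	go func() { _ = sync() }()
}
```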

Do you also want to mention the ipv{4,6} case that Martynas linked above in the commit description?

Sure, the updated commit message mentions that the issue can also occur in the case of dual-stack services, and links to Martynas's issue.

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jul 24, 2023
@giorio94 giorio94 removed the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jul 24, 2023
@giorio94
Member Author

Further refactoring and related tests are tracked by #27012. Marking as ready to merge.

@giorio94 giorio94 added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jul 24, 2023
@aanm aanm merged commit fe4dda7 into cilium:main Jul 24, 2023
65 checks passed
@nbusseneau nbusseneau mentioned this pull request Jul 24, 2023
@nbusseneau nbusseneau added backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. and removed needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch labels Jul 24, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from main to Backport pending to v1.13 in 1.13.5 Jul 24, 2023
@nbusseneau nbusseneau mentioned this pull request Jul 24, 2023
@nbusseneau nbusseneau added backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. and removed needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. labels Jul 24, 2023
@aanm aanm moved this from Needs backport from main to Backport done to v1.14 in 1.14.0 Jul 26, 2023
@gentoo-root gentoo-root added backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. and removed backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. labels Jul 26, 2023
@gentoo-root gentoo-root moved this from Backport pending to v1.13 to Backport done to v1.13 in 1.13.5 Jul 26, 2023
@giorio94 giorio94 deleted the mio/service-obsolete-backends branch August 16, 2023 07:15