
services: prevent temporary connectivity loss on agent restart #26912

Merged
aanm merged 1 commit into cilium:main from mio/service-obsolete-backends on Jul 24, 2023

Conversation

giorio94
Member

@giorio94 giorio94 commented Jul 19, 2023

Cilium already implements a restore path to prevent dropping existing connections on agent restart. Yet, there is currently an issue which causes the removal of valid backends from a service when receiving incomplete service updates, either because the backends are spread across multiple endpointslices or because some belong to remote clusters. Indeed, all previously known backends get replaced with the ones we just heard about (i.e., those present in the service cache event), possibly causing connectivity disruptions.

Let's prevent this behavior by keeping a list of restored backends for each service, and continuing to merge them with the ones we receive updates for, until the bootstrap phase completes. After synchronization, an update is triggered for each service still associated with stale backends, so that they can be removed.
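As an illustration only, here is a minimal sketch of the merging idea with hypothetical types and names (not the actual Cilium code): backends restored from the BPF maps are kept per service and re-added to any partial update received before synchronization completes.

```go
package example

// Backend is a simplified, hypothetical stand-in for Cilium's backend type.
type Backend struct {
	Addr string // e.g. "10.0.0.1:80"
}

// mergeWithRestored keeps every backend restored from the BPF maps that a
// (possibly partial) update did not mention, so that connections towards it
// are not dropped before the bootstrap phase completes. After synchronization,
// any leftover restored backends can finally be removed.
func mergeWithRestored(restored, updated []Backend) []Backend {
	seen := make(map[string]struct{}, len(updated))
	merged := append([]Backend(nil), updated...)
	for _, b := range updated {
		seen[b.Addr] = struct{}{}
	}
	for _, b := range restored {
		if _, ok := seen[b.Addr]; !ok {
			merged = append(merged, b)
		}
	}
	return merged
}
```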

Fixes: #23823
Fixes: #26944

Fix possible connection drops on agent restart when a service is associated with multiple endpointslices or has backends spread across multiple clusters

@giorio94 giorio94 added kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. area/clustermesh Relates to multi-cluster routing functionality in Cilium. needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch area/loadbalancing Impacts load-balancing and Kubernetes service implementations needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch labels Jul 19, 2023
@giorio94 giorio94 requested review from a team as code owners July 19, 2023 07:29
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.14.0 Jul 19, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.13.5 Jul 19, 2023
@giorio94
Member Author

/test

@giorio94 giorio94 force-pushed the mio/service-obsolete-backends branch from f4e0c20 to 3c7f68b on July 20, 2023 07:30
@giorio94
Member Author

/test

@aanm aanm added affects/v1.13 This issue affects v1.13 branch affects/v1.14 This issue affects v1.14 branch labels Jul 20, 2023
@brb
Member

brb commented Jul 20, 2023

This PR resolves #26944.

@joestringer joestringer added the release-blocker/1.14 This issue will prevent the release of the next version of Cilium. label Jul 20, 2023
Member

@christarazi christarazi left a comment

LGTM, good find!

Minor comments below.

pkg/service/service.go (outdated review thread, resolved)
@@ -1072,7 +1077,27 @@ func (s *Service) restoreAndDeleteOrphanSourceRanges() error {
//
// The removal is based on an assumption that during the sync period
// UpsertService() is going to be called for each alive service.
func (s *Service) SyncWithK8sFinished() error {
func (s *Service) SyncWithK8sFinished(ensurer func(k8s.ServiceID, *lock.StoppableWaitGroup) bool) error {
Member

Did you consider making this function return the set of services and then calling ensurer on the result of SyncWithK8sFinished instead? That might make it easier to understand what ensurer is for. Anonymous functions can make the code a bit more convenient to write, but they hinder readability. This is a minor comment though, because there's only one call site of SyncWithK8sFinished, so I'm mostly just wondering what you've considered.

Member Author

I personally don't have a strong opinion here. IMO the current approach has the advantage that the entire synchronization logic is self-contained in a single function, which is then easier to test. Another possibility might be to propagate the entire service cache (rather than the single function) as a parameter (but that would make testing harder), or to replace the anonymous function with an interface (I initially discarded this option as being more verbose, but it might improve clarity). WDYT?

Member

Given that, in an offline conversation about this function, we talked about it being executed in a controller although it's one-off, maybe we just defer the refactoring of this function anyway (into a hive job or something). I'm fine with it as is.
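For illustration, a hedged sketch of the two shapes discussed above, with hypothetical simplified types (the real signature takes a k8s.ServiceID and a *lock.StoppableWaitGroup, and Service here is only a stand-in for Cilium's service manager):

```go
package example

// Hypothetical, simplified stand-ins for the real types.
type ServiceID string

type Service struct {
	// Services whose restored backends were never confirmed by an update.
	staleServices map[ServiceID]struct{}
}

// Shape used in this PR: the caller injects an "ensurer" callback, which
// re-triggers an update for every service still associated with stale
// restored backends, so that they can finally be dropped.
func (s *Service) SyncWithK8sFinished(ensurer func(ServiceID) bool) error {
	for id := range s.staleServices {
		ensurer(id)
	}
	return nil
}

// Alternative discussed in the review: return the set of stale services and
// let the caller trigger the refresh itself.
func (s *Service) syncWithK8sFinishedAlt() ([]ServiceID, error) {
	stale := make([]ServiceID, 0, len(s.staleServices))
	for id := range s.staleServices {
		stale = append(stale, id)
	}
	return stale, nil
}
```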

Member

@aditighag aditighag left a comment

Nice find! Fix looks good to me overall with some requested changes.

Do we want this logic to execute in case of standalone LB?

removal of valid backends from a service when receiving incomplete service updates, either because backends are spread across multiple endpointslices or some belong to remote clusters.

I wonder how this issue went unnoticed until now. Is there a scale aspect to it, or did something change on the k8s side?
I'm not too familiar with clustermesh, so this might be a newbie question. But does clustermesh run multiple instances of a k8s api server? If not, why would it matter that backends belong to remote clusters?

pkg/service/service.go (outdated review thread, resolved)
pkg/service/service.go (outdated review thread, resolved)
@@ -611,7 +615,7 @@ func (m *ManagerTestSuite) TestSyncWithK8sFinished(c *C) {

// cilium-agent finished the initialization, and thus SyncWithK8sFinished
// is called
err = m.svc.SyncWithK8sFinished()
err = m.svc.SyncWithK8sFinished(func(k8s.ServiceID, *lock.StoppableWaitGroup) bool { return true })
Member

SyncWithK8sFinished is called conditionally (see initRestore). Can we add one more test with the restore flag disabled, just to make sure we don't have any oops moments?

Member Author

I've made a few attempts here, but I couldn't find any good way to test this condition without a large refactoring. The main issue is that the synchronization logic is triggered asynchronously through a controller, which makes it very difficult to wait for its termination. Given that the initRestore function is already untested (hence there's no check that the other operations are also skipped when RestoreState is unset) and the condition is pretty trivial, I'd personally skip the implementation of this test for the moment.

Member

It isn't so much about testing initRestore, as about making sure there are no deadlocks, and that we don't assume syncWithK8sFinished is always invoked (e.g., standalone LB).

Member Author

Yeah, that logic is already triggered only when RestoreState is set and the clientset is enabled (which excludes the standalone LB case). It is currently tricky to further test these aspects, because the syncWithK8sFinished function (which by itself is covered by unit tests) is run asynchronously.

I'd propose to defer the introduction of additional tests to a subsequent refactoring which extracts this logic to be executed by a hive job, as mentioned by Chris. I'd not include the refactoring in this PR though, to reduce churn in older versions and because it would complicate backporting, given that support for hive jobs has only been introduced recently.

Member

@ysksuzuki ysksuzuki left a comment

The scenario in which the problem occurs is as follows, and this PR fixes it by delaying the deletion of the restored backends until the synchronization with kube-apiserver is complete. Is my understanding correct?

  • Restore services and backends from the BPF map
  • Sync with kube-apiserver and reflect the actual state
    • If the k8sServiceHandler receives an incomplete event like the following (event1 is incomplete), a valid backend is removed prematurely
      • event1{service1: backend1} <- a valid backend2 is removed (the informer cache sync is delayed, perhaps?)
      • event2{service1: backend1, backend2} <- the removed backend2 is back

Is this scenario, in which the handler receives an incomplete event, likely to happen when the backends are spread across multiple endpointslices or some belong to remote clusters?

pkg/service/service.go (outdated review thread, resolved)
@brb
Member

brb commented Jul 21, 2023

Do we want this logic to execute in case of standalone LB?

Today, the standalone LB mode doesn't assume any connectivity to kube-apiserver. Services are programmed through Cilium API.

@giorio94
Member Author

I wonder how this issue went unnoticed until now. Is there a scale aspect to it, or did something change on the k8s side?

Yes, in the sense that this issue doesn't occur when all the backends are contained in a single endpointslice (given that we then see only a single atomic update). By default, an endpointslice contains at most 100 endpoints, hence Cilium may currently remove valid backends only when a service is associated with more than 100 pods (and, depending on event ordering, this might still not occur).

I'm not too familiar with clustermesh, so this might be a newbie question. But does clustermesh run multiple instances of a k8s api server? If not, why would it matter that backends belong to remote clusters?

Essentially, each cluster exposes an etcd cluster and synchronizes into it a subset of the information available as Kubernetes resources, including the list of services which are marked as shared, together with the associated backends. Remote agents connect to these etcd clusters to pull the various objects and merge them with the local state. In the services case, the remote backends get merged into the service cache as if they were just a different endpointslice. I personally discovered the issue in this scenario, as a single backend is enough to trigger the connectivity drop (as opposed to the local case).

@brb
Member

brb commented Jul 21, 2023

I wonder how this issue went unnoticed until now. Is there a scale aspect to it, or did something change on the k8s side?

How I hit this: add DualStack services, and then do a Cilium upgrade.

@giorio94
Member Author

The scenario in which the problem occurs is as follows, and this PR fixes it by delaying the deletion of the restored backends until the synchronization with kube-apiserver is complete. Is my understanding correct?

* Restore services and backends from the BPF map

* Sync with kube-apiserver and reflect the actual state

  * If the k8sServiceHandler receives an incomplete event like the following (event1 is incomplete), a valid backend is removed prematurely

    * event1{service1: backend1} <- a valid backend2 is removed (the informer cache sync is delayed, perhaps?)
    * event2{service1: backend1, backend2} <- the removed backend2 is back

Exactly. Essentially, consider the case in which a given service foo is associated with the foo-1 and foo-2 epslices, each containing a set of backends.

As you mentioned, upon restart Cilium restores services and backends from the BPF map. Then it starts the service and epslice informers, both of which propagate events to the service cache. Let's say that we first receive the event about the service: at this point the service is considered not ready (as we have not yet seen any epslice), and nothing gets propagated. Then we receive the event for the foo-1 epslice; the service cache processes it and, given that the service now has backends, propagates an event down to the service subsystem for the service foo, including all the backends that are part of foo-1. At this point, all previously known backends get replaced by the new ones in the BPF maps, dropping the connections targeting the backends that were part of the foo-2 epslice. Once an event for that epslice is also seen, the backends are merged and restored.

The only case in which this issue doesn't happen is when we first receive both the foo-1 and foo-2 events from Kubernetes, and only afterwards the one for the service itself.

The modification implemented in this PR ensures that, in the service subsystem, we keep preserving the backends restored from the BPF maps even once we receive the first service update, given that at that point some backends might still be missing.

Is this scenario, in which the handler receives an incomplete event, likely to happen when the backends are spread across multiple endpointslices or some belong to remote clusters?

Yes. It doesn't happen in the case of a single endpointslice, because all backends would be part of a single event atomically received from the k8s informer. And the clustermesh case essentially corresponds to having another epslice for each remote cluster.
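To make the foo-1 / foo-2 example concrete, here is a self-contained, hypothetical simulation of the ordering described above (not Cilium code):

```go
package main

import "fmt"

// union returns a deduplicated union of two backend lists.
func union(a, b []string) []string {
	seen := map[string]struct{}{}
	var out []string
	for _, s := range append(append([]string{}, a...), b...) {
		if _, ok := seen[s]; !ok {
			seen[s] = struct{}{}
			out = append(out, s)
		}
	}
	return out
}

func main() {
	// Backends of service foo restored from the BPF maps on agent restart:
	// one from the foo-1 endpointslice, one from foo-2.
	restored := []string{"10.0.0.1:80", "10.0.0.2:80"}

	// First (partial) update after restart: only the foo-1 epslice has been seen.
	fromFoo1 := []string{"10.0.0.1:80"}

	// Without the fix, the datapath would be reprogrammed with fromFoo1 only,
	// dropping connections towards 10.0.0.2:80 until the foo-2 event arrives.
	// With the fix, restored backends keep being merged in until sync completes.
	fmt.Println(union(fromFoo1, restored)) // [10.0.0.1:80 10.0.0.2:80]
}
```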

@giorio94
Member Author

giorio94 commented Jul 21, 2023

How I hit this: add DualStack services, and then do a Cilium upgrade.

TIL that endpointslices are created per address family. Hence, if the service is dual-stack, we'll have two separate endpointslices (regardless of whether the pods are dual-stack or not), causing the same issue.

@giorio94 giorio94 force-pushed the mio/service-obsolete-backends branch from 3c7f68b to 3ccc5b6 on July 21, 2023 09:18
@giorio94
Member Author

/test

@giorio94 giorio94 requested a review from ysksuzuki July 21, 2023 09:20
@giorio94
Member Author

/test

@giorio94 giorio94 added affects/v1.11 This issue affects v1.11 branch affects/v1.12 This issue affects v1.12 branch labels Jul 21, 2023
@giorio94
Member Author

I've double-checked, and this issue additionally affects v1.11 (hence also v1.12).

Member

@aditighag aditighag left a comment

Do we want this logic to execute in case of standalone LB?

Today, the standalone LB mode doesn't assume any connectivity to kube-apiserver. Services are programmed through Cilium API.

Right, so should this logic be skipped for the standalone LB case?

Do you also want to mention the ipv{4,6} case that Martynas linked above in the commit description?

Approving the change, as I'll be on PTO next week. Please check on unit testing the standalone case where syncWithK8sFinished is a no-op. Thanks!

@maintainer-s-little-helper maintainer-s-little-helper bot added ready-to-merge This PR has passed all tests and received consensus from code owners to merge. labels Jul 21, 2023
@aditighag aditighag removed the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jul 21, 2023
@giorio94
Member Author

Do we want this logic to execute in case of standalone LB?

Today, the standalone LB mode doesn't assume any connectivity to kube-apiserver. Services are programmed through Cilium API.

Right, so should this logic be skipped for the standalone LB case?

AFAIU, this logic is skipped when running in standalone LB mode, because the k8s client is not enabled in that case, which is a requirement for the k8s-to-LB-maps synchronization. More in general, the synchronization logic was already present, and the changes in this PR only extend it with an additional cleanup step; they do not modify how it is triggered.
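As a rough, hypothetical sketch of that gating (parameter names invented for illustration; the actual wiring lives in the agent bootstrap code):

```go
package example

// maybeScheduleSync sketches the condition described above: the synchronization
// (and hence the new stale-backend cleanup) only runs when state restoration is
// enabled and a Kubernetes client is available, so the standalone LB case, where
// services are programmed through the Cilium API, never reaches it.
func maybeScheduleSync(restoreState, clientsetEnabled bool, sync func() error) {
	if !restoreState || !clientsetEnabled {
		return // standalone LB or restore disabled: nothing to clean up
	}
	// In the agent this runs asynchronously through a controller.
	go func() { _ = sync() }()
}
```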

Do you also want to mention the ipv{4,6} case that Martynas linked above in the commit description?

Sure, the updated commit message mentions that the issue can also occur in the case of dual-stack services, and links to Martynas's issue.

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jul 24, 2023
@giorio94 giorio94 removed the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jul 24, 2023
@giorio94
Member Author

Further refactoring and related tests are tracked by #27012. Marking as ready to merge.

@giorio94 giorio94 added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jul 24, 2023
@aanm aanm merged commit fe4dda7 into cilium:main Jul 24, 2023
65 checks passed
@nbusseneau nbusseneau mentioned this pull request Jul 24, 2023
@nbusseneau nbusseneau added backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. and removed needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch labels Jul 24, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from main to Backport pending to v1.13 in 1.13.5 Jul 24, 2023
@nbusseneau nbusseneau mentioned this pull request Jul 24, 2023
@nbusseneau nbusseneau added backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. and removed needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. labels Jul 24, 2023
@aanm aanm moved this from Needs backport from main to Backport done to v1.14 in 1.14.0 Jul 26, 2023
@gentoo-root gentoo-root added backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. and removed backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. labels Jul 26, 2023
@gentoo-root gentoo-root moved this from Backport pending to v1.13 to Backport done to v1.13 in 1.13.5 Jul 26, 2023
@giorio94 giorio94 deleted the mio/service-obsolete-backends branch August 16, 2023 07:15