
Fix possible cross-cluster connection drops on agents restart when clustermesh is enabled #27611

Merged

Conversation


@giorio94 giorio94 commented Aug 21, 2023

Backported manually, due to a few conflicts and to additionally test it through #27232.

Successful clustermesh upgrade/downgrade run: https://github.com/cilium/cilium/actions/runs/6035564385/job/16376157031

@maintainer-s-little-helper maintainer-s-little-helper bot added backport/1.14 This PR represents a backport for Cilium 1.14.x of a PR that was merged to main. kind/backports This PR provides functionality previously merged into master. labels Aug 21, 2023
@giorio94 giorio94 force-pushed the mio/v1.14/fix-clustermesh-drops-on-restart branch from 5242b41 to d890bb9 on August 23, 2023 08:42
@giorio94

/test-backport-1.14

@giorio94 giorio94 force-pushed the mio/v1.14/fix-clustermesh-drops-on-restart branch from d890bb9 to 11a89b8 on August 31, 2023 09:08
@giorio94

/test-backport-1.14

@giorio94 giorio94 marked this pull request as ready for review August 31, 2023 10:20
@giorio94 giorio94 requested a review from a team as a code owner August 31, 2023 10:20
@michi-covalent michi-covalent added the dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. label Sep 8, 2023
[ upstream commit a7461fd ]

[ Backporter's notes: small conflicts due to the store factory not being
  present in v1.14. ]

Propagate extra options from the IPIdentityWatcher to the underlying
WatchStore, to allow configuring additional callback handlers executed
after the first synchronization, or registering a metric tracking the
number of elements.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
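
For context, extra options like these typically follow Go's functional-options pattern; below is a minimal sketch of that shape. All names (package example, watchStoreOptions, WSOption, WithOnSyncCallback, WithEntriesMetric) are illustrative assumptions, not the actual Cilium API.

```go
package example

import (
	"context"

	"github.com/prometheus/client_golang/prometheus"
)

// watchStoreOptions collects the optional knobs of a WatchStore
// (illustrative names, not necessarily the upstream ones).
type watchStoreOptions struct {
	// onSyncCallbacks are executed once, after the initial listing
	// from the kvstore has been received.
	onSyncCallbacks []func(ctx context.Context)
	// entriesMetric, if set, tracks the number of stored elements.
	entriesMetric prometheus.Gauge
}

// WSOption configures a WatchStore.
type WSOption func(*watchStoreOptions)

// WithOnSyncCallback registers an extra handler executed after the
// first synchronization has completed.
func WithOnSyncCallback(cb func(ctx context.Context)) WSOption {
	return func(o *watchStoreOptions) {
		o.onSyncCallbacks = append(o.onSyncCallbacks, cb)
	}
}

// WithEntriesMetric registers a gauge tracking the number of elements.
func WithEntriesMetric(metric prometheus.Gauge) WSOption {
	return func(o *watchStoreOptions) { o.entriesMetric = metric }
}
```
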
[ upstream commit 8672bd5 ]

Add the possibility to specify an additional callback handler executed
when the initial list of identities has been retrieved from a given
remote kvstore.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
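
In practice, such a callback mostly serves to signal other subsystems that a given remote cluster has been listed; a tiny self-contained sketch of that consumption pattern follows (the wiring to the actual remote identity watcher is omitted, and all names are illustrative).

```go
package main

import (
	"context"
	"fmt"
)

func main() {
	// Closed once the initial identities listing from the remote
	// kvstore has been received.
	identitiesSynced := make(chan struct{})

	// The callback that would be registered on the remote identity
	// watcher; it is invoked exactly once after the initial listing.
	onSync := func(ctx context.Context) { close(identitiesSynced) }

	// Invoked directly here purely for illustration purposes.
	onSync(context.Background())

	<-identitiesSynced
	fmt.Println("identities from the remote cluster are synchronized")
}
```
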
[ upstream commit 39beeb4 ]

[ Backporter's notes: small conflicts due to the store factory and
  the extra ipcache options injection logic not being present in v1.14,
  and the different name of the clustermesh/common package (was
  clustermesh/internal). ]

Generalize the logic currently used to wait until the initial list of
shared services has been received from all remote clusters and
synchronized with the BPF datapath, to prepare for the subsequent
introduction of similar methods for other resources.

While at it, let's also rename the function to clarify its scope (as it
only waits for the synchronization of services), and fix a possible
deadlock in case a remote cluster is disconnected before the
synchronization has completed.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
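
A sketch of how such a generalized wait can avoid that deadlock: each per-cluster, per-resource signal is unblocked both on completion and on cluster disconnection, and the wait is additionally bounded by the caller's context. This is an illustrative shape only; the type and method names are not the upstream ones.

```go
package example

import (
	"context"
	"sync"
)

// synced tracks whether the initial listing of one resource has been
// received from a given remote cluster. Waiters are also unblocked when
// the cluster is disconnected, so that a cluster removed before
// completing the synchronization can no longer cause a deadlock.
type synced struct {
	done     chan struct{} // closed once the initial listing is complete
	stopped  chan struct{} // closed when the remote cluster is disconnected
	doneOnce sync.Once
	stopOnce sync.Once
}

func newSynced() *synced {
	return &synced{done: make(chan struct{}), stopped: make(chan struct{})}
}

// Done marks the resource as synchronized.
func (s *synced) Done() { s.doneOnce.Do(func() { close(s.done) }) }

// Stop marks the remote cluster as disconnected.
func (s *synced) Stop() { s.stopOnce.Do(func() { close(s.stopped) }) }

// Wait blocks until the resource is synchronized, the cluster is
// disconnected, or the context expires.
func (s *synced) Wait(ctx context.Context) error {
	select {
	case <-s.done:
		return nil
	case <-s.stopped:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}
```
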
[ upstream commit 6c4e693 ]

[ Backporter's note: small conflicts due to the store factory and the
  extra ipcache options injection logic not being present in v1.14. ]

Add a new IPIdentitiesSynced method to the clustermesh object to allow
waiting until the initial list of ipcache entries and identities has
been received from all known remote clusters. This mimics the
ServicesSynced() method, and enables postponing disruptive operations
to prevent dropping valid long-running connections on agent restart.

Differently from the services case, we don't need to separately track
the propagation of the individual entries to the datapath, as that is
performed synchronously.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
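
Continuing the illustrative package from the previous sketch, the new method essentially iterates over all known remote clusters and waits on both the ipcache and the identity signals. Type and field names are again assumptions, not the exact upstream code.

```go
// remoteCluster collects the per-resource synchronization signals of one
// remote cluster (illustrative field names).
type remoteCluster struct {
	ipcacheSynced    *synced
	identitiesSynced *synced
	nodesSynced      *synced
}

// ClusterMesh aggregates all currently known remote clusters.
type ClusterMesh struct {
	clusters map[string]*remoteCluster
}

// IPIdentitiesSynced blocks until the initial list of ipcache entries and
// identities has been received from every known remote cluster, a cluster
// is disconnected, or the context expires.
func (cm *ClusterMesh) IPIdentitiesSynced(ctx context.Context) error {
	for _, rc := range cm.clusters {
		if err := rc.ipcacheSynced.Wait(ctx); err != nil {
			return err
		}
		if err := rc.identitiesSynced.Wait(ctx); err != nil {
			return err
		}
	}
	return nil
}
```
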
[ upstream commit ae3a48a ]

Add a new NodesSynced method to the clustermesh object to allow
waiting until the initial list of nodes has been received from all
known remote clusters. This mimics the ServicesSynced() method, and
enables postponing disruptive operations to prevent dropping valid
long-running connections on agent restart.

Differently from the services case, we don't need to separately track
the propagation of the individual entries to the datapath, as that is
performed synchronously.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
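
The node counterpart follows exactly the same shape, again building on the illustrative types sketched above.

```go
// NodesSynced blocks until the initial list of nodes has been received
// from every known remote cluster, a cluster is disconnected, or the
// context expires.
func (cm *ClusterMesh) NodesSynced(ctx context.Context) error {
	for _, rc := range cm.clusters {
		if err := rc.nodesSynced.Wait(ctx); err != nil {
			return err
		}
	}
	return nil
}
```
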
[ upstream commit 8de7707 ]

[ Backporter's notes: small conflicts due to the different fields in
  the daemonParams struct, and minor context differences in state.go. ]

A new ipcache map is recreated from scratch and populated by each cilium
agent upon restart, and the existing endpoints are switched to use it
during the regeneration process of the endpoint itself.

Before performing this operation, the new ipcache map needs to have
already been fully synchronized, otherwise we might drop long running
connections. Currently we wait for synchronization from k8s (when in CRD
mode) and the kvstore (when in kvstore mode). Yet, when clustermesh is
enabled, we don't wait for synchronization from all the remote clusters,
possibly disrupting existing cross-cluster connections. The same also
applies to identities in case network policies targeting the given
endpoint are present.

This commit introduces additional logic to wait for ipcache and
identities synchronization from remote clusters before triggering
endpoint regeneration. The wait period is bounded by a user-configurable
timeout, defaulting to one minute. The idea is that we don't block the
local endpoints regeneration for an extended period of time if a given
remote cluster is not ready. This also prevents possible interdependency
problems, with the regeneration in one cluster being required to allow
other clusters to successfully connect to it.

Nonetheless, the timeout can be configured through a dedicated
--clustermesh-ip-identities-sync-timeout flag, to select the desired
trade-off between a possible endpoint regeneration delay and the
likelihood of connection drops if a given remote cluster is not ready
(e.g., because the corresponding clustermesh-apiserver is restarting).

A special case is represented by the host endpoint when IPSec is
enabled, because it is currently restored synchronously to ensure
ordering. Given that blocking this operation prevents the agent from
becoming ready, and in turn new pods from being scheduled, we forcefully
cap the wait timeout to 5 seconds. Empirically, this value has proved
sufficient to retrieve a large number of entries if the remote kvstore
is ready (e.g., the clustermesh-apiserver pod is running and already
synchronized with Kubernetes).

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
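
The bounded wait boils down to wrapping the synchronization call in a context with the configured timeout, plus a tighter cap on the synchronous IPSec host-endpoint path. A sketch of that shape follows; everything except the --clustermesh-ip-identities-sync-timeout flag name and the one-minute/five-second values is illustrative, not the actual daemon code.

```go
package example

import (
	"context"
	"log"
	"time"
)

const (
	// Default of the --clustermesh-ip-identities-sync-timeout flag.
	defaultIPIdentitiesSyncTimeout = 1 * time.Minute
	// Cap applied on the synchronous IPSec host-endpoint restoration
	// path, so that agent readiness is not delayed for too long.
	hostEndpointSyncCap = 5 * time.Second
)

// ipIdentitiesSyncer abstracts the clustermesh subsystem (illustrative).
type ipIdentitiesSyncer interface {
	IPIdentitiesSynced(ctx context.Context) error
}

// waitForClusterMeshIPIdentities waits at most `timeout` for ipcache and
// identity synchronization from all remote clusters; on timeout it only
// logs, so endpoint regeneration is delayed but never blocked forever.
func waitForClusterMeshIPIdentities(ctx context.Context, cm ipIdentitiesSyncer, timeout time.Duration) {
	wctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	if err := cm.IPIdentitiesSynced(wctx); err != nil {
		log.Printf("proceeding with endpoint regeneration although clustermesh synchronization is incomplete: %v", err)
	}
}

// capTimeout returns the smaller of the configured timeout and the given
// limit, used for the IPSec host endpoint special case.
func capTimeout(timeout, limit time.Duration) time.Duration {
	if timeout > limit {
		return limit
	}
	return timeout
}
```

For the host endpoint with IPSec enabled, the same wait would be performed with capTimeout(configuredTimeout, hostEndpointSyncCap), keeping the agent readiness path bounded to a few seconds.
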
[ upstream commit 6a8f5ab ]

[ Backporter's notes: small conflict due to different capitalization
  of the WireGuard term in a log message. ]

Currently, the WireGuard subsystem triggers the deletion of obsolete
peers when the daemon restoration process has completed. Yet, at that
point the information about nodes belonging to remote clusters might
not have been received yet, disrupting existing cross-cluster
connections. Hence, let's first wait for the synchronization of all
remote nodes before kicking off the removal process.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
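
The resulting ordering on the WireGuard side can be pictured as follows, reusing the NodesSynced method sketched earlier (an illustrative outline, not the actual WireGuard agent code).

```go
package example

import (
	"context"
	"log"
)

// nodesSyncer abstracts the clustermesh subsystem; in the earlier
// sketches this corresponds to the ClusterMesh.NodesSynced method.
type nodesSyncer interface {
	NodesSynced(ctx context.Context) error
}

// onRestoreCompleted mirrors the ordering introduced here: once the
// daemon restoration has finished, first wait for the nodes of all
// remote clusters, and only then remove obsolete WireGuard peers.
func onRestoreCompleted(ctx context.Context, cm nodesSyncer, deleteObsoletePeers func()) {
	if cm != nil {
		if err := cm.NodesSynced(ctx); err != nil {
			// Proceed anyway: a slow or disconnected remote cluster must
			// not block local cleanup forever, and the wait is bounded by
			// the caller's context.
			log.Printf("wireguard: removing stale peers although node synchronization is incomplete: %v", err)
		}
	}
	deleteObsoletePeers()
}
```
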
@giorio94 giorio94 force-pushed the mio/v1.14/fix-clustermesh-drops-on-restart branch from 11a89b8 to 6fd082d on September 18, 2023 11:50
@giorio94

/test-backport-1.14

@giorio94 giorio94 removed the dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. label Sep 18, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Sep 18, 2023
@ldelossa ldelossa merged commit 31060c9 into cilium:v1.14 Sep 19, 2023
57 checks passed
@giorio94 giorio94 deleted the mio/v1.14/fix-clustermesh-drops-on-restart branch September 28, 2023 12:43