cilium status: additionally check for endpoints readiness #2298

giorio94 · 2024-02-12T15:28:26Z

Introduce an additional step to the cilium status command to check
(and optionally wait) for endpoints readiness. This check is
particularly relevant in CI to ensure that the datapath of already
existing endpoints has been fully updated (i.e., restoration and
regeneration phases have completed) after agent restart, before
proceeding with the subsequent tests. Otherwise, we may continue
to incorrectly exercise the old datapath and miss possible issues,
especially concerning dropped connections and missed tail calls.

Introduce an additional step to the cilium status command to check (and optionally wait) for endpoints readiness. This check is particularly relevant in CI to ensure that the datapath of already existing endpoints has been fully updated (i.e., restoration and regeneration phases have completed) after agent restart, before proceeding with the subsequent tests. Otherwise, we may continue to incorrectly exercise the old datapath and miss possible issues, especially concerning dropped connections and missed tail calls. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>

k8s/client.go

status/status.go

mhofstetter

Thanks @giorio94!

I have some concerns introducing a special treatment for the "endpoints" whereas other subsystems are handled via the cilium-dbg status. See my comments inline.

mhofstetter · 2024-02-16T11:49:57Z

clustermesh/clustermesh.go

@@ -445,6 +445,7 @@ type k8sClusterMeshImplementation interface {
 	ListPods(ctx context.Context, namespace string, options metav1.ListOptions) (*corev1.PodList, error)
 	AutodetectFlavor(ctx context.Context) k8s.Flavor
 	CiliumStatus(ctx context.Context, namespace, pod string) (*models.StatusResponse, error)
+	CiliumDbgEndpoints(ctx context.Context, namespace, pod string) ([]*models.Endpoint, error)


IMO the name CiliumDbgEndpoints isn't ideal as there's no context what the dbg stands for . Mainly due to the fact that we're still using the binary cilium instead of cilium-dbg for backwards compatibility. Nevertheless i get the point that it mainly to differentiate it from the rest of the more k8s related API. But we don't do it for CiliumStatus either.

In general i don't get why the "k8s client" also serves the purpose of providing information that are gathered via cilium-dbg (status & now endpoints). Even though kubectl exec is used, this should be in it's own struct (that depends on the k8s client to execute ExecInPod)

But we can leave this for an upcoming PR.

IMO the name CiliumDbgEndpoints isn't ideal as there's no context what the dbg stands for

Yeah, I agree it is suboptimal. I couldn't find a much better name honestly though, to make it distinguishable from ListCiliumEndpoints which retrieves the CRD resources. One of my initial attempts was CiliumStatusEndpoints, but that doesn't look great either (especially because it is not status specific).

In general i don't get why the "k8s client" also serves the purpose of providing information that are gathered via cilium-dbg

Totally agree. I also don't get the point of these huge interfaces with the different methods just aliasing the k8s client functionalities. Even for unit testing (which we don't do that much for the CLI in any case) it is just way easier to use a fake client compared to implementing a mock of these interfaces. But that's how it is right now, and any changes seem out of scope (especially because CiliumStatus is already part of it).

mhofstetter · 2024-02-16T11:59:00Z

status/status.go

+		PodState:        PodStateMap{},
+		PodsCount:       PodsCount{},
+		CiliumStatus:    CiliumStatusMap{},
+		CiliumEndpoints: CiliumEndpointsMap{},


Until now, the Cilium CLI status checked for K8s resources and the Cilium status. Part of checking the Cilium status (gathered via cilium-dbg) is that the status of various subsystems gets checked.

Wouldn't it be possible to handle the endpoints the same way? by exposing this information via the cilium-dbg status (in its subsystem) and handle it that way? (Either this info is already exposed or we could move the current logic from CiliumDbgEndpoints & parseEndpointsResponse into the agent.) This would lead to fewer changes in the CIlium CLI by handling all modules the same way).

Have you already thought about that?

Yes, that has been my initial attempt, but unfortunately it appears that cilium status doesn't currently expose any information about the termination of endpoints regeneration. I was on the fence about whether to introduce that on the agent or implement the functionality in the CLI, and at the end I preferred the latter mainly for convenience, given that it works with any version. Extending the status instead would only apply to the latest Cilium version, or require to be backported to stable branches with the associated risks/complexity. I don't have a strong opinion though, and I'm open to switch to the other approach if there's consensus on it.

I raised similar concerns in another PR. Even though the main discussion was about using a structured output (JSON instead of text) - the aspect of having to implement and backport the functionality in the Cilium Agent was the same.

With the same argument that backporting of new functionality should be straightforward and a non-blocker. Especially if it helps to prevent flakes in CI on these stable branches too.

Even though slightly different, I though it's fair to raise this point here as well. what's the opinion of the maintainers? (@tklauser @michi-covalent )

I thought a bit more about this, and the main thing that to me feels weird about adding it to the general status is that we would have a separate Endpoint section with only the RestorationCompleted information, while all the rest of endpoint-related information can be retrieved through cilium endpoint list. Backporting such change would be also likely easy, but not necessarily trivial (in the sense of zero conflicts), as there have been a few changes over time in terms of how we track the completion of the regeneration of endpoints.

WDYT? Does it sound reasonable to you to keep the current CLI-only approach?

IMO, the one (having a status for a subsystem - even if it's just the first status-info of this subsystem (restoration completed)) should not exclude the other (retrieving detailed information from / interacting with the subsystem) 🤷

The status should provide an initial overview of the system with its subsystems (and outline potential issues)- and not replace other commands (that are more detailed) - that are used to further investigate potential issues.

In general I agree with the point raised by @mhofstetter but I think in this particular case the costs of changing cilium-dbg and backporting to all stable branches doesn't justify the cleaner implementation. I'd tend to say that it's fine to keep the current CLI-only approach for the sake of this PR and maybe consider changing to the cleaner approach as a follow-up.

giorio94 · 2024-02-23T08:23:03Z

@youngnick Gentle ping 🙏

giorio94 · 2024-02-26T08:27:57Z

CI is green, reviews covering all codeowners are in. Marking ready to merge.

giorio94 temporarily deployed to ci February 12, 2024 15:28 — with GitHub Actions Inactive

giorio94 force-pushed the mio/wait-for-endpoint-restoration branch from 2c3cc00 to 82b090d Compare February 12, 2024 15:34

giorio94 temporarily deployed to ci February 12, 2024 15:34 — with GitHub Actions Inactive

giorio94 force-pushed the mio/wait-for-endpoint-restoration branch from 82b090d to b9abd17 Compare February 13, 2024 09:01

giorio94 temporarily deployed to ci February 13, 2024 09:01 — with GitHub Actions Inactive

giorio94 changed the title ~~cilium status: additionally check for endpoint restoration~~ cilium status: additionally check for endpoints readiness Feb 13, 2024

giorio94 marked this pull request as ready for review February 13, 2024 10:12

giorio94 requested review from a team as code owners February 13, 2024 10:12

giorio94 requested review from youngnick, mhofstetter, gandro and YutaroHayakawa February 13, 2024 10:12

gandro approved these changes Feb 14, 2024

View reviewed changes

YutaroHayakawa reviewed Feb 16, 2024

View reviewed changes

k8s/client.go Show resolved Hide resolved

status/status.go Show resolved Hide resolved

YutaroHayakawa approved these changes Feb 16, 2024

View reviewed changes

mhofstetter reviewed Feb 16, 2024

View reviewed changes

mhofstetter approved these changes Feb 21, 2024

View reviewed changes

aanm approved these changes Feb 26, 2024

View reviewed changes

giorio94 added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Feb 26, 2024

sayboras merged commit 7f1e32e into cilium:main Feb 26, 2024
13 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cilium status: additionally check for endpoints readiness #2298

cilium status: additionally check for endpoints readiness #2298

giorio94 commented Feb 12, 2024 •

edited

mhofstetter left a comment

mhofstetter Feb 16, 2024

giorio94 Feb 16, 2024

mhofstetter Feb 16, 2024 •

edited

giorio94 Feb 16, 2024 •

edited

mhofstetter Feb 16, 2024 •

edited

giorio94 Feb 21, 2024

mhofstetter Feb 21, 2024

tklauser Feb 21, 2024

giorio94 commented Feb 23, 2024

giorio94 commented Feb 26, 2024

cilium status: additionally check for endpoints readiness #2298

cilium status: additionally check for endpoints readiness #2298

Conversation

giorio94 commented Feb 12, 2024 • edited

mhofstetter left a comment

Choose a reason for hiding this comment

mhofstetter Feb 16, 2024

Choose a reason for hiding this comment

giorio94 Feb 16, 2024

Choose a reason for hiding this comment

mhofstetter Feb 16, 2024 • edited

Choose a reason for hiding this comment

giorio94 Feb 16, 2024 • edited

Choose a reason for hiding this comment

mhofstetter Feb 16, 2024 • edited

Choose a reason for hiding this comment

giorio94 Feb 21, 2024

Choose a reason for hiding this comment

mhofstetter Feb 21, 2024

Choose a reason for hiding this comment

tklauser Feb 21, 2024

Choose a reason for hiding this comment

giorio94 commented Feb 23, 2024

giorio94 commented Feb 26, 2024

giorio94 commented Feb 12, 2024 •

edited

mhofstetter Feb 16, 2024 •

edited

giorio94 Feb 16, 2024 •

edited

mhofstetter Feb 16, 2024 •

edited