Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cilium status: additionally check for endpoints readiness #2298

Merged
merged 1 commit into from
Feb 26, 2024

Conversation

giorio94
Copy link
Member

@giorio94 giorio94 commented Feb 12, 2024

Introduce an additional step to the cilium status command to check
(and optionally wait) for endpoints readiness. This check is
particularly relevant in CI to ensure that the datapath of already
existing endpoints has been fully updated (i.e., restoration and
regeneration phases have completed) after agent restart, before
proceeding with the subsequent tests. Otherwise, we may continue
to incorrectly exercise the old datapath and miss possible issues,
especially concerning dropped connections and missed tail calls.

Introduce an additional step to the cilium status command to check
(and optionally wait) for endpoints readiness. This check is
particularly relevant in CI to ensure that the datapath of already
existing endpoints has been fully updated (i.e., restoration and
regeneration phases have completed) after agent restart, before
proceeding with the subsequent tests. Otherwise, we may continue
to incorrectly exercise the old datapath and miss possible issues,
especially concerning dropped connections and missed tail calls.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
@giorio94 giorio94 force-pushed the mio/wait-for-endpoint-restoration branch from 82b090d to b9abd17 Compare February 13, 2024 09:01
@giorio94 giorio94 changed the title cilium status: additionally check for endpoint restoration cilium status: additionally check for endpoints readiness Feb 13, 2024
@giorio94 giorio94 marked this pull request as ready for review February 13, 2024 10:12
@giorio94 giorio94 requested review from a team as code owners February 13, 2024 10:12
k8s/client.go Show resolved Hide resolved
status/status.go Show resolved Hide resolved
Copy link
Member

@mhofstetter mhofstetter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @giorio94!

I have some concerns introducing a special treatment for the "endpoints" whereas other subsystems are handled via the cilium-dbg status. See my comments inline.

@@ -445,6 +445,7 @@ type k8sClusterMeshImplementation interface {
ListPods(ctx context.Context, namespace string, options metav1.ListOptions) (*corev1.PodList, error)
AutodetectFlavor(ctx context.Context) k8s.Flavor
CiliumStatus(ctx context.Context, namespace, pod string) (*models.StatusResponse, error)
CiliumDbgEndpoints(ctx context.Context, namespace, pod string) ([]*models.Endpoint, error)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO the name CiliumDbgEndpoints isn't ideal as there's no context what the dbg stands for . Mainly due to the fact that we're still using the binary cilium instead of cilium-dbg for backwards compatibility. Nevertheless i get the point that it mainly to differentiate it from the rest of the more k8s related API. But we don't do it for CiliumStatus either.

In general i don't get why the "k8s client" also serves the purpose of providing information that are gathered via cilium-dbg (status & now endpoints). Even though kubectl exec is used, this should be in it's own struct (that depends on the k8s client to execute ExecInPod)

But we can leave this for an upcoming PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO the name CiliumDbgEndpoints isn't ideal as there's no context what the dbg stands for

Yeah, I agree it is suboptimal. I couldn't find a much better name honestly though, to make it distinguishable from ListCiliumEndpoints which retrieves the CRD resources. One of my initial attempts was CiliumStatusEndpoints, but that doesn't look great either (especially because it is not status specific).

In general i don't get why the "k8s client" also serves the purpose of providing information that are gathered via cilium-dbg

Totally agree. I also don't get the point of these huge interfaces with the different methods just aliasing the k8s client functionalities. Even for unit testing (which we don't do that much for the CLI in any case) it is just way easier to use a fake client compared to implementing a mock of these interfaces. But that's how it is right now, and any changes seem out of scope (especially because CiliumStatus is already part of it).

PodState: PodStateMap{},
PodsCount: PodsCount{},
CiliumStatus: CiliumStatusMap{},
CiliumEndpoints: CiliumEndpointsMap{},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Until now, the Cilium CLI status checked for K8s resources and the Cilium status. Part of checking the Cilium status (gathered via cilium-dbg) is that the status of various subsystems gets checked.

Wouldn't it be possible to handle the endpoints the same way? by exposing this information via the cilium-dbg status (in its subsystem) and handle it that way? (Either this info is already exposed or we could move the current logic from CiliumDbgEndpoints & parseEndpointsResponse into the agent.) This would lead to fewer changes in the CIlium CLI by handling all modules the same way).

Have you already thought about that?

Copy link
Member Author

@giorio94 giorio94 Feb 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that has been my initial attempt, but unfortunately it appears that cilium status doesn't currently expose any information about the termination of endpoints regeneration. I was on the fence about whether to introduce that on the agent or implement the functionality in the CLI, and at the end I preferred the latter mainly for convenience, given that it works with any version. Extending the status instead would only apply to the latest Cilium version, or require to be backported to stable branches with the associated risks/complexity. I don't have a strong opinion though, and I'm open to switch to the other approach if there's consensus on it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I raised similar concerns in another PR. Even though the main discussion was about using a structured output (JSON instead of text) - the aspect of having to implement and backport the functionality in the Cilium Agent was the same.

With the same argument that backporting of new functionality should be straightforward and a non-blocker. Especially if it helps to prevent flakes in CI on these stable branches too.

Even though slightly different, I though it's fair to raise this point here as well. what's the opinion of the maintainers? (@tklauser @michi-covalent )

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought a bit more about this, and the main thing that to me feels weird about adding it to the general status is that we would have a separate Endpoint section with only the RestorationCompleted information, while all the rest of endpoint-related information can be retrieved through cilium endpoint list. Backporting such change would be also likely easy, but not necessarily trivial (in the sense of zero conflicts), as there have been a few changes over time in terms of how we track the completion of the regeneration of endpoints.

WDYT? Does it sound reasonable to you to keep the current CLI-only approach?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, the one (having a status for a subsystem - even if it's just the first status-info of this subsystem (restoration completed)) should not exclude the other (retrieving detailed information from / interacting with the subsystem) 🤷

The status should provide an initial overview of the system with its subsystems (and outline potential issues)- and not replace other commands (that are more detailed) - that are used to further investigate potential issues.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general I agree with the point raised by @mhofstetter but I think in this particular case the costs of changing cilium-dbg and backporting to all stable branches doesn't justify the cleaner implementation. I'd tend to say that it's fine to keep the current CLI-only approach for the sake of this PR and maybe consider changing to the cleaner approach as a follow-up.

@giorio94
Copy link
Member Author

@youngnick Gentle ping 🙏

@giorio94
Copy link
Member Author

CI is green, reviews covering all codeowners are in. Marking ready to merge.

@giorio94 giorio94 added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Feb 26, 2024
@sayboras sayboras merged commit 7f1e32e into cilium:main Feb 26, 2024
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-to-merge This PR has passed all tests and received consensus from code owners to merge.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants