cilium status: additionally check for endpoints readiness #2298
Conversation
Introduce an additional step to the cilium status command to check (and optionally wait) for endpoints readiness. This check is particularly relevant in CI to ensure that the datapath of already existing endpoints has been fully updated (i.e., restoration and regeneration phases have completed) after agent restart, before proceeding with the subsequent tests. Otherwise, we may continue to incorrectly exercise the old datapath and miss possible issues, especially concerning dropped connections and missed tail calls. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
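For illustration, the readiness check described here essentially boils down to parsing the endpoint list JSON and verifying that every endpoint has reached the ready state. A minimal, self-contained sketch (the endpoint struct below mirrors only the fields of interest; the real CLI uses the generated models.Endpoint type from the Cilium API):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// endpoint is a minimal local mirror of the fields we care about in
// the `cilium endpoint list -o json` output; the actual CLI decodes
// into the generated models.Endpoint type instead.
type endpoint struct {
	ID     int64 `json:"id"`
	Status struct {
		State string `json:"state"`
	} `json:"status"`
}

// endpointsReady reports whether every endpoint has reached the
// "ready" state, i.e. restoration and regeneration have completed.
func endpointsReady(raw []byte) (bool, error) {
	var eps []endpoint
	if err := json.Unmarshal(raw, &eps); err != nil {
		return false, err
	}
	for _, ep := range eps {
		if ep.Status.State != "ready" {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	sample := []byte(`[
		{"id": 42, "status": {"state": "ready"}},
		{"id": 43, "status": {"state": "regenerating"}}
	]`)
	ok, err := endpointsReady(sample)
	fmt.Println(ok, err) // one endpoint still regenerating: not ready
}
```

This is only a sketch under the assumption that the JSON state field reads "ready" once regeneration has completed; the actual parseEndpointsResponse in the PR works on the typed API model.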
Thanks @giorio94!
I have some concerns about introducing special treatment for the "endpoints", whereas the other subsystems are handled via the cilium-dbg status. See my comments inline.
@@ -445,6 +445,7 @@ type k8sClusterMeshImplementation interface {
	ListPods(ctx context.Context, namespace string, options metav1.ListOptions) (*corev1.PodList, error)
	AutodetectFlavor(ctx context.Context) k8s.Flavor
	CiliumStatus(ctx context.Context, namespace, pod string) (*models.StatusResponse, error)
	CiliumDbgEndpoints(ctx context.Context, namespace, pod string) ([]*models.Endpoint, error)
IMO the name CiliumDbgEndpoints isn't ideal, as there's no context for what the dbg stands for, mainly due to the fact that we're still using the binary cilium instead of cilium-dbg for backwards compatibility. Nevertheless, I get the point that it is mainly there to differentiate it from the rest of the more k8s-related API. But we don't do that for CiliumStatus either.
In general, I don't get why the "k8s client" also serves the purpose of providing information that is gathered via cilium-dbg (status & now endpoints). Even though kubectl exec is used, this should be in its own struct (one that depends on the k8s client to execute ExecInPod).
But we can leave this for an upcoming PR.
IMO the name CiliumDbgEndpoints isn't ideal as there's no context what the dbg stands for
Yeah, I agree it is suboptimal. Honestly though, I couldn't find a much better name that makes it distinguishable from ListCiliumEndpoints, which retrieves the CRD resources. One of my initial attempts was CiliumStatusEndpoints, but that doesn't look great either (especially because it is not status-specific).
In general i don't get why the "k8s client" also serves the purpose of providing information that are gathered via cilium-dbg
Totally agree. I also don't get the point of these huge interfaces whose methods just alias the k8s client functionality. Even for unit testing (which we don't do much of for the CLI in any case), it is just way easier to use a fake client than to implement a mock of these interfaces. But that's how it is right now, and any changes seem out of scope (especially because CiliumStatus is already part of it).
	PodState:        PodStateMap{},
	PodsCount:       PodsCount{},
	CiliumStatus:    CiliumStatusMap{},
	CiliumEndpoints: CiliumEndpointsMap{},
Until now, the Cilium CLI status checked the K8s resources and the Cilium status. Part of checking the Cilium status (gathered via cilium-dbg) is that the status of the various subsystems gets checked.
Wouldn't it be possible to handle the endpoints the same way, by exposing this information via cilium-dbg status (in its own subsystem) and handling it there? (Either this info is already exposed, or we could move the current logic from CiliumDbgEndpoints & parseEndpointsResponse into the agent.) This would lead to fewer changes in the Cilium CLI by handling all modules the same way.
Have you already thought about that?
Yes, that was my initial attempt, but unfortunately cilium status doesn't currently expose any information about the completion of endpoint regeneration. I was on the fence about whether to introduce that in the agent or implement the functionality in the CLI, and in the end I preferred the latter, mainly for convenience, given that it works with any version. Extending the status instead would only apply to the latest Cilium version, or require backporting to stable branches with the associated risks/complexity. I don't have a strong opinion though, and I'm open to switching to the other approach if there's consensus on it.
I raised similar concerns in another PR. Even though the main discussion there was about using a structured output (JSON instead of text), the aspect of having to implement and backport the functionality in the Cilium agent was the same, with the same argument that backporting new functionality should be straightforward and a non-blocker, especially if it helps to prevent flakes in CI on those stable branches too.
Even though the situation is slightly different, I thought it fair to raise this point here as well. What's the opinion of the maintainers? (@tklauser @michi-covalent)
I thought a bit more about this, and the main thing that feels weird to me about adding it to the general status is that we would have a separate Endpoint section with only the RestorationCompleted information, while all the other endpoint-related information can be retrieved through cilium endpoint list. Backporting such a change would also likely be easy, but not necessarily trivial (in the sense of zero conflicts), as there have been a few changes over time in how we track the completion of endpoint regeneration.
WDYT? Does it sound reasonable to you to keep the current CLI-only approach?
IMO, the one (having a status for a subsystem, even if it's just the first piece of status info for that subsystem: restoration completed) should not exclude the other (retrieving detailed information from / interacting with the subsystem) 🤷
The status should provide an initial overview of the system with its subsystems (and outline potential issues), not replace the more detailed commands that are used to further investigate potential issues.
In general I agree with the point raised by @mhofstetter, but I think in this particular case the cost of changing cilium-dbg and backporting to all stable branches doesn't justify the cleaner implementation. I'd tend to say that it's fine to keep the current CLI-only approach for the sake of this PR, and maybe consider changing to the cleaner approach as a follow-up.
@youngnick Gentle ping 🙏
CI is green, reviews covering all codeowners are in. Marking ready to merge.