Fix machine-controller CNI setup for fully joining nodes to kind control plane #1462

embik opened this issue Oct 14, 2022

This is a follow-up from the issues worked around in #1459.

Background

In #1304, we changed our CI setup to get rid of the Hetzner VMs provisioned for each CI job and to mirror what we have been doing in KKP CI for some time now. The big difference between the KKP tests and the MC tests is that in KKP, only the KKP control plane runs in kind, while in the MC tests we join Machines to the kind control plane directly. Because of that, the default CNI (kindnet) did not work, so we opted for flannel as the CNI. This appeared to work fine: tests were passing and Nodes were marked as ready upon joining the kind control plane.
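For reference, that setup roughly corresponds to creating a kind cluster with the default CNI disabled and applying the flannel manifest afterwards. This is a minimal sketch only; the exact config, manifest URL, and flannel version used in our CI scripts may differ.

```sh
# Sketch of the kind + flannel setup described above (assumed config, not the CI scripts verbatim).
# kindnet is disabled so flannel can be installed instead.
cat <<EOF > kind-config.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  disableDefaultCNI: true
EOF
kind create cluster --config kind-config.yaml

# Install flannel as the CNI (manifest URL is illustrative).
kubectl apply -f https://raw.githubusercontent.com/flannel-io/flannel/master/Documentation/kube-flannel.yml
```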

Current Problem

Recently, we upgraded our CI environment and the underlying container runtime switched from docker to containerd. That apparently broke the kind control plane when using flannel as the CNI. Specifically, requests from machine-controller-webhook to cloud provider APIs failed, and after some investigation the problem appeared to be DNS resolution through the in-cluster DNS service IP. This only happens with flannel and it is not clear why, but our nested container setup probably comes with a fairly unique set of problems.
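A quick way to illustrate the symptom (a hypothetical check with a throwaway busybox pod, not the actual CI test) is to resolve a cluster-internal name through the in-cluster DNS service from a Pod:

```sh
# Hypothetical reproduction of the DNS symptom: resolve an in-cluster name
# through the cluster DNS service from a throwaway pod.
kubectl run dns-check --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup kubernetes.default.svc.cluster.local
```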

So the idea was to replace the CNI. Both Calico and Cilium were tried. After some more investigation, the following underlying problem was identified in the test architecture:

The Kubernetes API is not accessible as an in-cluster Service from any of the nodes, because the Endpoints of the "kubernetes" Service point to a 172.16.0.0/16 address. That means calling the kubernetes.default.svc.cluster.local endpoint from a Pod on a Node joined to the kind control plane cannot work. Overriding the advertised IP address is not possible either, because the "kubernetes" Service is exposed as a NodePort to make the whole cluster-exposer logic work and thus make the API accessible to Nodes in the first place. If you update the advertised IP to something publicly accessible, you create a loop: the Service endpoint points to the public IP, the public IP plus port point back to the "kubernetes" Service, which again uses the public IP as its endpoint, and so on.
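To make the mismatch concrete, comparing the Endpoints of the default "kubernetes" Service with how the Service is actually exposed shows the problem (inspection commands for illustration, not part of the CI setup):

```sh
# The Endpoints of the "kubernetes" Service show the address the kube-apiserver
# advertises (a 172.16.0.0/16 address in our setup), which joined Nodes cannot reach.
kubectl get endpoints kubernetes -o jsonpath='{.subsets[*].addresses[*].ip}{"\n"}'

# The Service itself is additionally exposed as a NodePort for the cluster-exposer logic.
kubectl get service kubernetes -o wide
```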

CNI pods therefore cannot talk to the Kubernetes API to properly initialise Nodes into the pod overlay network, so Nodes can never become ready, which is something we want to verify in machine-controller e2e tests (but that check will be removed via #1459).
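In effect, the CNI components need to reach the in-cluster API endpoint during Node initialisation. A simplified, hypothetical stand-in for that call (pod name, image, and the node name placeholder are not from our setup) would look like this, and in the setup described above it cannot succeed on a joined Node because the advertised endpoint address is unreachable from there:

```sh
# Hypothetical reachability check, pinned to a joined Node via nodeName.
# On a Node joined to the kind control plane this connection cannot be established.
kubectl run api-check --rm -it --restart=Never \
  --overrides='{"apiVersion":"v1","spec":{"nodeName":"<joined-node>"}}' \
  --image=curlimages/curl --command -- \
  curl -sk https://kubernetes.default.svc.cluster.local/healthz
```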

Why is this not a problem with KKP tests?

For KKP tests, no Nodes are joined to the kind cluster. Instead, kind is used to host KKP user cluster control planes, which are built for this purpose and can be used from the outside without the same set of problems (because the Kubernetes API endpoints are routed directly to the kube-apiserver instances run for a user cluster control plane).

Why does this work with some CNIs?

Pretty good question. The gist seems to be that CNIs handle Node initialisation in different ways, which explains the different behaviour: some CNIs appear to work but do not provide a functional network. Either way, networking to services running on the kind control plane cannot work in the current setup, so we never had functional Nodes, even if they were marked as ready. Calico just uncovers the problem by crashing early.

How to solve

We need to properly solve exposing the control plane. There might be options for that within the current kind setup, but an alternative would be to launch a user cluster control plane via KKP. The question there is whether we want to make MC e2e jobs depend on KKP functionality, and the answer is probably no.
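As a rough sketch of the first option (an assumption, not a verified fix): kind exposes knobs for binding the API server to an externally reachable address and port, but note that this only changes where the API server listens, not the Endpoint of the "kubernetes" Service, so the loop described above would still need to be addressed on top of it.

```sh
# Possible starting point within the current kind setup (assumed, unverified):
# expose the API server on an external address/port. This alone does not change
# the advertised Endpoint of the "kubernetes" Service.
cat <<EOF > kind-config.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  apiServerAddress: "0.0.0.0"
  apiServerPort: 6443
EOF
kind create cluster --config kind-config.yaml
```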

Acceptance Criteria
