Fix machine-controller CNI setup for fully joining nodes to kind control plane #1462

embik opened this issue Oct 14, 2022

This is a follow-up from the issues worked around in #1459.

Background

In #1304, we changed our CI setup to get rid of the Hetzner VMs provisioned for each CI job and to mirror what we have been doing in KKP CI for some time now. The big difference between the KKP tests and the MC tests is that in KKP, only the KKP control plane runs in kind, while in the MC tests we join Machines to the kind control plane directly. Because of that, the default CNI (kindnet) did not work, so we opted for flannel as the CNI. This appeared to work fine: tests were passing and Nodes were marked as ready upon joining the kind control plane.
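For reference, that setup roughly corresponds to creating a kind cluster with the default CNI disabled and applying the flannel manifest afterwards. This is a minimal sketch only; the exact config, manifest URL, and flannel version used in our CI scripts may differ.

```sh
# Sketch of the kind + flannel setup described above (assumed config, not the CI scripts verbatim).
# kindnet is disabled so flannel can be installed instead.
cat <<EOF > kind-config.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  disableDefaultCNI: true
EOF
kind create cluster --config kind-config.yaml

# Install flannel as the CNI (manifest URL is illustrative).
kubectl apply -f https://raw.githubusercontent.com/flannel-io/flannel/master/Documentation/kube-flannel.yml
```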

Current Problem

Recently, we upgraded our CI environment and the underlying container runtime switched from docker to containerd. That apparently broke the kind control plane when using flannel as the CNI. Specifically, requests from machine-controller-webhook to cloud provider APIs failed, and after some investigation the problem appeared to be DNS resolution through the in-cluster DNS service IP. This only happens with flannel and it is not clear why, but our nested container setup probably comes with a fairly unique set of problems.
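A quick way to illustrate the symptom (a hypothetical check with a throwaway busybox pod, not the actual CI test) is to resolve a cluster-internal name through the in-cluster DNS service from a Pod:

```sh
# Hypothetical reproduction of the DNS symptom: resolve an in-cluster name
# through the cluster DNS service from a throwaway pod.
kubectl run dns-check --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup kubernetes.default.svc.cluster.local
```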

So the idea was to replace the CNI. Both Calico and Cilium were tried. After some more investigation, the following underlying problem was identified in the test architecture:

The Kubernetes API is not accessible as an in-cluster Service from any of the nodes, because the Endpoints of the "kubernetes" Service point to a 172.16.0.0/16 address. That means calling the kubernetes.default.svc.cluster.local endpoint from a Pod on a Node joined to the kind control plane cannot work. Overriding the advertised IP address is not possible either, because the "kubernetes" Service is exposed as a NodePort to make the whole cluster-exposer logic work and thus make the API accessible to Nodes in the first place. If you update the advertised IP to something publicly accessible, you create a loop: the Service endpoint points to the public IP, the public IP plus port point back to the "kubernetes" Service, which again uses the public IP as its endpoint, and so on.
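To make the mismatch concrete, comparing the Endpoints of the default "kubernetes" Service with how the Service is actually exposed shows the problem (inspection commands for illustration, not part of the CI setup):

```sh
# The Endpoints of the "kubernetes" Service show the address the kube-apiserver
# advertises (a 172.16.0.0/16 address in our setup), which joined Nodes cannot reach.
kubectl get endpoints kubernetes -o jsonpath='{.subsets[*].addresses[*].ip}{"\n"}'

# The Service itself is additionally exposed as a NodePort for the cluster-exposer logic.
kubectl get service kubernetes -o wide
```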

CNI pods therefore cannot talk to the Kubernetes API to properly initialise Nodes into the pod overlay network, so Nodes can never become ready, which is something we want to verify in machine-controller e2e tests (but that check will be removed via #1459).
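In effect, the CNI components need to reach the in-cluster API endpoint during Node initialisation. A simplified, hypothetical stand-in for that call (pod name, image, and the node name placeholder are not from our setup) would look like this, and in the setup described above it cannot succeed on a joined Node because the advertised endpoint address is unreachable from there:

```sh
# Hypothetical reachability check, pinned to a joined Node via nodeName.
# On a Node joined to the kind control plane this connection cannot be established.
kubectl run api-check --rm -it --restart=Never \
  --overrides='{"apiVersion":"v1","spec":{"nodeName":"<joined-node>"}}' \
  --image=curlimages/curl --command -- \
  curl -sk https://kubernetes.default.svc.cluster.local/healthz
```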

Why is this not a problem with KKP tests?

For KKP tests, no Nodes are joined to the kind cluster. Instead, kind is used to host KKP user cluster control planes, which are built for this purpose and can be used from the outside without the same set of problems (because the Kubernetes API endpoints are routed directly to the kube-apiserver instances run for a user cluster control plane).

Why does this work with some CNIs?

Pretty good question. The gist seems to be that CNIs handle Node initialisation in different ways, which explains the different behaviour: some CNIs appear to work but do not provide a functional network. Either way, networking to services running on the kind control plane cannot work in the current setup, so we never had functional Nodes, even if they were marked as ready. Calico just uncovers the problem by crashing early.

How to solve

We need to properly solve exposing the control plane. There might be options for that within the current kind setup, but an alternative would be to launch a user cluster control plane via KKP. The question there is whether we want to make MC e2e jobs depend on KKP functionality, and the answer is probably no.
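As a rough sketch of the first option (an assumption, not a verified fix): kind exposes knobs for binding the API server to an externally reachable address and port, but note that this only changes where the API server listens, not the Endpoint of the "kubernetes" Service, so the loop described above would still need to be addressed on top of it.

```sh
# Possible starting point within the current kind setup (assumed, unverified):
# expose the API server on an external address/port. This alone does not change
# the advertised Endpoint of the "kubernetes" Service.
cat <<EOF > kind-config.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  apiServerAddress: "0.0.0.0"
  apiServerPort: 6443
EOF
kind create cluster --config kind-config.yaml
```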

Acceptance Criteria
