
Handle API server down #9

Open
rata opened this issue Mar 12, 2021 · 1 comment
Labels
kind/enhancement New feature or request

Comments


rata commented Mar 12, 2021

Description

If the API server is not available, we log an error but still perform the calls (like mkdir) without any additional information.

Impact

It was really unclear to me why I wasn't seeing directories created with the namespace (ns) name.

Environment and steps to reproduce

  1. Set-up: Agent deployed to Kubernetes as the install docs describe
  2. Task: Create the example pod from the install docs
  3. Action(s): kubectl exec -ti <pod> -- bash and run mkdir /tmp/a in the pod
  4. Error: If the agent can't contact the API server, the command will just create dir a instead of a-ns-something.

Expected behavior
I'd like to see a clearer message explaining what will happen when the API server can't be reached. Something like: "API server down, will proceed with XXX".

This seems especially important to handle correctly when a pod is restarted (or killed, if the kubelet restarts them). My gut feeling is that if we kubectl delete pod <pod> while the API server is down, the kubelet will re-create it (if it is on a deployment). At least that happens on my setup, but I think the kubelet can contact the API server here; only the agent is blocked, due to the firewall. We should check with a real API server outage. And IIUC the seccomp agent will query the API server in this restart case.

Assuming that happens, the following seems important to me:
Kubernetes is generally made to work when the API server is down. The kubelet should continue to work fine and restart pods already scheduled, etc. That is a big guarantee Kubernetes gives and we shouldn't break it when using the seccomp agent (in the future, it is ok at this stage :))

Additional information
I happened to hit this setup because I have a local firewall that blocks the traffic, which NetworkManager constantly re-applies. It was nice to narrow down, though, as I now have a better setup locally :)

@rata rata added the kind/enhancement New feature or request label Mar 12, 2021
alban commented Mar 12, 2021

Good finding, I agree.

Some related thoughts:

  • In the current code, the call to the API server happens for each new seccomp-fd but not for each syscall. So I think that even if the firewall is "fixed", currently running pods would still show the wrong behaviour.
  • With the new containerd >= v1.5.0-beta.1 (cri: add annotations for pod name and namespace containerd/containerd#4922) or with cri-o, we have enough annotations included in the message sent via the seccomp agent socket, so a call to the API server to get the pod info is no longer necessary (see code).
  • But with my opa branch, I am planning to make new calls to the API server to get custom resources, in a similar way to Gatekeeper's constraints. That would work as a cache, though, so API server calls would not be performed during the syscall critical path, and it would not matter if the API server is down for a little while.
