
Handle API server down #9

Open
rata opened this issue Mar 12, 2021 · 1 comment
Labels
kind/enhancement New feature or request

Comments


rata commented Mar 12, 2021

Description

If the API server is not available, we log an error but still perform the calls (like mkdir) without any additional information.

Impact

It was really unclear to me why I wasn't seeing directories created with the namespace (ns) name.

Environment and steps to reproduce

  1. Set-up: Agent deployed to Kubernetes as the install docs describe
  2. Task: Create the example pod from the install docs
  3. Action(s): kubectl exec -ti <pod> -- bash and run mkdir /tmp/a in the pod
  4. Error: If the agent can't contact the API server, the command will just create dir a instead of a-ns-something.

Expected behavior
I'd like to see a clearer message explaining what will happen when the API server can't be reached. Something like: "API server down, will proceed with XXX".

This seems especially important to handle correctly when a pod is restarted (or killed, if the kubelet restarts them). My gut feeling is that if we kubectl delete pod <pod> while the API server is down, the kubelet will re-create it (if it is on a deployment). At least that happens on my setup, but I think the kubelet can contact the API server here; only the agent is blocked, due to the firewall. We should check with a real API server outage. And IIUC the seccomp agent will query the API server in this restart case.

Assuming that happens, the following seems important to me:
Kubernetes is generally made to work when the API server is down. The kubelet should continue to work fine and restart pods already scheduled, etc. That is a big guarantee Kubernetes gives and we shouldn't break it when using the seccomp agent (in the future, it is ok at this stage :))

Additional information
I happened to hit this setup because I have a local firewall that blocks the traffic, which NetworkManager constantly re-applies. It was nice to narrow down, though, as I now have a better setup locally :)

@rata rata added the kind/enhancement New feature or request label Mar 12, 2021
alban commented Mar 12, 2021

Good finding, I agree.

Some related thoughts:

  • In the current code, the call to the API server happens for each new seccomp-fd but not for each syscall. So I think that even if the firewall is "fixed", currently running pods would still show the wrong behaviour.
  • With the new containerd >= v1.5.0-beta.1 (cri: add annotations for pod name and namespace containerd/containerd#4922) or with cri-o, we have enough annotations included in the message sent via the seccomp agent socket, so a call to the API server to get the pod info is no longer necessary (see code).
  • But with my opa branch, I am planning to make new calls to the API server to get custom resources, in a similar way to Gatekeeper's constraints. That would work as a cache, though, so API server calls would not be performed during the syscall critical path, and it would not matter if the API server is down for a little while.
