BREAKING CHANGE: proposal: make kubernetes_host array to provide fallback mechanism and retry resiliency #157

Dentrax commented Jul 20, 2022

Abstract

The current implementation of kubernetes_host only accepts a string type, as we can see in the schema. The problem is that we can only pass:

  • a single Kubernetes API server
  • a TCP load balancer in front of the master nodes [1]

High availability is normally achieved via a service/LB in front of the master nodes in the cluster, but that requires a lot of extra work in a typical infrastructure. People also may not want to take on additional overhead just to gain some extra resiliency for the Vault Kubernetes auth method.
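For context, here is a rough sketch (not the plugin's exact code) of the config field change this proposal implies, assuming the field is declared with the Vault SDK's framework package; the variable names and descriptions are paraphrased:

```go
package kubeauth

import "github.com/hashicorp/vault/sdk/framework"

// Today: kubernetes_host is a single string.
var kubernetesHostCurrent = &framework.FieldSchema{
	Type:        framework.TypeString,
	Description: "A host string, host:port pair, or URL to the base of the Kubernetes API server.",
}

// Proposed: an ordered list of hosts, where later entries act as fallbacks.
var kubernetesHostProposed = &framework.FieldSchema{
	Type:        framework.TypeCommaStringSlice,
	Description: "Ordered list of Kubernetes API server addresses; entries after the first are used as fallbacks.",
}
```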

Problem

We (@yilmazo @erkanzileli @developer-guy) filed this issue because one of our master nodes (the one we set as the kubernetes_host value) went down and caused an incident within a short time window. We don't actually have an LB in front of our master nodes; if we had, we probably wouldn't have hit this issue.

But what if we had an LB and it went down? That scenario still wouldn't be covered.

Solution

We should provide a solution that covers the following two scenarios:

  • consumers who don't use an LB
  • consumers who already use an LB

We should implement a fallback mechanism and provide some resiliency methods:

  1. BREAKING!: make kubernetes_host a string array: []string

If we get a 5xx or similar error from host[0], fall back to host[1]: try host[index] -> host[index+1] until the last one.

  2. Resiliency: we should retry while talking to the Kubernetes API.

For example, we could use a retryable HTTP client when calling the API in the token review function.

Both of the ideas above are essential for a highly resilient system, since we use Vault Kubernetes auth with a production Kubernetes cluster; a rough sketch combining them follows.
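A minimal sketch of how the two ideas could compose, assuming a hypothetical helper reviewTokenWithFallback and using hashicorp/go-retryablehttp for the per-host retries; the endpoint constant and error handling are simplified:

```go
package kubeauth

import (
	"bytes"
	"fmt"
	"io"

	retryablehttp "github.com/hashicorp/go-retryablehttp"
)

// Simplified TokenReview endpoint path used by this sketch.
const reviewPath = "/apis/authentication.k8s.io/v1/tokenreviews"

// reviewTokenWithFallback tries each configured host in order. Transient
// failures against a single host are retried by the retryablehttp client
// (idea 2); if a host still fails, we move on to the next one (idea 1).
func reviewTokenWithFallback(hosts []string, reviewBody []byte) ([]byte, error) {
	client := retryablehttp.NewClient()
	client.RetryMax = 3 // retry per host before falling back

	var lastErr error
	for _, host := range hosts { // host[0] -> host[1] -> ... until the last one
		resp, err := client.Post(host+reviewPath, "application/json", bytes.NewReader(reviewBody))
		if err != nil {
			lastErr = err // connection errors and exhausted retries end up here
			continue
		}
		data, readErr := io.ReadAll(resp.Body)
		resp.Body.Close()
		if readErr != nil {
			lastErr = readErr
			continue
		}
		if resp.StatusCode >= 500 {
			lastErr = fmt.Errorf("host %s returned %d", host, resp.StatusCode)
			continue
		}
		return data, nil // treat anything below 500 as a definitive answer from this host
	}
	return nil, fmt.Errorf("all kubernetes_host entries failed: %w", lastErr)
}
```

The real implementation would of course need to reuse the existing TLS and service-account JWT configuration rather than a bare client like this.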

Alternative Solution

  1. create TCP load balancer infrastructure from scratch to put in front of the master nodes, and ignore this issue
  2. create a Kubernetes Operator from scratch that watches shared informers; if any change is observed across the master nodes (i.e., if one of them goes down), call the Vault API to update the kubernetes_host key in the auth/kubernetes/config path [2] with another master node that is currently running and healthy (a rough sketch follows this list)
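For illustration, a rough sketch of what alternative 2 could look like, assuming an in-cluster operator built on client-go informers and the Vault API client; pickHealthyMaster, the port, and the node-selection logic are simplified placeholders:

```go
package main

import (
	"context"
	"log"
	"time"

	vaultapi "github.com/hashicorp/vault/api"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

// nodeReady reports whether a node currently has the Ready condition set to true.
func nodeReady(node *corev1.Node) bool {
	for _, c := range node.Status.Conditions {
		if c.Type == corev1.NodeReady {
			return c.Status == corev1.ConditionTrue
		}
	}
	return false
}

// pickHealthyMaster is a placeholder: it returns an API server URL built from
// the first Ready node's internal IP. A real operator would filter on
// control-plane labels and use the cluster's actual secure port.
func pickHealthyMaster(clientset kubernetes.Interface) string {
	nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		return ""
	}
	for _, n := range nodes.Items {
		if !nodeReady(&n) {
			continue
		}
		for _, addr := range n.Status.Addresses {
			if addr.Type == corev1.NodeInternalIP {
				return "https://" + addr.Address + ":6443"
			}
		}
	}
	return ""
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	vault, err := vaultapi.NewClient(vaultapi.DefaultConfig()) // reads VAULT_ADDR / VAULT_TOKEN from env
	if err != nil {
		log.Fatal(err)
	}

	factory := informers.NewSharedInformerFactory(clientset, 10*time.Minute)
	nodeInformer := factory.Core().V1().Nodes().Informer()
	nodeInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(_, newObj interface{}) {
			node := newObj.(*corev1.Node)
			if nodeReady(node) {
				return
			}
			// A master went unhealthy: repoint kubernetes_host at a healthy one.
			healthy := pickHealthyMaster(clientset)
			if healthy == "" {
				log.Printf("no healthy master found; leaving kubernetes_host unchanged")
				return
			}
			_, err := vault.Logical().Write("auth/kubernetes/config", map[string]interface{}{
				"kubernetes_host": healthy,
			})
			if err != nil {
				log.Printf("failed to update kubernetes_host: %v", err)
			}
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	<-stop
}
```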

A similar problem was discussed in hashicorp/vault#5408 almost 3 years ago; since the underlying issue hasn't been resolved yet, we came up with this new proposal.

cc @briankassouf @catsby @jefferai fyi @mitchellmaler @m1kola

Footnotes

  [1] https://github.com/hashicorp/vault/issues/5408#issuecomment-640946258

  [2] https://github.com/hashicorp/vault/issues/6987
