BREAKING CHANGE: proposal: make kubernetes_host array to provide fallback mechanism and retry resiliency #157

Dentrax commented Jul 20, 2022

Abstract

The current implementation of kubernetes_host only accepts a string type, as we can see in the schema. The problem is that we can only pass:

  • a single Kubernetes API server
  • a TCP load balancer in front of the master nodes [1]

High availability is normally achieved via a service/LB in front of the master nodes in the cluster, but that requires a lot of extra work in a typical infrastructure. People also may not want to take on additional overhead just to gain some extra resiliency for the Vault Kubernetes auth method.
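For context, here is a rough sketch (not the plugin's exact code) of the config field change this proposal implies, assuming the field is declared with the Vault SDK's framework package; the variable names and descriptions are paraphrased:

```go
package kubeauth

import "github.com/hashicorp/vault/sdk/framework"

// Today: kubernetes_host is a single string.
var kubernetesHostCurrent = &framework.FieldSchema{
	Type:        framework.TypeString,
	Description: "A host string, host:port pair, or URL to the base of the Kubernetes API server.",
}

// Proposed: an ordered list of hosts, where later entries act as fallbacks.
var kubernetesHostProposed = &framework.FieldSchema{
	Type:        framework.TypeCommaStringSlice,
	Description: "Ordered list of Kubernetes API server addresses; entries after the first are used as fallbacks.",
}
```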

Problem

We (@yilmazo @erkanzileli @developer-guy) filed this issue because one of our master nodes (the one we set as the kubernetes_host value) went down and caused an incident within a short time window. We don't actually have an LB in front of our master nodes; if we had, we probably wouldn't have hit this issue.

But what if we had an LB and it went down? That scenario still wouldn't be covered.

Solution

We should provide a solution that covers the following two scenarios:

  • consumers who don't use an LB
  • consumers who already use an LB

We should implement a fallback mechanism and provide some resiliency methods:

  1. BREAKING!: make kubernetes_host a string array: []string

If we get a 5xx or similar error from host[0], fall back to host[1]: try host[index] -> host[index+1] until the last one.

  2. Resiliency: we should retry while talking to the Kubernetes API.

For example, we could use a retryable HTTP client when calling the API in the token review function.

Both of the ideas above are essential for a highly resilient system, since we use Vault Kubernetes auth with a production Kubernetes cluster; a rough sketch combining them follows.
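A minimal sketch of how the two ideas could compose, assuming a hypothetical helper reviewTokenWithFallback and using hashicorp/go-retryablehttp for the per-host retries; the endpoint constant and error handling are simplified:

```go
package kubeauth

import (
	"bytes"
	"fmt"
	"io"

	retryablehttp "github.com/hashicorp/go-retryablehttp"
)

// Simplified TokenReview endpoint path used by this sketch.
const reviewPath = "/apis/authentication.k8s.io/v1/tokenreviews"

// reviewTokenWithFallback tries each configured host in order. Transient
// failures against a single host are retried by the retryablehttp client
// (idea 2); if a host still fails, we move on to the next one (idea 1).
func reviewTokenWithFallback(hosts []string, reviewBody []byte) ([]byte, error) {
	client := retryablehttp.NewClient()
	client.RetryMax = 3 // retry per host before falling back

	var lastErr error
	for _, host := range hosts { // host[0] -> host[1] -> ... until the last one
		resp, err := client.Post(host+reviewPath, "application/json", bytes.NewReader(reviewBody))
		if err != nil {
			lastErr = err // connection errors and exhausted retries end up here
			continue
		}
		data, readErr := io.ReadAll(resp.Body)
		resp.Body.Close()
		if readErr != nil {
			lastErr = readErr
			continue
		}
		if resp.StatusCode >= 500 {
			lastErr = fmt.Errorf("host %s returned %d", host, resp.StatusCode)
			continue
		}
		return data, nil // treat anything below 500 as a definitive answer from this host
	}
	return nil, fmt.Errorf("all kubernetes_host entries failed: %w", lastErr)
}
```

The real implementation would of course need to reuse the existing TLS and service-account JWT configuration rather than a bare client like this.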

Alternative Solution

  1. create TCP load balancer infrastructure from scratch to put in front of the master nodes, and ignore this issue
  2. create a Kubernetes Operator from scratch that watches shared informers; if any change is observed across the master nodes (i.e., if one of them goes down), call the Vault API to update the kubernetes_host key in the auth/kubernetes/config path [2] with another master node that is currently running and healthy (a rough sketch follows this list)
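For illustration, a rough sketch of what alternative 2 could look like, assuming an in-cluster operator built on client-go informers and the Vault API client; pickHealthyMaster, the port, and the node-selection logic are simplified placeholders:

```go
package main

import (
	"context"
	"log"
	"time"

	vaultapi "github.com/hashicorp/vault/api"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

// nodeReady reports whether a node currently has the Ready condition set to true.
func nodeReady(node *corev1.Node) bool {
	for _, c := range node.Status.Conditions {
		if c.Type == corev1.NodeReady {
			return c.Status == corev1.ConditionTrue
		}
	}
	return false
}

// pickHealthyMaster is a placeholder: it returns an API server URL built from
// the first Ready node's internal IP. A real operator would filter on
// control-plane labels and use the cluster's actual secure port.
func pickHealthyMaster(clientset kubernetes.Interface) string {
	nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		return ""
	}
	for _, n := range nodes.Items {
		if !nodeReady(&n) {
			continue
		}
		for _, addr := range n.Status.Addresses {
			if addr.Type == corev1.NodeInternalIP {
				return "https://" + addr.Address + ":6443"
			}
		}
	}
	return ""
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	vault, err := vaultapi.NewClient(vaultapi.DefaultConfig()) // reads VAULT_ADDR / VAULT_TOKEN from env
	if err != nil {
		log.Fatal(err)
	}

	factory := informers.NewSharedInformerFactory(clientset, 10*time.Minute)
	nodeInformer := factory.Core().V1().Nodes().Informer()
	nodeInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(_, newObj interface{}) {
			node := newObj.(*corev1.Node)
			if nodeReady(node) {
				return
			}
			// A master went unhealthy: repoint kubernetes_host at a healthy one.
			healthy := pickHealthyMaster(clientset)
			if healthy == "" {
				log.Printf("no healthy master found; leaving kubernetes_host unchanged")
				return
			}
			_, err := vault.Logical().Write("auth/kubernetes/config", map[string]interface{}{
				"kubernetes_host": healthy,
			})
			if err != nil {
				log.Printf("failed to update kubernetes_host: %v", err)
			}
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	<-stop
}
```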

A similar problem was discussed in hashicorp/vault#5408 almost 3 years ago; since the underlying issue hasn't been resolved yet, we came up with this new proposal.

cc @briankassouf @catsby @jefferai fyi @mitchellmaler @m1kola

Footnotes

  [1] https://github.com/hashicorp/vault/issues/5408#issuecomment-640946258

  [2] https://github.com/hashicorp/vault/issues/6987
