Kube-VIP does not recover after network outages #842

Open
ps-muspelkat opened this issue May 7, 2024 · 0 comments

Describe the bug
We had a network outage. Our Kubernetes clusters are spread across multiple datacenters. During the outage, the internal LAN connection between the datacenters, and therefore between the Kubernetes nodes located in different datacenters, was unavailable.
The ingress stopped working and did not come back on its own after the network had recovered.

To Reproduce
Unfortunately, this is not reproducible for us.

Expected behavior
Kube-VIP becomes available again once the network and ARP requests are working.

Environment (please complete the following information):

  • OS/Distro: Ubuntu 22.04
  • Kubernetes Version: v1.27.10
  • Kube-vip Version: v0.5.11
  • Kube-vip-cloud-provider Version: v0.0.4

Kube-vip:
We use Helm. Our adjusted values, besides the image tag v0.5.11, are:

svc_election: true
vip_leaderelection: true
cp_namespace: kube-vip
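
Roughly, these overrides correspond to a values.yaml like the sketch below. The env: nesting and the image.tag key are our assumption about the chart layout and may differ between chart versions:

# Sketch only; key layout (image.tag, env:) is assumed, values are the overrides listed above
image:
  tag: v0.5.11
env:
  svc_election: true        # per-service leader election (our override)
  vip_leaderelection: true  # leader election enabled (our override)
  cp_namespace: kube-vip    # namespace kube-vip uses for its lease (our override)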

Kube-vip-cloud-provider:
We use Helm. Everything besides the image tag v0.0.4 is left at the defaults.
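
For reference, we install it roughly as follows; the chart repository URL, chart name, and namespace reflect our setup and are assumptions that may not match other environments:

# Sketch of our install; repo URL, chart name and namespace are assumptions from our setup
helm repo add kube-vip https://kube-vip.github.io/helm-charts
helm upgrade --install kube-vip-cloud-provider kube-vip/kube-vip-cloud-provider \
  --namespace kube-system \
  --set image.tag=v0.0.4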

Additional context
We saw strange behaviour from kube-vip, which produced many log entries similar to the following (output of one of the kube-vip pods):

Logs
"I0502 21:16:32.600639 1 leaderelection.go:258] successfully acquired lease log-agent/kubevip-service1"
"time=""2024-05-02T21:16:32Z"" level=info msg=""[service] adding VIP [10.20.3.236] for [log-agent/service1] """
"I0502 21:16:35.219331 1 leaderelection.go:258] successfully acquired lease log-agent/kubevip-service2"
"time=""2024-05-02T21:16:35Z"" level=info msg=""[service] adding VIP [10.20.3.236] for [log-agent/service2] """
"E0502 21:16:35.601578 1 leaderelection.go:330] error retrieving resource lock log-agent/kubevip-service1: Get ""https://10.43.0.1:443/apis/coordination.k8s.io/v1/namespaces/log-agent/leases/kubevip-service1"": context deadline exceeded"
"I0502 21:16:35.601612 1 leaderelection.go:283] failed to renew lease log-agent/kubevip-service1: timed out waiting for the condition"
"E0502 21:16:35.601644 1 leaderelection.go:306] Failed to release lock: resource name may not be empty"
"time=""2024-05-02T21:16:35Z"" level=info msg=""[services election] service [service1] leader lost: [rke-worker-prod9]"""
"time=""2024-05-02T21:16:35Z"" level=info msg=""[LOADBALANCER] Stopping load balancers"""
"time=""2024-05-02T21:16:35Z"" level=info msg=""[VIP] Releasing the Virtual IP [10.20.3.236]"""
"time=""2024-05-02T21:16:35Z"" level=info msg=""Removed [163d2fa5-e1f8-4c9b-a528-9f9d2c962510] from manager, [1] advertised services remain"""
"time=""2024-05-02T21:16:35Z"" level=info msg=""[services election] for service [service1] stopping"""
"E0502 21:16:38.221061 1 leaderelection.go:330] error retrieving resource lock log-agent/kubevip-service2: Get ""https://10.43.0.1:443/apis/coordination.k8s.io/v1/namespaces/log-agent/leases/kubevip-service2"": context deadline exceeded"
"I0502 21:16:38.221094 1 leaderelection.go:283] failed to renew lease log-agent/kubevip-service2: timed out waiting for the condition"
"E0502 21:16:38.221126 1 leaderelection.go:306] Failed to release lock: resource name may not be empty"
"time=""2024-05-02T21:16:38Z"" level=info msg=""[services election] service [service2] leader lost: [rke-worker-prod9]"""
"time=""2024-05-02T21:16:38Z"" level=info msg=""[LOADBALANCER] Stopping load balancers"""
"time=""2024-05-02T21:16:38Z"" level=info msg=""[VIP] Releasing the Virtual IP [10.20.3.236]"""
"time=""2024-05-02T21:16:38Z"" level=info msg=""Removed [fa16f85d-a757-4e68-b322-d0a1d01706b2] from manager, [0] advertised services remain"""
"time=""2024-05-02T21:16:38Z"" level=info msg=""[services election] for service [service2] stopping"""
"time=""2024-05-02T21:16:43Z"" level=error msg=""Failed to set Services: rpc error: code = Unavailable desc = error reading from server: read tcp 10.20.0.59:34492->10.20.0.67:2379: read: connection timed out"""
"time=""2024-05-02T21:16:43Z"" level=error msg=""Failed to set Services: rpc error: code = Unavailable desc = error reading from server: read tcp 10.20.0.59:34492->10.20.0.67:2379: read: connection timed out"""
"time=""2024-05-02T21:16:43Z"" level=error msg=""Error attempting to watch Kubernetes services"""
"time=""2024-05-02T21:16:43Z"" level=error msg=""services -> {{ } { <nil>} Failure The resourceVersion for the provided watch is too old. Expired nil 410}"""
"time=""2024-05-02T21:16:43Z"" level=warning msg=""Stopping watching services for type: LoadBalancer in all namespaces"""
"time=""2024-05-02T21:16:43Z"" level=info msg=""Shutting down kube-Vip"""
"time=""2024-05-02T21:16:43Z"" level=info msg=""Starting kube-vip.io [v0.5.10]"""
"time=""2024-05-02T21:16:43Z"" level=info msg=""namespace [kube-vip], Mode: [ARP], Features(s): Control Plane:[false], Services:[true]"""
"time=""2024-05-02T21:16:43Z"" level=info msg=""No interface is specified for VIP in config, auto-detecting default Interface"""
"time=""2024-05-02T21:16:43Z"" level=info msg=""prometheus HTTP server started"""
"time=""2024-05-02T21:16:43Z"" level=info msg=""kube-vip will bind to interface [eth0]"""
"time=""2024-05-02T21:16:43Z"" level=info msg=""Starting Kube-vip Manager with the ARP engine"""
"time=""2024-05-02T21:16:43Z"" level=info msg=""beginning watching services, leaderelection will happen for every service"""
"time=""2024-05-02T21:16:43Z"" level=info msg=""starting services watcher for all namespaces"""
"time=""2024-05-02T21:16:43Z"" level=info msg=""[services election] for service [service1], namespace [log-agent], lock name [kubevip-service1], host id [rke-worker-prod9]"""
"I0502 21:16:43.387249 1 leaderelection.go:248] attempting to acquire leader lease log-agent/kubevip-service1..."
"time=""2024-05-02T21:16:43Z"" level=info msg=""[services election] for service [service3], namespace [log-agent], lock name [kubevip-service3], host id [rke-worker-prod9]"""
"I0502 21:16:43.387424 1 leaderelection.go:248] attempting to acquire leader lease log-agent/kubevip-service3..."
"time=""2024-05-02T21:16:43Z"" level=info msg=""[services election] for service [service2], namespace [log-agent], lock name [kubevip-service2], host id [rke-worker-prod9]"""
"I0502 21:16:43.387603 1 leaderelection.go:248] attempting to acquire leader lease log-agent/kubevip-service2..."
"time=""2024-05-02T21:16:43Z"" level=info msg=""[services election] for service [service4], namespace [log-agent], lock name [kubevip-service4], host id [rke-worker-prod9]"""
"I0502 21:16:43.387759 1 leaderelection.go:248] attempting to acquire leader lease log-agent/kubevip-service4..."
"time=""2024-05-02T21:16:43Z"" level=info msg=""[services election] for service [service5], namespace [log-agent], lock name [kubevip-service5], host id [rke-worker-prod9]"""
"I0502 21:16:43.387865 1 leaderelection.go:248] attempting to acquire leader lease log-agent/kubevip-service5..."
"time=""2024-05-02T21:16:43Z"" level=info msg=""[services election] for service [ingress-nginx-controller], namespace [ingress-nginx], lock name [kubevip-ingress-nginx-controller], host id [rke-worker-prod9]"""
"I0502 21:16:43.388476 1 leaderelection.go:248] attempting to acquire leader lease ingress-nginx/kubevip-ingress-nginx-controller..."
"time=""2024-05-02T21:16:43Z"" level=info msg=""[services election] new leader elected: rke-worker-prod5"""
"time=""2024-05-02T21:16:43Z"" level=info msg=""[services election] new leader elected: rke-worker-prod4"""
"I0502 21:16:43.393371 1 leaderelection.go:258] successfully acquired lease log-agent/kubevip-service1"
"time=""2024-05-02T21:16:43Z"" level=info msg=""[service] adding VIP [10.20.3.236] for [log-agent/service1] """
"I0502 21:16:43.396161 1 leaderelection.go:258] successfully acquired lease log-agent/kubevip-service3"
"time=""2024-05-02T21:16:43Z"" level=info msg=""[service] adding VIP [10.20.3.236] for [log-agent/service3] """
"E0502 21:16:46.394107 1 leaderelection.go:330] error retrieving resource lock log-agent/kubevip-service1: Get ""https://10.43.0.1:443/apis/coordination.k8s.io/v1/namespaces/log-agent/leases/kubevip-service1"": context deadline exceeded"
"I0502 21:16:46.394150 1 leaderelection.go:283] failed to renew lease log-agent/kubevip-service1: timed out waiting for the condition"
"E0502 21:16:46.394176 1 leaderelection.go:306] Failed to release lock: resource name may not be empty"
"time=""2024-05-02T21:16:46Z"" level=info msg=""[services election] service [service1] leader lost: [rke-worker-prod9]"""
"time=""2024-05-02T21:16:46Z"" level=info msg=""[LOADBALANCER] Stopping load balancers"""
"time=""2024-05-02T21:16:46Z"" level=info msg=""[VIP] Releasing the Virtual IP [10.20.3.236]"""
"time=""2024-05-02T21:16:46Z"" level=info msg=""Removed [163d2fa5-e1f8-4c9b-a528-9f9d2c962510] from manager, [1] advertised services remain"""
"time=""2024-05-02T21:16:46Z"" level=info msg=""[services election] for service [service1] stopping"""
"time=""2024-05-02T21:16:46Z"" level=warning msg=""Re-applying the VIP configuration [10.20.3.236] to the interface [eth0]"""
"E0502 21:16:47.593439 1 leaderelection.go:330] error retrieving resource lock log-agent/kubevip-service3: Get ""https://10.43.0.1:443/apis/coordination.k8s.io/v1/namespaces/log-agent/leases/kubevip-service3"": context deadline exceeded"
"I0502 21:16:47.593466 1 leaderelection.go:283] failed to renew lease log-agent/kubevip-service3: timed out waiting for the condition"
"E0502 21:16:47.593498 1 leaderelection.go:306] Failed to release lock: resource name may not be empty"
"time=""2024-05-02T21:16:47Z"" level=info msg=""[services election] service [service3] leader lost: [rke-worker-prod9]"""
"time=""2024-05-02T21:16:47Z"" level=info msg=""[LOADBALANCER] Stopping load balancers"""
"time=""2024-05-02T21:16:47Z"" level=info msg=""[VIP] Releasing the Virtual IP [10.20.3.236]"""
"time=""2024-05-02T21:16:47Z"" level=info msg=""Removed [3df4d7c5-3842-4c1c-8c5e-597e983273fb] from manager, [0] advertised services remain"""
"time=""2024-05-02T21:16:47Z"" level=info msg=""[services election] for service [service3] stopping"""
"time=""2024-05-02T21:16:48Z"" level=info msg=""[services election] new leader elected: rke-worker-prod8"""
"time=""2024-05-02T21:16:48Z"" level=info msg=""[services election] new leader elected: rke-worker-prod2"""
"time=""2024-05-02T21:16:48Z"" level=info msg=""[service] synchronised in 4954ms"""
"time=""2024-05-02T21:16:49Z"" level=info msg=""[services election] new leader elected: rke-worker-prod5"""
"time=""2024-05-02T21:16:49Z"" level=info msg=""[services election] new leader elected: rke-worker-prod2"""
"time=""2024-05-02T21:16:51Z"" level=info msg=""[services election] new leader elected: rke-worker-prod4"""
"time=""2024-05-02T21:16:53Z"" level=error msg=""Failed to set Services: rpc error: code = Unavailable desc = error reading from server: read tcp 10.20.0.61:53214->10.20.0.67:2379: read: connection timed out"""
"time=""2024-05-02T21:16:53Z"" level=info msg=""[services election] for service [service3], namespace [log-agent], lock name [kubevip-service3], host id [rke-worker-prod9]"""
"I0502 21:16:53.690223 1 leaderelection.go:248] attempting to acquire leader lease log-agent/kubevip-service3..."
"time=""2024-05-02T21:16:53Z"" level=info msg=""[services election] new leader elected: rke-worker-prod4"""
"I0502 21:16:32.600639 1 leaderelection.go:258] successfully acquired lease log-agent/kubevip-service1"
"time=""2024-05-02T21:16:32Z"" level=info msg=""[service] adding VIP [10.20.3.236] for [log-agent/service1] """
"I0502 21:16:35.219331 1 leaderelection.go:258] successfully acquired lease log-agent/kubevip-service2"
"time=""2024-05-02T21:16:35Z"" level=info msg=""[service] adding VIP [10.20.3.236] for [log-agent/service2] """
"E0502 21:16:35.601578 1 leaderelection.go:330] error retrieving resource lock log-agent/kubevip-service1: Get ""https://10.43.0.1:443/apis/coordination.k8s.io/v1/namespaces/log-agent/leases/kubevip-service1"": context deadline exceeded"
"I0502 21:16:35.601612 1 leaderelection.go:283] failed to renew lease log-agent/kubevip-service1: timed out waiting for the condition"
"E0502 21:16:35.601644 1 leaderelection.go:306] Failed to release lock: resource name may not be empty"
"time=""2024-05-02T21:16:35Z"" level=info msg=""[services election] service [service1] leader lost: [rke-worker-prod9]"""
"time=""2024-05-02T21:16:35Z"" level=info msg=""[LOADBALANCER] Stopping load balancers"""
"time=""2024-05-02T21:16:35Z"" level=info msg=""[VIP] Releasing the Virtual IP [10.20.3.236]"""
"time=""2024-05-02T21:16:35Z"" level=info msg=""Removed [163d2fa5-e1f8-4c9b-a528-9f9d2c962510] from manager, [1] advertised services remain"""
"time=""2024-05-02T21:16:35Z"" level=info msg=""[services election] for service [service1] stopping"""
"E0502 21:16:38.221061 1 leaderelection.go:330] error retrieving resource lock log-agent/kubevip-service2: Get ""https://10.43.0.1:443/apis/coordination.k8s.io/v1/namespaces/log-agent/leases/kubevip-service2"": context deadline exceeded"
"I0502 21:16:38.221094 1 leaderelection.go:283] failed to renew lease log-agent/kubevip-service2: timed out waiting for the condition"
"E0502 21:16:38.221126 1 leaderelection.go:306] Failed to release lock: resource name may not be empty"
"time=""2024-05-02T21:16:38Z"" level=info msg=""[services election] service [service2] leader lost: [rke-worker-prod9]"""
"time=""2024-05-02T21:16:38Z"" level=info msg=""[LOADBALANCER] Stopping load balancers"""
"time=""2024-05-02T21:16:38Z"" level=info msg=""[VIP] Releasing the Virtual IP [10.20.3.236]"""
"time=""2024-05-02T21:16:38Z"" level=info msg=""Removed [fa16f85d-a757-4e68-b322-d0a1d01706b2] from manager, [0] advertised services remain"""
"time=""2024-05-02T21:16:38Z"" level=info msg=""[services election] for service [service2] stopping"""
"time=""2024-05-02T21:16:43Z"" level=error msg=""Failed to set Services: rpc error: code = Unavailable desc = error reading from server: read tcp 10.20.0.59:34492->10.20.0.67:2379: read: connection timed out"""
"time=""2024-05-02T21:16:43Z"" level=error msg=""Failed to set Services: rpc error: code = Unavailable desc = error reading from server: read tcp 10.20.0.59:34492->10.20.0.67:2379: read: connection timed out"""
"time=""2024-05-02T21:16:43Z"" level=error msg=""Error attempting to watch Kubernetes services"""
"time=""2024-05-02T21:16:43Z"" level=error msg=""services -> {{ } { <nil>} Failure The resourceVersion for the provided watch is too old. Expired nil 410}"""
"time=""2024-05-02T21:16:43Z"" level=warning msg=""Stopping watching services for type: LoadBalancer in all namespaces"""
"time=""2024-05-02T21:16:43Z"" level=info msg=""Shutting down kube-Vip"""
"time=""2024-05-02T21:53:55Z"" level=info msg=""[services election] for service [service1], namespace [log-agent], lock name [kubevip-service1], host id [rke-worker-prod9]"""
"I0502 21:53:55.845366 1 leaderelection.go:248] attempting to acquire leader lease log-agent/kubevip-service1..."
"time=""2024-05-02T21:53:55Z"" level=info msg=""[services election] new leader elected: rke-worker-prod5"""
"time=""2024-05-02T22:29:43Z"" level=info msg=""Received kube-vip termination, signaling shutdown"""
"time=""2024-05-02T22:29:43Z"" level=warning msg=""Stopping watching services for type: LoadBalancer in all namespaces"""
"time=""2024-05-02T22:29:43Z"" level=info msg=""Shutting down kube-Vip"""

In the end, we restarted the kube-vip DaemonSet. Immediately, kube-vip and the ingress started working again.
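
For completeness, the restart was essentially the following; the DaemonSet name and namespace are from our installation and may differ elsewhere:

# Restart the kube-vip DaemonSet and wait for the rollout to complete
# (DaemonSet name and namespace are assumptions from our installation)
kubectl -n kube-vip rollout restart daemonset kube-vip
kubectl -n kube-vip rollout status daemonset kube-vip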

Given this information and the log output, is there anything you can identify? We can provide additional logs if they help.
Ideally, we would like some tips on how we can reduce the impact of such outages on kube-vip.

We know that we are running old versions and will take upgrading as our next step, hopefully mitigating this type of issue in the future.

Thanks in advance for any help or hints on this issue.
