wireguard.cali interface does not have IPv4 address #8772

Open

geotransformer opened this issue Apr 28, 2024 · 7 comments

Comments
@geotransformer

geotransformer commented Apr 28, 2024

======= WireGuard interface has no IP address on one node of a 3-node k8s cluster =======
ubuntu@k8s-node3:~$ ifconfig wireguard.cali
wireguard.cali: flags=209<UP,POINTOPOINT,RUNNING,NOARP> mtu 1440
unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 txqueuelen 1000 (UNSPEC)
RX packets 299392534 bytes 133667595056 (133.6 GB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 263705618 bytes 55068382752 (55.0 GB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

========= kubeadm-based k8s cluster =========
ubuntu@k8s-node3:~$ kubectl get nodes
NAME        STATUS   ROLES           AGE   VERSION
k8s-node1   Ready    control-plane   40h   v1.26.5
k8s-node2   Ready    control-plane   39h   v1.26.5
k8s-node3   Ready    control-plane   39h   v1.26.5

========= pods scheduled
ubuntu@k8s-node3:~$ kubectl get pods -A | wc -l
362
ubuntu@k8s-node3:~$ kubectl get pods -A -owide | grep k8s-node3 | wc -l
107

========= pod subnet =============
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.26.5
certificatesDir: /data/kubernetes/pki
networking:
  serviceSubnet: 10.152.4.0/23
  podSubnet: 10.152.2.0/23
apiServer:

Expected Behavior

ubuntu@k8s-node1:~$ ifconfig wireguard.cali
wireguard.cali: flags=209<UP,POINTOPOINT,RUNNING,NOARP> mtu 1440
inet 10.152.2.66 netmask 255.255.255.255 destination 10.152.2.66

Current Behavior

ubuntu@k8s-node3:~$ ifconfig wireguard.cali
wireguard.cali: flags=209<UP,POINTOPOINT,RUNNING,NOARP> mtu 1440
unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 txqueuelen 1000 (UNSPEC)
RX packets 299392534 bytes 133667595056 (133.6 GB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 263705618 bytes 55068382752 (55.0 GB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
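
The missing address can be double-checked on the affected node with iproute2 and wireguard-tools (a minimal sketch; assumes the wg CLI is installed and Calico's default WireGuard interface name):

# Show any IPv4 address assigned to the Calico WireGuard device
ip -4 addr show dev wireguard.cali

# Show the WireGuard device state (peers, handshakes); requires wireguard-tools
sudo wg show wireguard.cali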

Possible Solution

Steps to Reproduce (for bugs)

  1. Upgrade k8s and the OS in a rolling fashion, one node at a time (see the sketch below)
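
A rough sketch of the per-node upgrade cycle (node name and kubeadm flags are illustrative placeholders, not the exact commands used here):

# On a control-plane node: take the target node out of service
kubectl cordon k8s-node3
kubectl drain k8s-node3 --ignore-daemonsets --delete-emptydir-data
kubectl delete node k8s-node3

# On the node itself: upgrade the OS, then rejoin the cluster
# (token and hash are placeholders; a real join uses a freshly minted token)
sudo kubeadm join <control-plane-endpoint> --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash> --control-plane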

Context

Your Environment

  • Calico version: calicoctl v3.24
  • Orchestrator version (e.g. kubernetes, mesos, rkt): kubernetes
  • Operating System and version: Ubuntu 20.04.6 LTS
  • Link to your project (optional):
@geotransformer
Author

geotransformer commented Apr 28, 2024

Calico node pod logs for the impacted node
ubuntu@k8s-node3:~$ kubectl get pods -A -owide | grep k8s-node3 | grep calico
kube-system calico-node-zbdf8 1/1 Running 1 40h 10.152.1.252 k8s-node3

ubuntu@k8s-node3:~$ date
Sun 28 Apr 2024 12:29:14 PM UTC

ubuntu@k8s-node3:~$ kubectl logs -n kube-system calico-node-zbdf8 | grep -i guard

2024-04-28 12:28:48.242 [INFO][84] felix/int_dataplane.go 1946: Received *proto.WireguardEndpointUpdate update from calculation graph msg=hostname:"k8s-node3" public_key:"vFqdz4DFUYlAaGzN4O3p7vkFfoxNr+aIY94e48lZ+mQ="

2024-04-28 12:28:51.607 [INFO][84] felix/int_dataplane.go 1946: Received *proto.WireguardEndpointUpdate update from calculation graph msg=hostname:"k8s-node2" public_key:"+Ek2nxBsI60WEYfoMdQmAFUZFllR4dzB2yS80yjMDFQ=" interface_ipv4_addr:"10.152.2.131"
2024-04-28 12:28:54.351 [INFO][84] felix/int_dataplane.go 1946: Received *proto.WireguardEndpointUpdate update from calculation graph msg=hostname:"k8s-node1" public_key:"iVS+PMScWh65pQS2yr0jcV9oPgsd3UbM/SwodOpB8nQ=" interface_ipv4_addr:"10.152.2.66"
2024-04-28 12:28:58.429 [INFO][84] felix/int_dataplane.go 1946: Received *proto.WireguardEndpointUpdate update from calculation graph msg=hostname:"k8s-node3" public_key:"vFqdz4DFUYlAaGzN4O3p7vkFfoxNr+aIY94e48lZ+mQ="
2024-04-28 12:29:01.926 [INFO][84] felix/int_dataplane.go 1946: Received *proto.WireguardEndpointUpdate update from calculation graph msg=hostname:"k8s-node2" public_key:"+Ek2nxBsI60WEYfoMdQmAFUZFllR4dzB2yS80yjMDFQ=" interface_ipv4_addr:"10.152.2.131"
2024-04-28 12:29:04.502 [INFO][84] felix/int_dataplane.go 1946: Received *proto.WireguardEndpointUpdate update from calculation graph msg=hostname:"k8s-node1" public_key:"iVS+PMScWh65pQS2yr0jcV9oPgsd3UbM/SwodOpB8nQ=" interface_ipv4_addr:"10.152.2.66"
2024-04-28 12:29:08.555 [INFO][84] felix/int_dataplane.go 1946: Received *proto.WireguardEndpointUpdate update from calculation graph msg=hostname:"k8s-node3" public_key:"vFqdz4DFUYlAaGzN4O3p7vkFfoxNr+aIY94e48lZ+mQ="
2024-04-28 12:29:12.032 [INFO][84] felix/int_dataplane.go 1946: Received *proto.WireguardEndpointUpdate update from calculation graph msg=hostname:"k8s-node2" public_key:"+Ek2nxBsI60WEYfoMdQmAFUZFllR4dzB2yS80yjMDFQ=" interface_ipv4_addr:"10.152.2.131"

@tomastigera
Contributor

Is the problem on a single node only or across all the nodes?

> Upgrade k8s and the OS in a rolling fashion, one node at a time

Could you state in the description what got you to this state? You had a working cluster and then you upgraded k8s and the OS? Is this the first node updated? Is there an incompatibility in wg between the old nodes and the new nodes?

@geotransformer
Author

geotransformer commented Apr 30, 2024


> Is the problem on a single node only or across all the nodes?
>
> Upgrade k8s and the OS in a rolling fashion, one node at a time
>
> Could you state in the description what got you to this state? You had a working cluster and then you upgraded k8s and the OS? Is this the first node updated? Is there an incompatibility in wg between the old nodes and the new nodes?

1> K8s was upgraded from 1.25 to 1.26; Calico was not changed and stayed at version 3.24. The impacted node is not always the same one: sometimes node2, sometimes node3. The issue was observed 3~4 times out of 100 upgrades.

2> For the k8s upgrade, the node is cordoned, drained, and removed from the k8s cluster. The OS is then upgraded, and the node is joined back to the cluster with kubeadm.

3> If the address is removed manually with ip addr del xxx dev wireguard.cali, the IP can be restored by Calico itself (see the sketch after the capture below). Wondering why, in the following scenario, it cannot recover on its own.
2024-04-28 12:28:48.242 [INFO][84] felix/int_dataplane.go 1946: Received *proto.WireguardEndpointUpdate update from calculation graph msg=hostname:"k8s-node3" public_key:"vFqdz4DFUYlAaGzN4O3p7vkFfoxNr+aIY94e48lZ+mQ="

2024-04-28 12:28:51.607 [INFO][84] felix/int_dataplane.go 1946: Received *proto.WireguardEndpointUpdate update from calculation graph msg=hostname:"k8s-node2" public_key:"+Ek2nxBsI60WEYfoMdQmAFUZFllR4dzB2yS80yjMDFQ=" interface_ipv4_addr:"10.152.2.131"

=========== the following is a capture when trying to reproduce the issue ===========

node1" public_key:"oB8moC5Qw4tbVnyvjRlEi3abHkpU5k8YCalNqAy49ik=" interface_ipv4_addr:"10.28.2.133"
2024-04-29 22:13:53.354 [INFO][1258] felix/int_dataplane.go 1680: Received *proto.WireguardEndpointUpdate update from calculation graph msg=hostname:"test-node1" public_key:"oB8moC5Qw4tbVnyvjRlEi3abHkpU5k8YCalNqAy49ik=" interface_ipv4_addr:"10.28.2.133"

2024-04-29 22:13:58.731 [INFO][1258] felix/int_dataplane.go 1680: Received *proto.WireguardEndpointUpdate update from calculation graph msg=hostname:"test-node1" public_key:"oB8moC5Qw4tbVnyvjRlEi3abHkpU5k8YCalNqAy49ik=" interface_ipv4_addr:"10.28.2.133"

2024-04-29 22:17:53.521 [INFO][1258] felix/int_dataplane.go 1680: Received *proto.WireguardEndpointRemove update from calculation graph msg=hostname:"test-node1"

2024-04-29 22:18:07.680 [INFO][1258] felix/int_dataplane.go 1680: Received *proto.WireguardEndpointUpdate update from calculation graph msg=hostname:"test-node1" public_key:"d5eb99Gp3YQYrXeBEWf7P0+QTF7Uof4g3s5dwkwONzU="

2024-04-29 22:18:07.733 [INFO][1258] felix/int_dataplane.go 1680: Received *proto.WireguardEndpointUpdate update from calculation graph msg=hostname:"test-node1" public_key:"d5eb99Gp3YQYrXeBEWf7P0+QTF7Uof4g3s5dwkwONzU=" interface_ipv4_addr:"10.28.2.136"
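
For reference, the self-recovery check described in 3> above looks roughly like this (a sketch; <tunnel-ip> stands for whatever address felix assigned on that node):

# Note the current tunnel address, then remove it
ip -4 addr show dev wireguard.cali
sudo ip addr del <tunnel-ip>/32 dev wireguard.cali

# felix should notice the drift and re-add the address shortly
watch -n 1 ip -4 addr show dev wireguard.cali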

@tomastigera
Contributor

OK so the issue is isolated to individual nodes. Could you share full logs from such a node? One possible issue is compatibility with k8s 1.26: Calico 3.24 is not really supported anymore, so you might need to upgrade.

@coutinhop
Contributor

@geotransformer may I also ask you to enable debug logging in felix? Set logSeverityScreen to Debug in the default FelixConfiguration: https://docs.tigera.io/calico/latest/operations/troubleshoot/component-logs#configure-felix-log-level
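
For example, following the linked doc (a sketch; assumes calicoctl is configured against this cluster):

# Raise felix logging to Debug in the default FelixConfiguration
calicoctl patch felixConfiguration default --patch '{"spec":{"logSeverityScreen":"Debug"}}'

# Revert to Info once the failure has been captured
calicoctl patch felixConfiguration default --patch '{"spec":{"logSeverityScreen":"Info"}}'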

@geotransformer
Author

geotransformer commented May 1, 2024

> OK so the issue is isolated to individual nodes. Could you share full logs from such a node? One possible issue is compatibility with k8s 1.26: Calico 3.24 is not really supported anymore, so you might need to upgrade.

We observed the same issue on Calico 3.27.

One thing we would like to share first: the pod subnet configured in this 3-node cluster is /23, and we use the default Kubernetes/Calico config, so one node cannot get a /24 CIDR. Kubernetes complained that no CIDR is available for node3. Since, in Calico, I believe IPAM manages the IP blocks and allocation itself, this warning/error message does not seem to be a critical issue.
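
To make the sizing concrete: a /23 pod subnet only contains two /24s, so with the default /24 node CIDR mask the third node gets no CIDR from kube-controller-manager; Calico IPAM instead carves its own blocks (/26 by default) out of the pool. The per-node block allocations can be inspected directly (a sketch; assumes calicoctl access):

# Pool usage and per-node block allocations managed by Calico IPAM
calicoctl ipam show
calicoctl ipam show --show-blocks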

Also, in our 3-node deployment we have 360+ pods and need 310+ pod IPs. During the upgrade, nodes are cordoned and drained, and the pods are then recreated on the node. @coutinhop is there some race condition in IP recycling and reuse for Calico interfaces such as wireguard.cali?
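
One way to narrow that down would be to compare what Calico has recorded for the node against what is actually on the interface (a sketch; the exact field names under the v3 Node resource are worth double-checking for your version):

# The WireGuard details Calico has stored for the node
calicoctl get node k8s-node3 -o yaml | grep -iA3 wireguard

# The address actually present on the interface
ip -4 addr show dev wireguard.cali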

@geotransformer
Author

> @geotransformer may I also ask you to enable debug logging in felix? Set logSeverityScreen to Debug in the default FelixConfiguration: https://docs.tigera.io/calico/latest/operations/troubleshoot/component-logs#configure-felix-log-level

Yes, we will try to enable this in our automated testing.
