Unable to create endpoint: Cilium API client timeout exceeded #32399
Hey, I'm hitting the same issue. I also opened a support case with GCP to see if that helps; I'll update here if I have any news!
Oh, that's great. Please keep me posted.
Will do! Nope, it's deployed with Pulumi for me (through native k8s objects).
No, nothing in particular. It especially affects RabbitMQ, the service suffering the most from this issue.
We encountered a similar issue after upgrading our GKE cluster from 1.29.1-gke.1589000 to 1.29.4-gke.1447000. In our case, downgrading the cluster fixed the issue. @Sh4d1, any luck with GCP support?
Interesting! I'm on 1.29.3. And no luck with support yet (I've linked them this issue as well).
In my case, this happens after node upgrades, since we use the rapid channel, but not always; sometimes the pod is recreated successfully. Also, for now this happens only to StatefulSets.
@harispraba @Sh4d1
Hmm, I think it's only happened to StatefulSets on my end as well! @FranAguiar it's private, and no luck yet (they asked for the full Cilium logs, but I don't have them anymore, so I'm waiting for the next occurrence to catch them).
Assigning to @christarazi, who recently worked on endpoint regeneration and StatefulSet updates.
Recently faced this as well. In my case, it turns out that Cilium fails to create the endpoint because the generated name exceeds the 63-character limit.
Renaming the chart to a shorter name fixed it.
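The 63-character cap mentioned here is Kubernetes' RFC 1123 DNS-label limit on resource names. A quick way to spot offending names is a sketch like the following; the helper function and the sample name are hypothetical, not taken from this thread:

```shell
# Hypothetical helper: flag names that exceed the 63-character
# RFC 1123 label limit Kubernetes enforces on resource names.
check_name_length() {
  name="$1"
  if [ "${#name}" -gt 63 ]; then
    echo "TOO LONG (${#name}): $name"
  else
    echo "ok (${#name}): $name"
  fi
}

# On a live cluster you could feed it real pod names, e.g.:
#   kubectl get pods -o name | sed 's|^pod/||' | \
#     while read -r n; do check_name_length "$n"; done
check_name_length "my-overly-long-release-name-rabbitmq-server-0"
```

Long Helm release names are a common way to blow past this limit, since the chart name is prepended to every generated resource name.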
Can you try v1.15.5? It contains #31605, which might resolve this problem.
Hello, I just updated my GKE cluster to the latest version, and it comes with a new Cilium version. I hope that solves the issue.
I have the issue described in the OP on 1.15.5 on my homelab bare-metal Talos cluster, but it seems to only happen on one node. All pods on that node fail to schedule, and the only useful logs are the exact ones reported in the OP. Killing the Cilium pod on that node fixes it for a while until it happens again; sometimes it happens right from the start of that Cilium pod's lifetime, and that pod needs to be killed too. Versions: Cilium Helm values (these 2 files get merged by the Flux Helm controller, and hr.yaml will override config/biohazard/helm-values.yaml if there are conflicting values): https://github.com/JJGadgets/Biohazard/blob/7004140fc1be893e1e35dac1d43148af749eb8da/kube/deploy/core/_networking/cilium/app/hr.yaml https://github.com/JJGadgets/Biohazard/blob/7004140fc1be893e1e35dac1d43148af749eb8da/kube/deploy/core/_networking/cilium/app/config/biohazard/helm-values.yaml
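The "kill the Cilium pod on that node" workaround can be sketched as below. The helper only prints the command so it can be reviewed before running; the `kube-system` namespace and `k8s-app=cilium` label match standard Cilium installs but may differ on managed clusters, and the node name is a placeholder:

```shell
# Print (rather than run) the command that deletes the Cilium agent
# pod on one node; the DaemonSet controller then schedules a fresh copy.
cilium_restart_cmd() {
  node="$1"
  echo "kubectl -n kube-system delete pod -l k8s-app=cilium --field-selector spec.nodeName=${node}"
}

cilium_restart_cmd "worker-1"  # "worker-1" is a placeholder node name
```

Deleting only the agent pod on the affected node avoids restarting Cilium cluster-wide, which matches the per-node nature of the symptom described above.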
@JJGadgets Are the workloads statefulsets? If so, please provide the Cilium logs when that occurs.
@christarazi Nope, everything from deployments, daemonsets, jobs, to KubeVirt VMs (which is a custom controller AFAIK).
@JJGadgets Ok, that sounds like a separate issue from this thread. It seems that the initial report is for statefulsets. I would encourage you to file a new issue with a sysdump of when the issue occurred.
@christarazi Will create the separate issue when I encounter the issue again; for now the node and its Cilium pod are happy.
Just happened again; this is the log from the cilium container. @Sh4d1 Share this log with Google support if you want.
Any version below 1.15.5 will not have the StatefulSet fix, so please try upgrading to that.
In my case, not only were StatefulSets affected, but Deployments were as well.
@r0bj That sounds like a separate issue as mentioned in #32399 (comment)
Hey! I have a similar problem with some of my pods in an AWS cluster. This is the error message:
Cilium version is
Is there an existing issue for this?
What happened?
Since a few weeks ago, some pods in my GKE cluster have been getting stuck in the ContainerCreating state. When I run kubectl describe pod, I get this error:
And in the logs of the cilium-agent container I found this:
If I manually delete the pod, it starts without any issue.
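Pods stuck this way can be listed by filtering the default `kubectl get pods -A` output on its STATUS column. A sketch, assuming the standard column layout (NAMESPACE, NAME, READY, STATUS, ...); the pod names in the demo input are made up:

```shell
# Keep the header row plus any rows whose STATUS column (4th with -A)
# reads ContainerCreating, matching the stuck pods described above.
find_stuck() {
  awk 'NR==1 || $4=="ContainerCreating"'
}

# Live usage would be: kubectl get pods -A | find_stuck
# Demo on canned output:
printf '%s\n' \
  'NAMESPACE   NAME         READY   STATUS              RESTARTS   AGE' \
  'default     rabbitmq-0   0/1     ContainerCreating   0          10m' \
  'default     web-6d4b9f   1/1     Running             0          2d' | find_stuck
```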
Cilium Version
Client: 1.13.12 38d04fa903 2024-04-05T00:06:43+00:00 go version go1.21.8 linux/amd64
Daemon: 1.13.12 38d04fa903 2024-04-05T00:06:43+00:00 go version go1.21.8 linux/amd64
KVStore: Ok Disabled
Kubernetes: Ok 1.29 (v1.29.4-gke.1043000) [linux/amd64]
Kubernetes APIs: ["cilium/v2::CiliumLocalRedirectPolicy", "cilium/v2::CiliumNode", "cilium/v2alpha1::CiliumEndpointSlice", "core/v1::Namespace", "core/v1::Node", "core/v1::Pods", "core/v1::Service", "discovery/v1::EndpointSlice", "networking.k8s.io/v1::NetworkPolicy"]
KubeProxyReplacement: Strict [eth0 10.3.2.219 (Direct Routing)]
Host firewall: Disabled
CNI Chaining: generic-veth
CNI Config file: CNI configuration file management disabled
Cilium: Ok 1.13.12 (v1.13.12-38d04fa903)
NodeMonitor: Listening for events on 8 CPUs with 64x4096 of shared memory
IPAM: IPv4: 0/62 allocated from 10.3.48.192/26,
IPv6 BIG TCP: Disabled
BandwidthManager: EDT with BPF [CUBIC] [eth0]
Host Routing: Legacy
Masquerading: Disabled
Controller Status: 77/77 healthy
Proxy Status: OK, ip 169.254.4.6, 0 redirects active on ports 10000-20000
Global Identity Range: min 256, max 65535
Hubble: Ok Current/Max Flows: 63/63 (100.00%), Flows/s: 27.13 Metrics: Ok
Encryption: Disabled
Cluster health: Probe disabled
Kernel Version
6.1.75+ #1 SMP PREEMPT_DYNAMIC Sat Mar 30 14:38:17 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Kubernetes Version
Client Version: v1.28.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.4-gke.1043000
Anything else?
The cluster is in GKE with Dataplane V2 enabled. I don't have control over the Cilium agents; they're managed by Google. This is only happening on my cluster in the rapid channel, so I'm not sure if it's a bug or some incompatibility between the Cilium version and the services affected.
I can't replicate the error and have no clue about the root cause. Any help would be appreciated; thanks in advance!
Code of Conduct