Unable to create endpoint: Cilium API client timeout exceeded #32399

Open

FranAguiar opened this issue May 7, 2024 · 23 comments
Labels
kind/bug This is a bug in the Cilium logic. kind/community-report This was reported by a user in the Cilium community, eg via Slack. needs/triage This issue requires triaging to establish severity and next steps.

@FranAguiar
FranAguiar commented May 7, 2024

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

For the past few weeks, some pods in my GKE cluster have been getting stuck in the ContainerCreating state. When I run kubectl describe pod, I get this error:

 Warning  FailedCreatePodSandBox  4m45s (x138 over 4h)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "6eec890c6d2dbfefc3fa6ab1bf8db4f81ccb0c2f53ad757fb8f573a2bf9eca68": plugin type="cilium-cni" failed (add): unable to create endpoint: Cilium API client timeout exceeded

And in the logs of the cilium-agent container I found this:

{"containerID":"","datapathPolicyRevision":0,"desiredPolicyRevision":0,"endpointID":2588,"error":"timeout while waiting for initial endpoint generation to complete: context canceled","ipv4":"","ipv6":"","k8sPodName":"/","level":"warning","msg":"Creation of endpoint failed","subsys":"daemon"}
{"containerID":"","datapathPolicyRevision":0,"desiredPolicyRevision":0,"endpointID":795,"error":"unable to resolve identity: failed to assign a global identity for lables: k8s:app.kubernetes.io/component=prometheus,k8s:app.kubernetes.io/instance=monitoring-kube-prometheus-prometheus,k8s:app.kubernetes.io/managed-by=prometheus-operator,k8s:app.kubernetes.io/name=prometheus,k8s:app.kubernetes.io/version=2.48.1,k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=monitoring,k8s:io.cilium.k8s.policy.cluster=default,k8s:io.cilium.k8s.policy.serviceaccount=monitoring-kube-prometheus-prometheus,k8s:io.kubernetes.pod.namespace=monitoring,k8s:operator.prometheus.io/name=monitoring-kube-prometheus-prometheus,k8s:operator.prometheus.io/shard=0,k8s:prometheus=monitoring-kube-prometheus-prometheus","identityLabels":{"app.kubernetes.io/component":{"key":"app.kubernetes.io/component","value":"prometheus","source":"k8s"},"app.kubernetes.io/instance":{"key":"app.kubernetes.io/instance","value":"monitoring-kube-prometheus-prometheus","source":"k8s"},"app.kubernetes.io/managed-by":{"key":"app.kubernetes.io/managed-by","value":"prometheus-operator","source":"k8s"},"app.kubernetes.io/name":{"key":"app.kubernetes.io/name","value":"prometheus","source":"k8s"},"app.kubernetes.io/version":{"key":"app.kubernetes.io/version","value":"2.48.1","source":"k8s"},"io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name":{"key":"io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name","value":"monitoring","source":"k8s"},"io.cilium.k8s.policy.cluster":{"key":"io.cilium.k8s.policy.cluster","value":"default","source":"k8s"},"io.cilium.k8s.policy.serviceaccount":{"key":"io.cilium.k8s.policy.serviceaccount","value":"monitoring-kube-prometheus-prometheus","source":"k8s"},"io.kubernetes.pod.namespace":{"key":"io.kubernetes.pod.namespace","value":"monitoring","source":"k8s"},"operator.prometheus.io/name":{"key":"operator.prometheus.io/name","value":"monitoring-kube-prometheus-prometheus","source":"k8s"},"operator.prometheus.io/shard":{"key":"operator.prometheus.io/shard","value":"0","source":"k8s"},"prometheus":{"key":"prometheus","value":"monitoring-kube-prometheus-prometheus","source":"k8s"}},"ipv4":"","ipv6":"","k8sPodName":"/","level":"warning","msg":"Error changing endpoint identity","subsys":"endpoint"}

If I manually delete the pod, it starts without any issue.
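
A minimal sketch of that manual workaround (pod and namespace names are placeholders):

 # list pods stuck waiting for their sandbox
 kubectl get pods -A | grep ContainerCreating
 # deleting the stuck pod lets its controller (StatefulSet, Deployment, ...) recreate it
 kubectl delete pod <stuck-pod> -n <namespace>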

Cilium Version

Client: 1.13.12 38d04fa903 2024-04-05T00:06:43+00:00 go version go1.21.8 linux/amd64
Daemon: 1.13.12 38d04fa903 2024-04-05T00:06:43+00:00 go version go1.21.8 linux/amd64

KVStore: Ok Disabled
Kubernetes: Ok 1.29 (v1.29.4-gke.1043000) [linux/amd64]
Kubernetes APIs: ["cilium/v2::CiliumLocalRedirectPolicy", "cilium/v2::CiliumNode", "cilium/v2alpha1::CiliumEndpointSlice", "core/v1::Namespace", "core/v1::Node", "core/v1::Pods", "core/v1::Service", "discovery/v1::EndpointSlice", "networking.k8s.io/v1::NetworkPolicy"]
KubeProxyReplacement: Strict [eth0 10.3.2.219 (Direct Routing)]
Host firewall: Disabled
CNI Chaining: generic-veth
CNI Config file: CNI configuration file management disabled
Cilium: Ok 1.13.12 (v1.13.12-38d04fa903)
NodeMonitor: Listening for events on 8 CPUs with 64x4096 of shared memory
IPAM: IPv4: 0/62 allocated from 10.3.48.192/26,
IPv6 BIG TCP: Disabled
BandwidthManager: EDT with BPF [CUBIC] [eth0]
Host Routing: Legacy
Masquerading: Disabled
Controller Status: 77/77 healthy
Proxy Status: OK, ip 169.254.4.6, 0 redirects active on ports 10000-20000
Global Identity Range: min 256, max 65535
Hubble: Ok Current/Max Flows: 63/63 (100.00%), Flows/s: 27.13 Metrics: Ok
Encryption: Disabled
Cluster health: Probe disabled

Kernel Version

6.1.75+ #1 SMP PREEMPT_DYNAMIC Sat Mar 30 14:38:17 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Kubernetes Version

Client Version: v1.28.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.4-gke.1043000

Anything else?

The cluster is in GKE with Dataplane V2 enabled. I don't have control over the Cilium agents; they're managed by Google. This is only happening on my cluster in the rapid channel, so I'm not sure whether it's a bug or some incompatibility between the Cilium version and the affected services.

I can't replicate the error and have no clue about the root cause. Any help would be appreciated, thanks in advance!

Code of Conduct

  • I agree to follow this project's Code of Conduct
@FranAguiar FranAguiar added kind/bug This is a bug in the Cilium logic. kind/community-report This was reported by a user in the Cilium community, eg via Slack. needs/triage This issue requires triaging to establish severity and next steps. labels May 7, 2024
@Sh4d1
Contributor

Sh4d1 commented May 10, 2024

Hey, I got literally the same issue. I also opened a support case on GCP to see if it helps; I'll update here if I have any news!

@FranAguiar
Author

Oh, that's great. Please keep me posted.
I also noticed that this only happens to stuff deployed with Helm (maybe it's just a coincidence, but who knows). Is that the case for you too?

@Sh4d1
Contributor

Sh4d1 commented May 10, 2024

Will do! Nope, it's deployed with Pulumi for me (through native k8s objects).
Do you have a lot of pod changes? (Like deploys, HPAs that are often scaling, ...)

@FranAguiar
Author

No, nothing in particular, especially not for RabbitMQ, the service that is suffering the most from this issue.

@r0bj

r0bj commented May 10, 2024

We encountered a similar issue after upgrading our GKE cluster from 1.29.1-gke.1589000 to 1.29.4-gke.1447000. In our case, downgrading the cluster fixed the issue. @Sh4d1, any luck with GCP support?

@Sh4d1
Contributor

Sh4d1 commented May 10, 2024

Interesting! I'm on 1.29.3. And no luck with the support yet (I've linked them the issue as well).

@harispraba

In my case, this happens after a node upgrade, since we use the rapid channel. But not always; sometimes the pod is recreated successfully.

Also, for now this happens only to StatefulSets.

@FranAguiar
Author

@harispraba
Now that you mention it, the services that are experiencing this error in my cluster are also StatefulSets: RabbitMQ, thanos-storage, and Prometheus.

@Sh4d1
Hi, any news from GCP support?
Is the ticket public, or is it private to you?

@Sh4d1
Contributor

Sh4d1 commented May 13, 2024

Hmm, I think it's only happened to StatefulSets on my end as well!

@FranAguiar it's private, and no luck yet (they asked for the full Cilium logs, but I don't have them anymore, so I'm waiting for the next occurrence to catch them).

@squeed
Contributor

squeed commented May 15, 2024

Assigning to @christarazi, who recently worked on endpoint regeneration and statefulset updates.

@voltagebots

Recently faced this as well. In my case, it turns out that Cilium fails to create an endpoint because a label exceeds the 63-character limit.
Looking at the Cilium agent logs on the node, we see:

CiliumIdentity.cilium.io \"47595\" is invalid: metadata.labels: Invalid value: \"xxxxxxx\": must be no more than 63 characters","key":{"LabelArray":

Renaming the chart to a shorter name fixed it.
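
If you suspect the same failure mode, a rough heuristic (my assumption, not an official check) is to look for names that end up as CiliumIdentity label values longer than Kubernetes' 63-character limit, e.g. service-account names generated from a long chart/release name (requires jq):

 # list service accounts whose names exceed 63 characters (heuristic only)
 kubectl get serviceaccounts -A -o json | jq -r '
   .items[] | select((.metadata.name | length) > 63)
   | "\(.metadata.namespace)/\(.metadata.name) (\(.metadata.name | length) chars)"'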

@christarazi
Member

Can you try v1.15.5? It contains #31605 which might resolve this problem.

@FranAguiar
Author

Hello, I just updated my GKE cluster to the latest version:

Server Version: v1.30.0-gke.1457000

And it comes with a new Cilium version:

Client: 1.14.7 47ecffbb57 2024-04-25T17:12:33-07:00 go version go1.21.8 linux/amd64
Daemon: 1.14.7 47ecffbb57 2024-04-25T17:12:33-07:00 go version go1.21.8 linux/amd64

I hope that solves the issue.

@JJGadgets

I have the issue described in the OP on 1.15.5 on my homelab bare-metal Talos cluster, but it seems to only happen on one node. All pods on that node fail to schedule, and the only useful logs are the exact ones reported in the OP.

Killing the Cilium pod on that node fixes it for a while until it happens again; sometimes it happens right from the start of that Cilium pod's lifetime and that pod needs to be killed too. cilium-dbg status --verbose shows that this node's endpoints are unreachable, while the other 2 nodes are fine.
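
For reference, a sketch of that workaround, i.e. recreating the Cilium agent pod on the affected node (the node name is a placeholder; the k8s-app=cilium label matches a stock Cilium install and may differ elsewhere):

 NODE="node-name-here"   # placeholder: the affected node
 kubectl -n kube-system get pods -l k8s-app=cilium \
   --field-selector spec.nodeName=$NODE -o name \
   | xargs kubectl -n kube-system delete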

Versions:
Talos: 1.6.4
Kubernetes: 1.29.2
Cilium: 1.15.5

Cilium Helm values (these 2 files get merged by Flux Helm controller, and the hr.yaml will override the config/biohazard/helm-values.yaml if there are conflicting values): https://github.com/JJGadgets/Biohazard/blob/7004140fc1be893e1e35dac1d43148af749eb8da/kube/deploy/core/_networking/cilium/app/hr.yaml https://github.com/JJGadgets/Biohazard/blob/7004140fc1be893e1e35dac1d43148af749eb8da/kube/deploy/core/_networking/cilium/app/config/biohazard/helm-values.yaml

@christarazi
Member

christarazi commented May 16, 2024

@JJGadgets Are the workloads statefulsets? If so, please provide the Cilium logs when that occurs.

@JJGadgets

@christarazi Nope, everything from Deployments, DaemonSets, and Jobs to KubeVirt VMs (which use a custom controller AFAIK).

@christarazi
Member

christarazi commented May 16, 2024

@JJGadgets Ok, that sounds like a separate issue from this thread. It seems that the initial report is for statefulsets. I would encourage you to file a new issue with a sysdump of when the issue occurred.

@JJGadgets

@christarazi Will create the separate issue when I encounter it again; for now the node and its Cilium pod are happy.

@FranAguiar
Author

FranAguiar commented May 17, 2024

Hello, I just updated my GKE cluster to the latest version:

Server Version: v1.30.0-gke.1457000

And it comes with a new Cilium version:

Client: 1.14.7 47ecffbb57 2024-04-25T17:12:33-07:00 go version go1.21.8 linux/amd64
Daemon: 1.14.7 47ecffbb57 2024-04-25T17:12:33-07:00 go version go1.21.8 linux/amd64

I hope that solves the issue.

It just happened again; this is the log from the cilium-agent container:
Explore-logs-2024-05-17 09_20_52.txt

@Sh4d1 Share this log with Google support if you want.

@christarazi
Member

Any version below 1.15.5 will not have the statefulset fix, so please try upgrading to that.

@r0bj

r0bj commented May 17, 2024

In my case, not only were StatefulSets affected, but Deployments were as well.

@christarazi
Member

@r0bj That sounds like a separate issue as mentioned in #32399 (comment)

@liquidiert

Hey! I have a similar problem with some of my pods in an AWS cluster. This is the error message:

kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "f40f8a986e4c53a727336b51fb698fee152f06b3357da0079a5ed204ed7d22a0": plugin type="cilium-cni" failed (add): unable to create endpoint: Cilium API client timeout exceeded

Cilium version is 1.16.0-dev. Has anyone encountered the same issue with that version as well?
