Unable to create endpoint: Cilium API client timeout exceeded #32399

Open

FranAguiar opened this issue May 7, 2024 · 23 comments
Labels
kind/bug This is a bug in the Cilium logic. kind/community-report This was reported by a user in the Cilium community, eg via Slack. needs/triage This issue requires triaging to establish severity and next steps.

@FranAguiar
FranAguiar commented May 7, 2024

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

For the past few weeks, some pods in my GKE cluster have been getting stuck in the ContainerCreating state. When I run kubectl describe pod, I get this error:

 Warning  FailedCreatePodSandBox  4m45s (x138 over 4h)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "6eec890c6d2dbfefc3fa6ab1bf8db4f81ccb0c2f53ad757fb8f573a2bf9eca68": plugin type="cilium-cni" failed (add): unable to create endpoint: Cilium API client timeout exceeded

And in the logs of the cilium-agent container I found this:

{"containerID":"","datapathPolicyRevision":0,"desiredPolicyRevision":0,"endpointID":2588,"error":"timeout while waiting for initial endpoint generation to complete: context canceled","ipv4":"","ipv6":"","k8sPodName":"/","level":"warning","msg":"Creation of endpoint failed","subsys":"daemon"}
{"containerID":"","datapathPolicyRevision":0,"desiredPolicyRevision":0,"endpointID":795,"error":"unable to resolve identity: failed to assign a global identity for lables: k8s:app.kubernetes.io/component=prometheus,k8s:app.kubernetes.io/instance=monitoring-kube-prometheus-prometheus,k8s:app.kubernetes.io/managed-by=prometheus-operator,k8s:app.kubernetes.io/name=prometheus,k8s:app.kubernetes.io/version=2.48.1,k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=monitoring,k8s:io.cilium.k8s.policy.cluster=default,k8s:io.cilium.k8s.policy.serviceaccount=monitoring-kube-prometheus-prometheus,k8s:io.kubernetes.pod.namespace=monitoring,k8s:operator.prometheus.io/name=monitoring-kube-prometheus-prometheus,k8s:operator.prometheus.io/shard=0,k8s:prometheus=monitoring-kube-prometheus-prometheus","identityLabels":{"app.kubernetes.io/component":{"key":"app.kubernetes.io/component","value":"prometheus","source":"k8s"},"app.kubernetes.io/instance":{"key":"app.kubernetes.io/instance","value":"monitoring-kube-prometheus-prometheus","source":"k8s"},"app.kubernetes.io/managed-by":{"key":"app.kubernetes.io/managed-by","value":"prometheus-operator","source":"k8s"},"app.kubernetes.io/name":{"key":"app.kubernetes.io/name","value":"prometheus","source":"k8s"},"app.kubernetes.io/version":{"key":"app.kubernetes.io/version","value":"2.48.1","source":"k8s"},"io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name":{"key":"io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name","value":"monitoring","source":"k8s"},"io.cilium.k8s.policy.cluster":{"key":"io.cilium.k8s.policy.cluster","value":"default","source":"k8s"},"io.cilium.k8s.policy.serviceaccount":{"key":"io.cilium.k8s.policy.serviceaccount","value":"monitoring-kube-prometheus-prometheus","source":"k8s"},"io.kubernetes.pod.namespace":{"key":"io.kubernetes.pod.namespace","value":"monitoring","source":"k8s"},"operator.prometheus.io/name":{"key":"operator.prometheus.io/name","value":"monitoring-kube-prometheus-prometheus","source":"k8s"},"operator.prometheus.io/shard":{"key":"operator.prometheus.io/shard","value":"0","source":"k8s"},"prometheus":{"key":"prometheus","value":"monitoring-kube-prometheus-prometheus","source":"k8s"}},"ipv4":"","ipv6":"","k8sPodName":"/","level":"warning","msg":"Error changing endpoint identity","subsys":"endpoint"}

If I manually delete the pod, it starts without any issue.
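
A minimal sketch of that manual workaround (pod and namespace names are placeholders):

 # list pods stuck waiting for their sandbox
 kubectl get pods -A | grep ContainerCreating
 # deleting the stuck pod lets its controller (StatefulSet, Deployment, ...) recreate it
 kubectl delete pod <stuck-pod> -n <namespace>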

Cilium Version

Client: 1.13.12 38d04fa903 2024-04-05T00:06:43+00:00 go version go1.21.8 linux/amd64
Daemon: 1.13.12 38d04fa903 2024-04-05T00:06:43+00:00 go version go1.21.8 linux/amd64

KVStore: Ok Disabled
Kubernetes: Ok 1.29 (v1.29.4-gke.1043000) [linux/amd64]
Kubernetes APIs: ["cilium/v2::CiliumLocalRedirectPolicy", "cilium/v2::CiliumNode", "cilium/v2alpha1::CiliumEndpointSlice", "core/v1::Namespace", "core/v1::Node", "core/v1::Pods", "core/v1::Service", "discovery/v1::EndpointSlice", "networking.k8s.io/v1::NetworkPolicy"]
KubeProxyReplacement: Strict [eth0 10.3.2.219 (Direct Routing)]
Host firewall: Disabled
CNI Chaining: generic-veth
CNI Config file: CNI configuration file management disabled
Cilium: Ok 1.13.12 (v1.13.12-38d04fa903)
NodeMonitor: Listening for events on 8 CPUs with 64x4096 of shared memory
IPAM: IPv4: 0/62 allocated from 10.3.48.192/26,
IPv6 BIG TCP: Disabled
BandwidthManager: EDT with BPF [CUBIC] [eth0]
Host Routing: Legacy
Masquerading: Disabled
Controller Status: 77/77 healthy
Proxy Status: OK, ip 169.254.4.6, 0 redirects active on ports 10000-20000
Global Identity Range: min 256, max 65535
Hubble: Ok Current/Max Flows: 63/63 (100.00%), Flows/s: 27.13 Metrics: Ok
Encryption: Disabled
Cluster health: Probe disabled

Kernel Version

6.1.75+ #1 SMP PREEMPT_DYNAMIC Sat Mar 30 14:38:17 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Kubernetes Version

Client Version: v1.28.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.4-gke.1043000

Anything else?

The cluster is in GKE with Dataplane V2 enabled. I don't have control over the Cilium agents; they're managed by Google. This is only happening on my cluster in the rapid channel, so I'm not sure whether it's a bug or some incompatibility between the Cilium version and the affected services.

I can't replicate the error and have no clue about the root cause. Any help would be appreciated, thanks in advance!

Code of Conduct

  • I agree to follow this project's Code of Conduct
@FranAguiar FranAguiar added kind/bug This is a bug in the Cilium logic. kind/community-report This was reported by a user in the Cilium community, eg via Slack. needs/triage This issue requires triaging to establish severity and next steps. labels May 7, 2024
@Sh4d1
Contributor

Sh4d1 commented May 10, 2024

Hey, I got literally the same issue. I also opened a support case on GCP to see if it helps; I'll update here if I have any news!

@FranAguiar
Author

Oh, that's great. Please keep me posted.
I also noticed that this only happens to stuff deployed with Helm (maybe it's just a coincidence, but who knows). Is that the case for you too?

@Sh4d1
Contributor

Sh4d1 commented May 10, 2024

Will do! Nope, it's deployed with Pulumi for me (through native k8s objects).
Do you have a lot of pod changes? (Like deploys, HPAs that are often scaling, ...)

@FranAguiar
Author

No, nothing in particular, especially not for RabbitMQ, the service that is suffering the most from this issue.

@r0bj

r0bj commented May 10, 2024

We encountered a similar issue after upgrading our GKE cluster from 1.29.1-gke.1589000 to 1.29.4-gke.1447000. In our case, downgrading the cluster fixed the issue. @Sh4d1, any luck with GCP support?

@Sh4d1
Contributor

Sh4d1 commented May 10, 2024

Interesting! I'm on 1.29.3. And no luck with the support yet (I've linked them the issue as well).

@harispraba

In my case, this happens after a node upgrade, since we use the rapid channel. But not always; sometimes the pod is recreated successfully.

Also, for now this happens only to StatefulSets.

@FranAguiar
Author

@harispraba
Now that you mention it, the services that are experiencing this error in my cluster are also StatefulSets: RabbitMQ, thanos-storage, and Prometheus.

@Sh4d1
Hi, any news from GCP support?
Is the ticket public, or is it private to you?

@Sh4d1
Contributor

Sh4d1 commented May 13, 2024

Hmm, I think it's only happened to StatefulSets on my end as well!

@FranAguiar it's private, and no luck yet (they asked for the full Cilium logs, but I don't have them anymore, so I'm waiting for the next occurrence to catch them).

@squeed
Contributor

squeed commented May 15, 2024

Assigning to @christarazi, who recently worked on endpoint regeneration and statefulset updates.

@voltagebots

Recently faced this as well. In my case, it turns out that Cilium fails to create an endpoint because a label exceeds the 63-character limit.
Looking at the Cilium agent logs on the node, we see:

CiliumIdentity.cilium.io \"47595\" is invalid: metadata.labels: Invalid value: \"xxxxxxx\": must be no more than 63 characters","key":{"LabelArray":

Renaming the chart to a shorter name fixed it.
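
If you suspect the same failure mode, a rough heuristic (my assumption, not an official check) is to look for names that end up as CiliumIdentity label values longer than Kubernetes' 63-character limit, e.g. service-account names generated from a long chart/release name (requires jq):

 # list service accounts whose names exceed 63 characters (heuristic only)
 kubectl get serviceaccounts -A -o json | jq -r '
   .items[] | select((.metadata.name | length) > 63)
   | "\(.metadata.namespace)/\(.metadata.name) (\(.metadata.name | length) chars)"'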

@christarazi
Member

Can you try v1.15.5? It contains #31605 which might resolve this problem.

@FranAguiar
Author

Hello, I just updated my GKE cluster to the latest version:

Server Version: v1.30.0-gke.1457000

And it comes with a new Cilium version:

Client: 1.14.7 47ecffbb57 2024-04-25T17:12:33-07:00 go version go1.21.8 linux/amd64
Daemon: 1.14.7 47ecffbb57 2024-04-25T17:12:33-07:00 go version go1.21.8 linux/amd64

I hope that solves the issue.

@JJGadgets

I have the issue described in the OP on 1.15.5 on my homelab bare-metal Talos cluster, but it seems to only happen on one node. All pods on that node fail to schedule, and the only useful logs are the exact ones reported in the OP.

Killing the Cilium pod on that node fixes it for a while until it happens again; sometimes it happens right from the start of that Cilium pod's lifetime and that pod needs to be killed too. cilium-dbg status --verbose shows that this node's endpoints are unreachable, while the other 2 nodes are fine.
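
For reference, a sketch of that workaround, i.e. recreating the Cilium agent pod on the affected node (the node name is a placeholder; the k8s-app=cilium label matches a stock Cilium install and may differ elsewhere):

 NODE="node-name-here"   # placeholder: the affected node
 kubectl -n kube-system get pods -l k8s-app=cilium \
   --field-selector spec.nodeName=$NODE -o name \
   | xargs kubectl -n kube-system delete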

Versions:
Talos: 1.6.4
Kubernetes: 1.29.2
Cilium: 1.15.5

Cilium Helm values (these 2 files get merged by Flux Helm controller, and the hr.yaml will override the config/biohazard/helm-values.yaml if there are conflicting values): https://github.com/JJGadgets/Biohazard/blob/7004140fc1be893e1e35dac1d43148af749eb8da/kube/deploy/core/_networking/cilium/app/hr.yaml https://github.com/JJGadgets/Biohazard/blob/7004140fc1be893e1e35dac1d43148af749eb8da/kube/deploy/core/_networking/cilium/app/config/biohazard/helm-values.yaml

@christarazi
Member

christarazi commented May 16, 2024

@JJGadgets Are the workloads statefulsets? If so, please provide the Cilium logs when that occurs.

@JJGadgets

@christarazi Nope, everything from Deployments, DaemonSets, and Jobs to KubeVirt VMs (which use a custom controller AFAIK).

@christarazi
Member

christarazi commented May 16, 2024

@JJGadgets Ok, that sounds like a separate issue from this thread. It seems that the initial report is for statefulsets. I would encourage you to file a new issue with a sysdump of when the issue occurred.

@JJGadgets

@christarazi Will create the separate issue when I encounter it again; for now the node and its Cilium pod are happy.

@FranAguiar
Author

FranAguiar commented May 17, 2024

Hello, I just updated my GKE cluster to the latest version:

Server Version: v1.30.0-gke.1457000

And it comes with a new Cilium version:

Client: 1.14.7 47ecffbb57 2024-04-25T17:12:33-07:00 go version go1.21.8 linux/amd64
Daemon: 1.14.7 47ecffbb57 2024-04-25T17:12:33-07:00 go version go1.21.8 linux/amd64

I hope that solves the issue.

It just happened again; this is the log from the cilium-agent container:
Explore-logs-2024-05-17 09_20_52.txt

@Sh4d1 Share this log with Google support if you want.

@christarazi
Member

Any version below 1.15.5 will not have the statefulset fix, so please try upgrading to that.

@r0bj

r0bj commented May 17, 2024

In my case, not only were StatefulSets affected, but Deployments were as well.

@christarazi
Member

@r0bj That sounds like a separate issue as mentioned in #32399 (comment)

@liquidiert

Hey! I have a similar problem with some of my pods in an AWS cluster. This is the error message:

kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "f40f8a986e4c53a727336b51fb698fee152f06b3357da0079a5ed204ed7d22a0": plugin type="cilium-cni" failed (add): unable to create endpoint: Cilium API client timeout exceeded

Cilium version is 1.16.0-dev. Has anyone encountered the same issue with that version as well?
