TCP connection drops with LB mode: DSR ON when trying to reach ingress endpoint from outside the cluster #32437

Closed
2 of 3 tasks
egorp500 opened this issue May 9, 2024 · 4 comments
Labels
  • area/loadbalancing: Impacts load-balancing and Kubernetes service implementations
  • feature/dsr: Relates to Cilium's Direct-Server-Return feature for KPR.
  • info-completed: The GH issue has received a reply from the author
  • kind/bug: This is a bug in the Cilium logic.
  • kind/community-report: This was reported by a user in the Cilium community, eg via Slack.
  • needs/triage: This issue requires triaging to establish severity and next steps.

Comments

@egorp500

egorp500 commented May 9, 2024

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

Components:

  1. VMware (vSphere)
  2. Talos Linux (v1.6.1)
  3. Client VM (Ubuntu 22.04.4 LTS)

Cilium installed with kubeProxyReplacement = true.

We have also deployed some pods that are exposed through an Ingress (HTTPS disabled).

When we try to open a TCP connection from the Client VM to the Ingress with DSR ON, the connection drops. Our guess is that this happens because the response arrives from a node IP and not from our ingress endpoint (Grafana LB).

Note: we see that response packets coming from the node IP are being rejected.

Command used: curl -k http://grafana.e2.powercom.dev

Control Plane IP: 10.50.1.193

Talos-node-01 IP: 10.50.1.194

Talos-node-02 IP: 10.50.1.195

Talos-node-03 IP: 10.50.1.196

Grafana LB IP: 10.50.1.209

We use the Cilium L2 LB CRDs:

  1. CiliumL2AnnouncementPolicy
    CiliumL2AnnouncementPolicy.txt
  2. CiliumLoadBalancerIPPool
    CiliumLoadBalancerIPPool.txt

Screenshot and dumpfile of tcpdump
tcpdump
tcpdump_dsr_on.zip

Important: the Nginx Ingress Controller runs on Talos-node-01, so the response comes from 10.50.1.194.
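
To illustrate the guess above, here is a minimal Python sketch of how a client TCP stack matches an incoming segment against the connection it opened. It is a simplified model, not Cilium or kernel code. The LB IP and node IP are taken from this report; the client IP and client port are hypothetical placeholders, and the `Segment`, `ClientSocket`, and `handle_reply` names are made up for the example.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    """An incoming TCP segment, reduced to its addressing fields."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


class ClientSocket(NamedTuple):
    """The connection the client believes it has opened."""
    local_ip: str
    local_port: int
    remote_ip: str
    remote_port: int


def handle_reply(sock: ClientSocket, seg: Segment) -> str:
    """Return "accept" if the segment belongs to sock, else "reset".

    Simplified model: a segment that matches no known connection
    is answered with a reset instead of being delivered.
    """
    matches = (
        seg.dst_ip == sock.local_ip
        and seg.dst_port == sock.local_port
        and seg.src_ip == sock.remote_ip
        and seg.src_port == sock.remote_port
    )
    return "accept" if matches else "reset"


CLIENT_IP = "10.50.1.10"   # hypothetical, not stated in this report
CLIENT_PORT = 40000        # hypothetical ephemeral port
LB_IP = "10.50.1.209"      # Grafana LB IP from this report
NODE_IP = "10.50.1.194"    # Talos-node-01 IP from this report

# The client connected to the LB IP on port 80.
sock = ClientSocket(CLIENT_IP, CLIENT_PORT, LB_IP, 80)

# A reply sourced from the LB IP matches the connection.
expected = Segment(LB_IP, 80, CLIENT_IP, CLIENT_PORT)
# A reply sourced from the node IP does not.
observed = Segment(NODE_IP, 80, CLIENT_IP, CLIENT_PORT)

print(handle_reply(sock, expected))
print(handle_reply(sock, observed))
```

If the guess is right, the replies in the capture correspond to the `observed` case: the source address differs from the address the client connected to, so the client cannot associate them with its connection.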

Cilium Version

Client: 1.15.4 9b3f9a8 2024-04-11T17:25:42-04:00 go version go1.21.9 linux/amd64
Daemon: 1.15.4 9b3f9a8 2024-04-11T17:25:42-04:00 go version go1.21.9 linux/amd64

Cilium Helm chart: 1.15.4

Kernel Version

Linux EEOPE2_talos-01 6.1.69-talos #1 SMP PREEMPT_DYNAMIC Thu Dec 21 15:48:53 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Linux EEOPE2_talos-02 6.1.69-talos #1 SMP PREEMPT_DYNAMIC Thu Dec 21 15:48:53 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Linux EEOPE2_talos-03 6.1.69-talos #1 SMP PREEMPT_DYNAMIC Thu Dec 21 15:48:53 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Kubernetes Version

Client Version: v1.29.4
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.0

Regression

No response

Sysdump

cilium-sysdump-20240508-152222.zip

Relevant log output

No response

Anything else?

This archive includes additional cilium-config and cilium-pod-config files.
cilium-configs.zip

Cilium Users Document

  • Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

  • I agree to follow this project's Code of Conduct
egorp500 added the kind/bug, kind/community-report, and needs/triage labels May 9, 2024
@squeed
Contributor

squeed commented May 16, 2024

This certainly looks like an issue.

Could you try to create a minimal reproducer? Ideally on kind, but other environments would be acceptable. Does it have to be a LoadBalancer service, or are other types of Kubernetes services affected?

squeed added the need-more-info (More information is required to further debug or fix the issue.), area/loadbalancing, and feature/dsr labels May 16, 2024
@julianwiedmann
Member

This might be #32189 - are you using DSR with Geneve dispatch, but BPF masquerading is disabled?

@egorp500
Author

> This might be #32189 - are you using DSR with Geneve dispatch, but BPF masquerading is disabled?

As you can see in the configuration I attached:

  1. DSR ON with Geneve
  2. bpf.masquerading false by default - docs
  3. Masquerading (ipv4/ipv6) ON
  4. Masquerading-to-route-source OFF, but it's for iptables. Ref - docs
    image
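
The combination asked about above (DSR with Geneve dispatch while BPF masquerading is disabled) can be checked mechanically. The following is a minimal Python sketch, not an official Cilium tool. It assumes the settings are available as a flat dictionary in the style of the `cilium-config` ConfigMap; the key names used here (`bpf-lb-mode`, `bpf-lb-dsr-dispatch`, `enable-bpf-masquerade`, `enable-ipv4-masquerade`, `enable-ipv6-masquerade`) are assumptions and should be verified against the attached configuration. The `reported` values are typed by hand from the list above, not read from the attachment.

```python
def is_enabled(cfg: dict, key: str) -> bool:
    """Treat a missing key or any value other than "true" as disabled."""
    return str(cfg.get(key, "false")).strip().lower() == "true"


def matches_suspected_combination(cfg: dict) -> bool:
    """True when DSR uses Geneve dispatch and BPF masquerading is off.

    Key names are assumptions about the cilium-config ConfigMap.
    Only the plain "dsr" mode is checked; other modes are ignored.
    """
    return (
        cfg.get("bpf-lb-mode") == "dsr"
        and cfg.get("bpf-lb-dsr-dispatch") == "geneve"
        and not is_enabled(cfg, "enable-bpf-masquerade")
    )


# Hand-typed from the list in this comment (assumed key names).
reported = {
    "bpf-lb-mode": "dsr",
    "bpf-lb-dsr-dispatch": "geneve",
    "enable-bpf-masquerade": "false",
    "enable-ipv4-masquerade": "true",
    "enable-ipv6-masquerade": "true",
}

print(matches_suspected_combination(reported))
```

Under these assumptions the reported settings match the combination in question, which is consistent with the answer given in this comment.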

github-actions bot added the info-completed label and removed the need-more-info label May 16, 2024
@julianwiedmann
Member

Cool, thanks for confirming! Then yes, let's dup against #32189.

julianwiedmann closed this as not planned May 16, 2024