
Stale service conntrack entries causing packet drop #32472

Closed
sypakine opened this issue May 10, 2024 · 13 comments

Labels
area/loadbalancing Impacts load-balancing and Kubernetes service implementations feature/conntrack info-completed The GH issue has received a reply from the author kind/bug This is a bug in the Cilium logic. kind/community-report This was reported by a user in the Cilium community, eg via Slack. needs/triage This issue requires triaging to establish severity and next steps. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages.

Comments

@sypakine

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

Intermittent occurrence of Service backend not found errors:

$ cilium monitor --type drop
...
xx drop (Service backend not found) flow 0x7558de48 to endpoint 0, file bpf_lxc.c line 1480, , identity 44899->unknown: 10.196.161.9:45318 -> 34.1.1.100:443 tcp SYN
xx drop (Service backend not found) flow 0x6de3ccb3 to endpoint 0, file bpf_lxc.c line 1480, , identity 44899->unknown: 10.196.161.9:45318 -> 34.1.1.100:443 tcp SYN
xx drop (Service backend not found) flow 0xb913610f to endpoint 0, file bpf_lxc.c line 1480, , identity 44899->unknown: 10.196.161.9:45318 -> 34.1.1.100:443 tcp SYN

However, there are backends for the service (and other connections to 34.1.1.100:443 from the same source pod work as expected):

# cilium bpf lb list | grep 34.1.1.100:443/i -A 6
34.1.1.100:443/i     10.196.158.2:8443 (88) (8)
                        10.196.179.3:8443 (88) (9)
                        10.196.177.24:8443 (88) (4)
                        10.196.128.2:8443 (88) (6)
                        0.0.0.0:0 (88) (0) [LoadBalancer, Local]
                        10.196.155.225:8443 (88) (7)
                        10.196.179.2:8443 (88) (10)

Turning on datapath debugging revealed that the drops are due to a stale service conntrack entry:

<- endpoint 1804 flow 0xb00723b3 , identity 44899->unknown state unknown ifindex 0 orig-ip 0.0.0.0: 10.196.161.9:45318 -> 34.1.1.100:443 tcp SYN
CPU 04: MARK 0xb00723b3 FROM 1804 DEBUG: Conntrack lookup 1/2: src=10.196.161.9:45318 dst=34.1.1.100:443
CPU 04: MARK 0xb00723b3 FROM 1804 DEBUG: Conntrack lookup 2/2: nexthdr=6 flags=4
CPU 04: MARK 0xb00723b3 FROM 1804 DEBUG: CT entry found lifetime=20637745, revnat=88
CPU 04: MARK 0xb00723b3 FROM 1804 DEBUG: CT verdict: Reply, revnat=88
CPU 04: MARK 0xb00723b3 FROM 1804 DEBUG: Backend service lookup failed: backend_id=272
xx drop (Service backend not found) flow 0xb00723b3 to endpoint 0, file bpf_lxc.c line 1480, , identity 44899->unknown: 10.196.161.9:45318 -> 34.1.1.100:443 tcp SYN

Note that the packet from the client is a SYN segment, but the conntrack lookup treats the flow as an open connection.

So I dumped the service conntrack entry and waited for another request with the same connection tuple:

Before:

$ cilium bpf ct list global -Dd | grep "TCP OUT 34.1.1.100443 -> 10.196.161.9:" | grep "\[ SeenNonSyn \]" | grep "10.196.161.9:45318"

TCP OUT 34.1.1.100:443 -> 10.196.161.9:45318 service expires=25446799 (remaining: 19197 sec(s)) RxPackets=0 RxBytes=314 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=0 TxBytes=0 TxFlagsSeen=0x1a LastTxReport=25362412 Flags=0x0010 [ SeenNonSyn ] RevNAT=88 SourceSecurityID=0 IfIndex=0

Observe flow for tuple:

$ cilium monitor --related-to 1123 | grep "45318"
Listening for events on 16 CPUs with 64x4096 of shared memory
Press Ctrl-C to quit
level=info msg="Initializing dissection cache..." subsys=monitor

xx drop (Service backend not found) flow 0x45ed4e5f to endpoint 0, file bpf_lxc.c line 1480, , identity 44899->unknown: 10.196.161.9:45318 -> 34.1.1.100:443 tcp SYN
xx drop (Service backend not found) flow 0x59b66eb5 to endpoint 0, file bpf_lxc.c line 1480, , identity 44899->unknown: 10.196.161.9:45318 -> 34.1.1.100:443 tcp SYN
xx drop (Service backend not found) flow 0xb5441506 to endpoint 0, file bpf_lxc.c line 1480, , identity 44899->unknown: 10.196.161.9:45318 -> 34.1.1.100:443 tcp SYN
^C
Received an interrupt, disconnecting from monitor...

After:

$ cilium bpf ct list global -Dd | grep "TCP OUT 34.1.1.100:443 -> 10.196.161.9:" | grep "\[ SeenNonSyn \]" | grep "10.196.161.9:45318"

TCP OUT 34.1.1.100:443 -> 10.196.161.9:45318 service expires=25456278 (remaining: 21578 sec(s)) RxPackets=0 RxBytes=314 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=0 TxBytes=0 TxFlagsSeen=0x1a LastTxReport=25371891 Flags=0x0010 [ SeenNonSyn ] RevNAT=88 SourceSecurityID=0 IfIndex=0

So it looks like even on a service backend lookup miss, there is no feedback to the conntrack state to invalidate the entry. If a client makes frequent short-lived connections to the service, there is a high probability it will reuse a stale connection tuple as it cycles through its ephemeral source ports.
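
To make the failure mode concrete, here is a tiny runnable model of the behaviour described above (illustrative only, not the Cilium datapath; handle_syn, ct_entry, and backend_alive are toy names):

/* Toy model: a service CT entry caches a backend ID; when that backend
 * disappears, nothing invalidates the entry, so every SYN that reuses
 * the tuple keeps hitting the stale entry and is dropped.
 * Build: cc -o ct_model ct_model.c && ./ct_model */
#include <stdbool.h>
#include <stdio.h>

#define MAX_BACKENDS 16

static bool backend_alive[MAX_BACKENDS];   /* stands in for the backend map */

struct ct_entry { bool valid; int backend_id; };
static struct ct_entry ct;                 /* one cached service CT entry */

static const char *handle_syn(void)
{
	if (ct.valid) {
		/* CT hit: trust the cached backend ID, like the "CT verdict:
		 * Reply" path in the traces above. */
		if (!backend_alive[ct.backend_id])
			return "drop (Service backend not found)"; /* entry survives */
		return "forward to cached backend";
	}
	/* CT miss: select a live backend and cache it. */
	for (int i = 0; i < MAX_BACKENDS; i++) {
		if (backend_alive[i]) {
			ct = (struct ct_entry){ .valid = true, .backend_id = i };
			return "forward to fresh backend";
		}
	}
	return "drop (no service)";
}

int main(void)
{
	backend_alive[3] = true;                                  /* a live backend exists */
	ct = (struct ct_entry){ .valid = true, .backend_id = 7 }; /* stale entry */

	for (int i = 1; i <= 3; i++)
		printf("SYN %d: %s\n", i, handle_syn());
	return 0;
}

Every iteration prints the same drop, mirroring the repeated Service backend not found messages in the monitor output above.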


One observation is that a SYN segment seen for an open conntrack entry is fine, as long as the service lookup resolves to an existing backend index:

<- endpoint 1804 flow 0x2847f5e0 , identity 44899->unknown state unknown ifindex 0 orig-ip 0.0.0.0: 10.196.161.9:56744 -> 34.1.1.100:443 tcp SYN
CPU 07: MARK 0x2847f5e0 FROM 1804 DEBUG: Conntrack lookup 1/2: src=10.196.161.9:56744 dst=34.1.1.100:443
CPU 07: MARK 0x2847f5e0 FROM 1804 DEBUG: Conntrack lookup 2/2: nexthdr=6 flags=4
CPU 07: MARK 0x2847f5e0 FROM 1804 DEBUG: CT entry found lifetime=20657581, revnat=88
CPU 07: MARK 0x2847f5e0 FROM 1804 DEBUG: CT verdict: Reply, revnat=88
# No log line for it, but this means the service backend was found for the index
CPU 07: MARK 0x2847f5e0 FROM 1804 DEBUG: Conntrack lookup 1/2: src=10.196.161.9:56744 dst=10.196.169.6:8443
CPU 07: MARK 0x2847f5e0 FROM 1804 DEBUG: Conntrack lookup 2/2: nexthdr=6 flags=1
CPU 07: MARK 0x2847f5e0 FROM 1804 DEBUG: CT entry found lifetime=20573245, revnat=88
CPU 07: MARK 0x2847f5e0 FROM 1804 DEBUG: CT verdict: Interface Decrypted, revnat=88
CPU 01: MARK 0x2847f5e0 FROM 1804 DEBUG: Successfully mapped addr=10.196.169.6 to identity=782
CPU 01: MARK 0x2847f5e0 FROM 1804 DEBUG: Matched L4 policy; creating conntrack src=782 dst=44899 dport=8443 proto=6
Policy verdict log: flow 0x2847f5e0 local EP ID 1804, remote ID 782, proto 6, egress, action allow, match L3-L4, 10.196.161.9:56744 -> 10.196.169.6:8443 tcp SYN
-> stack flow 0x2847f5e0 , identity 44899->782 state reopened ifindex 0 orig-ip 0.0.0.0: 10.196.161.9:56744 -> 10.196.169.6:8443 tcp SYN
...

### Cilium Version

Client: 1.12.10 c0f319ad22 2024-02-06T18:46:58+00:00 go version go1.20.14 linux/amd64

### Kernel Version

Linux node_name 5.15.146+ #1 SMP Sat Feb 17 13:12:02 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

### Kubernetes Version

Server Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.11-gke.1062001", GitCommit:"2cefeadcb4ec7d21d775d15012b02d3393a53548", GitTreeState:"clean", BuildDate:"2024-04-16T20:17:53Z", GoVersion:"go1.21.7 X:boringcrypto", Compiler:"gc", Platform:"linux/amd64"}

### Regression

_No response_

### Sysdump

_No response_

### Relevant log output

_No response_

### Anything else?

I have been unable to reproduce the issue. My attempt has been to:

1. Scale the service backend deployment to 100 (to create many backends to be indexed in the service map)
2. Create a long-lived TCP connection from a client to a service backend
3. After the TCP handshake, sever the connection (turn down the lxc interface)
4. Scale the backends down to 1
5. Kill the client container via `crictl`
6. Turn the lxc interface back up
7. Reuse the connection tuple to create a new connection

---

Also, is there a way to show the backend index for a service-type conntrack entry? Even with JSON/YAML output I don't see it.

### Cilium Users Document

- [ ] Are you a user of Cilium? Please add yourself to the [Users doc](https://github.com/cilium/cilium/blob/main/USERS.md)

### Code of Conduct

- [X] I agree to follow this project's Code of Conduct
@sypakine sypakine added kind/bug This is a bug in the Cilium logic. kind/community-report This was reported by a user in the Cilium community, eg via Slack. needs/triage This issue requires triaging to establish severity and next steps. labels May 10, 2024
@julianwiedmann julianwiedmann added sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. feature/conntrack area/loadbalancing Impacts load-balancing and Kubernetes service implementations labels May 13, 2024
@julianwiedmann
Member

👋 interesting find!

Client: 1.12.10 c0f319ad22 2024-02-06T18:46:58+00:00 go version go1.20.14 linux/amd64

Could we please discuss this in the context of a supported release? Ideally with a repro on latest main; the relevant code has changed a bit recently. It probably makes sense to peek at `backend = lb4_lookup_backend(ctx, backend_id);` and see how it matches your theory / observed symptoms.

@julianwiedmann julianwiedmann added the need-more-info More information is required to further debug or fix the issue. label May 13, 2024
@sypakine
Author

Understood.

I'm almost certain that https://sourcegraph.com/github.com/cilium/cilium/-/commit/c7d547f330cc497bad0d6a1d0f7f3785953d9a98?visible=1 will fix it, as this is where we are getting the drop. This is inferred from the missing LB4 slot lookup debug messages in the log output above.

I'll continue attempting to reproduce the issue and confirm the hypothesis above.

@github-actions github-actions bot added info-completed The GH issue has received a reply from the author and removed need-more-info More information is required to further debug or fix the issue. labels May 13, 2024
@sypakine
Author

sypakine commented May 15, 2024

I managed to reproduce the issue on 1.15.1 -- the key was to use an LB service with externalTrafficPolicy: Local.

REPO_ROOT=$PWD
KUBEPROXY_MODE="none" make kind && \
make kind-image

helm upgrade -i cilium ./install/kubernetes/cilium \
  --wait \
  --namespace kube-system \
  --set k8sServiceHost="kind-control-plane" \
  --set k8sServicePort="6443" \
  --set debug.enabled=true \
  --set pprof.enabled=true \
  --set enableIPv4Masquerade=false \
  --set enableIPv6Masquerade=false \
  --set hostFirewall.enabled=false \
  --set socketLB.hostNamespaceOnly=true \
  --set kubeProxyReplacement=true \
  --set nodeinit.enabled=true \
  --set ipam.mode=kubernetes \
  --set ipv4.enabled=true \
  --set ipv4NativeRoutingCIDR=10.244.0.0/16 \
  --set ipv6.enabled=false \
  --set image.override="localhost:5000/cilium/cilium-dev:local" \
  --set image.pullPolicy=Never \
  --set operator.image.override="localhost:5000/cilium/operator-generic:local" \
  --set operator.image.pullPolicy=Never \
  --set operator.image.suffix="" \
  --set securityContext.privileged=true \
  --set gatewayAPI.enabled=false \
  --set socketLB.enabled=false \
  --set bpf.hostLegacyRouting=true \
  --set endpointRoutes.enabled=true \
  --set localRedirectPolicy=true

kind export kubeconfig --name kind
alias k=kubectl

k apply -f - <<EOF
apiVersion: "cilium.io/v2alpha1"
kind: CiliumLoadBalancerIPPool
metadata:
  name: "default"
spec:
  blocks:
  - cidr: "172.19.250.0/24"
EOF

k apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: server
spec:
  selector:
    app: server
  ports:
  - port: 80
    name: http
  externalTrafficPolicy: Local
  type: LoadBalancer
EOF


time while ! [[ "$( kubectl get svc server -ojsonpath='{.status.loadBalancer.ingress[].ip}' )" =~ ^172\..* ]]; do echo -n "."; sleep 1; done

k label node kind-worker app=client

k apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: client
  labels:
    app: client
spec:
  nodeSelector:
    app: client
  containers:
  - name: client
    image: nginx:latest
    command:
    - sleep
    args:
    - "999999"
EOF

k apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: server
  labels:
    app: server
spec:
  replicas: 20
  selector:
    matchLabels:
      app: server
  template:
    metadata:
      labels:
        app: server
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: client
            topologyKey: kubernetes.io/hostname
      containers:
      - name: server
        image: nginx
        ports:
        - containerPort: 80
EOF

k rollout status deployment server

host=$( kubectl get pod client \
    -ojsonpath='{ .spec.nodeName }' )
agent=$( kubectl get pod -n kube-system \
    -l k8s-app=cilium \
    --field-selector=spec.nodeName=${host} \
    -ojsonpath='{.items[].metadata.name}' )

# block FINs from leaving the pod
docker exec -it kind-worker bash

# within the kind node
container=$( crictl ps | grep $(crictl pods | grep client | awk '{ print $1 }') | awk '{ print $1 }' )
pid=$( crictl inspect ${container} | grep pid | head -1 | grep -Eo "[0-9]+" )
nsenter -t ${pid} -n bash

# within the container namespace
iptables -I OUTPUT --protocol tcp --destination-port 80 --tcp-flags FIN FIN -j DROP
exit  # container namespace

exit  # kind node

# generate traffic / ct entries
lbip=$(kubectl get svc server -ojsonpath='{.status.loadBalancer.ingress[].ip}')
k exec -it client -- curl -s -o /dev/null ${lbip}:80
k exec -it client -- curl -s -o /dev/null ${lbip}:80
k exec -it client -- curl -s -o /dev/null ${lbip}:80
k exec -it client -- curl -s -o /dev/null ${lbip}:80
k exec -it client -- curl -s -o /dev/null ${lbip}:80
k exec -it client -- curl -s -o /dev/null ${lbip}:80
k exec -it client -- curl -s -o /dev/null ${lbip}:80
k exec -it client -- curl -s -o /dev/null ${lbip}:80

# scale down deployment so we get a service backend lookup miss
kubectl patch deployment server --patch '
spec:
  replicas: 1
'

k exec -it -n kube-system ${agent} -c cilium-agent -- bash
cilium config Debug=Enabled
cilium config DebugLB=Enabled
cilium config MonitorAggregationLevel=None
watch -n2 "cilium bpf ct list global -Dd | grep 172.19.250.1 | grep -v 'remaining: -' | grep -v TxClosing"

Example CT entries -- we'll reuse 10.244.1.188:60590 -> 172.19.250.1:80

TCP SVC 10.244.1.188:60590 -> 172.19.250.1:80 expires=2009248 (remaining: 7917 sec(s)) RxPackets=0 RxBytes=10 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=0 TxBytes=0 TxFlagsSeen=0x1a LastTxReport=2001248 Flags=0x0010 [ SeenNonSyn ] RevNAT=11 SourceSecurityID=0 IfIndex=0
TCP SVC 10.244.1.188:35704 -> 172.19.250.1:80 expires=2009328 (remaining: 7997 sec(s)) RxPackets=0 RxBytes=13 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=0 TxBytes=0 TxFlagsSeen=0x1a LastTxReport=2001328 Flags=0x0010 [ SeenNonSyn ] RevNAT=11 SourceSecurityID=0 IfIndex=0

In another terminal, issue a curl from the client pod using the source port observed above (e.g. 60590).

NOTE: you may need to wait for the port binding to be released before the port can be reused

k exec -it client -- curl --local-port 60590 172.19.250.1:80
curl: (7) Failed to connect to 172.19.250.1 port 80 after 0 ms: Couldn't connect to server
k exec -it client -- curl --local-port 55555 172.19.250.1:80
<!DOCTYPE html>
...

Debug output for that flow with CT collision:

k exec -it -n kube-system ${agent} -c cilium-agent -- cilium monitor -Dv --related-to $( k get cep client -ojsonpath='{ .status.id }' )
...
<- endpoint 3046 flow 0xb4cc68f6 , identity 57167->unknown state unknown ifindex 0 orig-ip 0.0.0.0: 10.244.1.188:60590 -> 172.19.250.1:80 tcp SYN
CPU 28: MARK 0xb4cc68f6 FROM 3046 DEBUG: Conntrack lookup 1/2: src=10.244.1.188:60590 dst=172.19.250.1:80
CPU 28: MARK 0xb4cc68f6 FROM 3046 DEBUG: Conntrack lookup 2/2: nexthdr=6 flags=4 dir=2 scope=1
CPU 28: MARK 0xb4cc68f6 FROM 3046 DEBUG: CT entry found lifetime=2009568, revnat=11
CPU 28: MARK 0xb4cc68f6 FROM 3046 DEBUG: CT verdict: Reply, revnat=11
CPU 28: MARK 0xb4cc68f6 FROM 3046 DEBUG: Backend service lookup failed: backend_id=10
-> lxc9cc27971afd0: 172.19.250.1 -> 10.244.1.188 DestinationUnreachable(Port)
CPU 28: MARK 0xb4cc68f6 FROM 3046 DEBUG: Inheriting identity=2 from stack
<- stack flow 0xb4cc68f6 , identity world->unknown state unknown ifindex lxc9cc27971afd0 orig-ip 0.0.0.0: 172.19.250.1 -> 10.244.1.188 DestinationUnreachable(Port)
CPU 28: MARK 0xb4cc68f6 FROM 3046 DEBUG: Conntrack lookup 1/2: src=172.19.250.1:0 dst=10.244.1.188:0
CPU 28: MARK 0xb4cc68f6 FROM 3046 DEBUG: Conntrack lookup 2/2: nexthdr=1 flags=2 dir=1 scope=2
CPU 28: MARK 0xb4cc68f6 FROM 3046 DEBUG: CT entry found lifetime=2001628, revnat=0
CPU 28: MARK 0xb4cc68f6 FROM 3046 DEBUG: CT verdict: Established, revnat=0
CPU 28: MARK 0xb4cc68f6 FROM 3046 DEBUG: Successfully mapped addr=172.19.250.1 to identity=2
CPU 28: MARK 0xb4cc68f6 FROM 3046 DEBUG: Attempting local delivery for container id 3046 from seclabel 57167
-> endpoint 3046 flow 0xb4cc68f6 , identity world->57167 state established ifindex 0 orig-ip 172.19.250.1: 172.19.250.1 -> 10.244.1.188 DestinationUnreachable(Port)

@sypakine
Author

sypakine commented May 15, 2024

This is not reproducible on main* using my reproduction steps; toggling the interface state clears the conntrack entries. I'll have to cherry-pick the potential fix onto 1.15 and see if I can still reproduce.

  • Daemon: 1.16.0-dev 57db22b202 2024-05-02T13:30:32+01:00 go version go1.22.3 linux/amd64

@sypakine
Author

sypakine commented May 16, 2024

Confirmed that the recent PR fixes the issue:

 k exec -it -n kube-system ${agent} -c cilium-agent -- cilium monitor -Dv --related-to $( k get cep client -ojsonpath='{ .status.id }' )
Listening for events on 48 CPUs with 64x4096 of shared memory
Press Ctrl-C to quit
level=info msg="Initializing dissection cache..." subsys=monitor
<- endpoint 71 flow 0xfd382215 , identity 30992->unknown state unknown ifindex 0 orig-ip 0.0.0.0: 10.244.1.220:37064 -> 172.19.250.1:80 tcp SYN
CPU 18: MARK 0xfd382215 FROM 71 DEBUG: Conntrack lookup 1/2: src=10.244.1.220:37064 dst=172.19.250.1:80
CPU 18: MARK 0xfd382215 FROM 71 DEBUG: Conntrack lookup 2/2: nexthdr=6 flags=4 dir=2 scope=1
CPU 18: MARK 0xfd382215 FROM 71 DEBUG: CT entry found lifetime=71691, revnat=7
CPU 18: MARK 0xfd382215 FROM 71 DEBUG: CT verdict: Reply, revnat=7
CPU 18: MARK 0xfd382215 FROM 71 DEBUG: Backend service lookup failed: backend_id=22
CPU 18: MARK 0xfd382215 FROM 71 DEBUG: Service backend slot lookup: slot=1, dport=80
CPU 18: MARK 0xfd382215 FROM 71 DEBUG: Conntrack lookup 1/2: src=10.244.1.220:37064 dst=10.244.0.78:80
CPU 18: MARK 0xfd382215 FROM 71 DEBUG: Conntrack lookup 2/2: nexthdr=6 flags=1 dir=0 scope=2
CPU 18: MARK 0xfd382215 FROM 71 DEBUG: CT verdict: New, revnat=0
CPU 18: MARK 0xfd382215 FROM 71 DEBUG: Successfully mapped addr=10.244.0.78 to identity=44471
CPU 18: MARK 0xfd382215 FROM 71 DEBUG: Conntrack create: proxy-port=0 revnat=7 src-identity=30992 lb=0.0.0.0
CPU 18: MARK 0xfd382215 FROM 71 DEBUG: Encapsulating to node 2886926339 (0xac130003) from seclabel 30992
-> overlay flow 0xfd382215 , identity 30992->44471 state new ifindex cilium_vxlan orig-ip 0.0.0.0: 10.244.1.220:37064 -> 10.244.0.78:80 tcp SYN
CPU 18: MARK 0x2035a252 FROM 71 DEBUG: Inheriting identity=44471 from stack
<- stack flow 0x2035a252 , identity 44471->unknown state unknown ifindex cilium_vxlan orig-ip 0.0.0.0: 10.244.0.78:80 -> 10.244.1.220:37064 tcp SYN, ACK

Can we consider backporting that change?

@julianwiedmann
Member

Can we consider backporting that change?

I remember cleaning up a few things beforehand, to make that PR possible. Let me have a look; I wasn't expecting to fix an actual bug 😲. Do you have an idea what's going wrong exactly / why the patch is helping?

@sypakine
Author

Because the external traffic policy is local:

root@kind-worker:/home/cilium# cilium bpf lb list
SERVICE ADDRESS      BACKEND ADDRESS (REVNAT_ID) (SLOT)
172.19.250.1:80/i    10.244.0.186:80 (4) (1)
                     0.0.0.0:0 (4) (0) [LoadBalancer, Local, two-scopes]

This part is returning NULL:

	key->scope = LB_LOOKUP_SCOPE_EXT;
	key->backend_slot = 0;
	svc = map_lookup_elem(&LB4_SERVICES_MAP_V2, key);
	if (svc) {
		...
	}

	return NULL;

https://sourcegraph.com/github.com/cilium/cilium@a368c8f0f34dfe9a47e8a621af31ea94337f6fb5/-/blob/bpf/lib/lb.h?L1261:16-1261:35

So because of that, we do not try to look up a new backend:

		backend = lb4_lookup_backend(ctx, backend_id);
		if (unlikely(!backend || backend->flags != BE_STATE_ACTIVE)) {
			/* Drain existing connections, but redirect new ones to only
			 * active backends.
			 */
			if (backend && !state->syn)
				break;

			svc = lb4_lookup_service(key, false, true);
			if (!svc)
				goto no_service;

https://sourcegraph.com/github.com/cilium/cilium@v1.15.1/-/blob/bpf/lib/lb.h?L1626-1636

The new logic is much more straightforward (see the sketch after the link below):

  1. Look up the backend from the CT entry
    • backend not found
  2. Are there still service backends?
    • yes
  3. Look up a new backend

https://sourcegraph.com/github.com/cilium/cilium@main/-/blob/bpf/lib/lb.h?L1567-1578
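
As a rough sketch of those three steps (illustrative only; this plugs into the toy model from the earlier comment, handle_syn_main is a hypothetical name, and the real code is behind the lb.h link above):

/* Toy continuation of the earlier model: on main, a SYN whose cached
 * backend is gone simply re-selects from the service's current backends
 * instead of redoing the service lookup. */
static const char *handle_syn_main(bool is_syn)
{
	/* step 1: look up the backend cached in the CT entry */
	if (ct.valid && backend_alive[ct.backend_id])
		return "forward to cached backend";
	/* established flows keep draining to their old backend */
	if (ct.valid && !is_syn)
		return "drain established flow";
	/* steps 2 + 3: does the service still have backends? pick a fresh one */
	for (int i = 0; i < MAX_BACKENDS; i++) {
		if (backend_alive[i]) {
			ct = (struct ct_entry){ .valid = true, .backend_id = i };
			return "forward to fresh backend";
		}
	}
	return "drop (no service)";
}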

@julianwiedmann
Member

Oh, ok - I think I see what you're hitting. I missed that the SVC access is coming from bpf_lxc.

So the initial SVC lookup uses scope_switch = is_defined(ENABLE_NODEPORT), while the fallback path uses scope_switch = false. Is that the problem?! If you temporarily fix that scope_switch parameter on v1.15, does that also solve the issue?

@sypakine
Author

Changing it to the below fixes the issue in my reproduction scenario:

			svc = lb4_lookup_service(key, is_defined(ENABLE_NODEPORT), true);
			if (!svc)
				goto no_service;

I'm not sure of the full implications of this / how safe it is.
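
To illustrate why that one-line change matters, here is a small standalone model of the two-scope lookup (simplified and illustrative only; struct svc and lookup_service are toy names, and the real function is lb4_lookup_service() in bpf/lib/lb.h):

/* For eTP=local, two service entries exist: an external-scope entry with
 * only the node-local backends, and an internal-scope entry with all
 * backends. In-cluster traffic must be allowed to switch scopes.
 * Build: cc -o scope_model scope_model.c && ./scope_model */
#include <stdio.h>
#include <stddef.h>

struct svc { int backend_count; int two_scopes; };

/* eTP=local service, seen from a node with no local backend: */
static struct svc ext_scope = { .backend_count = 0, .two_scopes = 1 };
static struct svc int_scope = { .backend_count = 5, .two_scopes = 1 };

static struct svc *lookup_service(int scope_switch)
{
	struct svc *svc = &ext_scope;      /* EXT scope is always tried first */

	if (!scope_switch || !svc->two_scopes)
		return svc->backend_count ? svc : NULL;
	svc = &int_scope;                  /* scope switch: retry at INT scope */
	return svc->backend_count ? svc : NULL;
}

int main(void)
{
	/* v1.15 fallback path (scope_switch=false): the EXT entry has no
	 * backends, so the lookup returns NULL -> "Service backend not found" */
	printf("scope_switch=false: %s\n",
	       lookup_service(0) ? "svc found" : "NULL -> drop");

	/* with the fix (scope_switch=is_defined(ENABLE_NODEPORT)): the lookup
	 * falls through to the all-backends INT entry */
	printf("scope_switch=true:  %s\n",
	       lookup_service(1) ? "svc found" : "NULL -> drop");
	return 0;
}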

@julianwiedmann
Member

Sweet, thank you! Could you send a PR for v1.15 with that fix (adding a scope_switch parameter to lb*_local())? That seems like a much cleaner, well-isolated fix, and it's backportable further back.

sypakine added a commit to sypakine/cilium that referenced this issue May 17, 2024
Fixes: cilium#32472

When looking up a loadbalancer service in lxc from-container, an incorrect scope_switch value skips the correct return path. Set the scope_switch value to be consistent with the earlier service lookup.

Note: this change only applies to v1.15 and below, as v1.16 has been refactored to prevent the lb4_lookup_service call.

This can lead to packet drops due to failed service lookup when a stale CT entry maps to a backend ID that is no longer present in the service map.

Signed-off-by: Mark St John <markstjohn@google.com>
sypakine added a commit to sypakine/cilium that referenced this issue May 17, 2024
When looking up a loadbalancer service in lxc from-container, the scope_switch value for flows matching an existing conntrack entry may skip the correct return path. Set the scope_switch value to be consistent with the earlier service lookup.

Note: this change only applies to v1.15 and below, as v1.16 has been refactored to prevent the lb4_lookup_service call.

This can lead to packet drops due to failed service lookup when a stale CT entry maps to a backend ID that is no longer present in the service map.

Fixes: cilium#32472

Signed-off-by: Mark St John <markstjohn@google.com>
@sypakine
Author

Thanks @julianwiedmann -- done.

I don't see any BPF unit tests for this function, so let me know if you think it prudent to create a test for this change.

@julianwiedmann
Member

I don't see any BPF unit tests for this function, so let me know if you think it prudent to create a test for this change.

Good question! I think for your specific test we should be alright; the bugfix is obvious enough. But having more test coverage for eTP=local scenarios would indeed be good. I'm just not sure whether BPF unit tests are the right scope for that - the map programming is complicated enough that we could easily drift from how the agent sets up the two service entries.

I'd have a bit more confidence in e2e tests for the different eTP=local scenarios (external / in-cluster access, node with/without local backend, selected backend gets terminated, ...).

sypakine added a commit to sypakine/cilium that referenced this issue May 21, 2024
For eTP=local services, we add two entries to the service map - one with only the local backends, and a second entry with all backends. In-cluster access is meant to use the wider-scope entry.

Due to the incorrect scope switch, in-cluster access to eTP=local services doesn't fall through to the "all backends" entry. As a result, a client on a node with no local backend would be unable to fall back if its initial backend terminates.

Note: this change only applies to v1.15 and below, as v1.16 has been refactored to prevent the lb4_lookup_service call.

Fixes: cilium#32472

Signed-off-by: Mark St John <markstjohn@google.com>
lmb pushed a commit that referenced this issue May 22, 2024
@julianwiedmann
Member

Closing via #32608.
