bpf: lxc: defer CT_INGRESS entry creation for loopback connections #27602
Conversation
/test
Force-pushed from ba45517 to 45ce1bd
/test
Can you please add me as a reviewer? I will make some time this week to review and test this.
Surprisingly, I cannot reproduce it either with or without this patch; however, the CT entries are different.
Without the patch:
With this patch:
OK @julianwiedmann, I deciphered the problem I was having reproducing it. I need to start from a clean state each time I try a new version, which means flushing the conntrack table and installing a new pod (I do not know how to force the bpf recompilation :/). With this patch I observe the "wrong" behavior, i.e. asymmetry in the timers (though RxClosing is set correctly): it sets the remaining time of TCP_IN to 5s, but TCP_OUT to 21595s.
With versions v1.14.0 and v1.14.1 the timers look OK, but the RxClosing state is not present.
This is how you can reproduce the issue. The problem is that it is not easy to automate, as it depends on the timers being incorrect and eventually getting one of these entries that has expired in one direction but not in the other; that may take some time, O(min).
Now these are the steps to reproduce. Every time you install a new Cilium version you have to start from here. Install a Pod with a Service that references itself and runs a web server:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: server-deployment
  labels:
    app: MyApp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: MyApp
  template:
    metadata:
      labels:
        app: MyApp
    spec:
      nodeName: kind-worker
      containers:
      - name: agnhost
        image: httpd:2
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: myservice
spec:
  type: ClusterIP
  selector:
    app: MyApp
  clusterIP: 10.96.7.7
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80
```

The pod will always run on the kind-worker node.
In another console, log into the corresponding cilium agent to get the conntrack entries.
I talked with @sypakine and he still sees the issue in v1.14.0. It turns out that the behavior exhibited depends on the flags; with this config, v1.14.0 still shows asymmetric values.
Yep, not trying to fix the timers 😁. That would require fixing up all the path symmetry for loopback traffic. Bigger topic. But if the TCP_IN entry expires & gets GC-collected (while the TCP_OUT stays around), you should now see that a new connection re-creates the required TCP_IN entry.
ack,
Understood, cc @sypakine, I'll fall back to the
I no longer hit the problem using the while loop for the
I always hit it before #23913 (comment)
I verified this fix. It does re-create the appropriate TCP_IN entry. Repro:
Force-pushed from 45ce1bd to 275e351
(pushed a slightly updated version with more comments, and tests for the expected CT entries)
/test
OK, now I have a consistent way to reproduce it and verify that this patch works. I don't know how it can be easily automated, since it depends on lowering the value of a config setting. The steps to repro are exactly the same as #27602 (comment); in this case I just patch the cilium config-map to set a value of 60 seconds and restart the agents to pick up the new config.
So, once you create the container, you do the request against the service and it hairpins, creating the conntrack entries, which by default I think have 15 seconds (I use the same trick as @sypakine so we have a well-known port to grep for).
Now, after 60 seconds, the GC will delete that entry
Without this patch the connection times out.
With the patch a new entry is created:
I wish I knew how to create a test for this. Ideally we should have some way of forcing the garbage collection of the conntrack entries; the alternative is to go to the BPF map and forcefully delete the entry.
/lgtm
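For the "go to the BPF map and forcefully delete the entry" alternative mentioned above, a minimal userspace sketch using libbpf could look like the following. The pin path and the key layout are assumptions for illustration only; the real IPv4 CT tuple is defined in Cilium's bpf headers and must be matched exactly.

```c
/* Sketch: force-remove one entry from a pinned Cilium CT map via libbpf.
 * The pin path and key layout below are assumptions, not Cilium's definitions. */
#include <bpf/bpf.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>

struct ct_key_guess {          /* illustrative only, NOT the real ipv4_ct_tuple */
	uint32_t daddr;
	uint32_t saddr;
	uint16_t sport;
	uint16_t dport;
	uint8_t  nexthdr;
	uint8_t  flags;
};

int main(void)
{
	/* Assumed pin path for the global IPv4 CT map. */
	int fd = bpf_obj_get("/sys/fs/bpf/tc/globals/cilium_ct4_global");
	if (fd < 0) {
		perror("bpf_obj_get");
		return 1;
	}

	struct ct_key_guess key;
	memset(&key, 0, sizeof(key));  /* fill in the tuple you want to drop */

	if (bpf_map_delete_elem(fd, &key) < 0)
		perror("bpf_map_delete_elem");
	return 0;
}
```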
Thanks for the detailed description, and for making the effort to untangle the loopback datapath. I made an initial pass, still mulling over the fix. Curious to know your thoughts about my alternate approach.
In a nutshell, this PR updates the egress logic that previously also created an ingress entry for the loopback case, and instead lets the ingress path create a CT entry if the corresponding egress entry exists. The main issue here is that the datapath can't update the ingress entry's GC timeout when it updates the egress entry, so there is inherently an asymmetry here.
From that perspective, the change makes sense to me, but it's not ideal that ct logic is so tightly coupled with policy enforcement.
If the same loopback connection is subsequently re-opened, the client's from-container path finds the CT_EGRESS entry and thus will not call ct_create4().
Just for the sake of discussion, can it check for an ingress entry, and create one if it doesn't exist? On a related note, we have logic to select a new backend after some pre-configured time has passed (see ct_entry_expired_rebalance). The same logic should apply for loopback connections as well. Wdyt?
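To make that alternative concrete, here is a rough sketch of what the check could look like on the client's from-container side; every name and type below is a hypothetical placeholder, not Cilium's actual datapath code:

```c
/* Hypothetical sketch of the alternative discussed above: when from-container
 * finds an established loopback CT_EGRESS entry, probe for the matching
 * CT_INGRESS entry and re-create it if the GC already removed it. */
#include <stdbool.h>
#include <stdint.h>

struct ct_state_sketch {
	bool     loopback;        /* mirrors the .loopback flag from the text */
	uint16_t rev_nat_index;   /* mirrors the rev_nat_index field from the text */
};

typedef bool (*ingress_lookup_fn)(const struct ct_state_sketch *egress);
typedef void (*ingress_create_fn)(const struct ct_state_sketch *egress);

static void handle_established_loopback(const struct ct_state_sketch *egress,
					ingress_lookup_fn ingress_exists,
					ingress_create_fn create_ingress)
{
	if (!egress->loopback)
		return;

	/* The CT_EGRESS entry survived garbage collection, but the CT_INGRESS
	 * entry may not have: re-create it so the backend keeps skipping
	 * ingress policy and keeps RevNATing replies. */
	if (!ingress_exists(egress))
		create_ingress(egress);
}
```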
l3_local_delivery() contains a special case to bypass ingress policy for loopback traffic. But when redirecting loopback requests / replies in a per-EP config, they don't actually bypass the policy program. The redirect_ep() delivers them to the host-side veth interface, and ingress policy is applied as the packet passes to the pod-side veth peer.

#27602 recently changed the logic so that `hairpin_flow` is only set for replies. A subtle side-effect of that change is that in a per-EP config, l3_local_delivery() will now copy the src sec identity into the skb->mark. cil_to_container() then extracts it, and tail_ipv4_to_endpoint() no longer performs an ipcache lookup for ip4->saddr (which is IPV4_LOOPBACK, and would result in WORLD_ID).

Right now this is mostly a cosmetic improvement, as the identity is not used for policy enforcement of loopback traffic. But there are still users (eg. drop notifications) in tail_ipv4_to_endpoint() that benefit from having the correct security identity in place. Therefore make the same change in the reply direction, so that we also get the correct security identity there (instead of an ipcache lookup for the Service IP, which would return WORLD_ID).

For this, push the loopback check into the tail-call specific path of l3_local_delivery(), while the per-EP path inserts the identity into skb->mark and then just calls its "normal" redirect_ep().

Reported-by: Timo Beckers <timo@isovalent.com>
Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
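As a rough illustration of the identity hand-off this commit message describes (the helper and its parameters are hypothetical; Cilium's real skb->mark encoding carries additional magic bits):

```c
/* Sketch only: prefer the source security identity carried in skb->mark over
 * an ipcache lookup of the packet's saddr. With the change above, the mark
 * carries the identity for both request and reply direction, so users of
 * tail_ipv4_to_endpoint() such as drop notifications see the real identity
 * instead of WORLD_ID. */
#include <stdint.h>

static uint32_t resolve_src_identity(uint32_t identity_from_mark,
				     uint32_t identity_from_ipcache)
{
	/* identity_from_ipcache would be WORLD_ID for IPV4_LOOPBACK or the
	 * Service IP; only fall back to it when the mark carried nothing. */
	return identity_from_mark ? identity_from_mark : identity_from_ipcache;
}
```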
Currently the CT_INGRESS entry for a loopback connection is already created when the first request leaves the client, as from-container creates the CT_EGRESS entry (see the loopback handling in ct_create4() for details).
This is unusual - we would normally just create the CT_INGRESS entry as the first packet passes through to-container into the backend. But for loopback connections it is needed, so that
1.) to-container finds a CT entry with .loopback set, and thus skips
network policy enforcement even for the first packet, and
2.) the CT entry has its rev_nat_index field populated, and thus can
RevNAT replies in from-container.
This approach conflicts with the fact that loopback replies skip the client's to-container path (to avoid network policy enforcement).
Once the loopback connection is closed, the backend's from-container path observes the FIN / RST, and __ct_lookup4() updates the CT_INGRESS entry's lifetime to CT_CLOSE_TIMEOUT. But the client's to-container path will not observe the FIN / RST, and thus the CT_EGRESS entry's lifetime remains as CT_CONNECTION_LIFETIME_TCP. Therefore the CT_INGRESS entry will expire earlier, and potentially gets garbage-collected while the CT_EGRESS entry stays in place.
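A tiny sketch of that asymmetry (constant names mirror the text above; the values and helpers are placeholders, not Cilium's actual definitions):

```c
/* Why the two CT entries of a loopback connection age out differently: only
 * the backend's from-container path observes the FIN/RST, so only the
 * CT_INGRESS entry gets the short close timeout. */
#include <stdint.h>

#define CT_CONNECTION_LIFETIME_TCP 21600 /* placeholder lifetime, seconds */
#define CT_CLOSE_TIMEOUT           10    /* placeholder post-close lifetime, seconds */

struct ct_entry_sketch {
	uint32_t expires;    /* absolute expiry time, seconds */
};

/* Runs for CT_INGRESS when the FIN/RST passes through from-container. */
static void ct_seen_close(struct ct_entry_sketch *ingress_entry, uint32_t now)
{
	ingress_entry->expires = now + CT_CLOSE_TIMEOUT;
}

/* CT_EGRESS never sees the reply, so it keeps the long TCP lifetime and
 * outlives the CT_INGRESS entry by a wide margin. */
static void ct_seen_traffic(struct ct_entry_sketch *egress_entry, uint32_t now)
{
	egress_entry->expires = now + CT_CONNECTION_LIFETIME_TCP;
}
```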
If the same loopback connection is subsequently re-opened, the client's from-container path finds the CT_EGRESS entry and thus will not call ct_create4(). Consequently the CT_INGRESS entry is not created, and the backend will not apply the loopback-specific handling described above. Inbound requests are potentially dropped (due to network policy), and/or replies are not RevNATed.
Fix this up by letting the backend path create its CT_INGRESS entry as usual. It just needs a bit of detection code in its CT_NEW handling to understand that the first packet belongs to a .loopback connection, and populate its own CT_INGRESS entry accordingly.
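For illustration, a rough sketch of that detection step (not the actual PR diff; apart from the .loopback and rev_nat_index names, which appear in the text above, everything here is a hypothetical simplification):

```c
/* Sketch: to-container's CT_NEW handling detects that the first packet of a
 * new connection belongs to a loopback flow by consulting the CT_EGRESS entry
 * that the client's from-container path already created, and mirrors the
 * loopback state into the freshly created CT_INGRESS entry. */
#include <stdbool.h>
#include <stdint.h>

struct ct_entry_sketch {
	bool     loopback;        /* .loopback flag, as named in the text */
	uint16_t rev_nat_index;   /* rev_nat_index field, as named in the text */
};

static void ct_new_handle_loopback(const struct ct_entry_sketch *egress_entry,
				   struct ct_entry_sketch *new_ingress_entry)
{
	/* No matching loopback CT_EGRESS entry: plain (non-loopback) flow. */
	if (!egress_entry || !egress_entry->loopback)
		return;

	/* Mark the new CT_INGRESS entry as loopback so ingress network policy
	 * is skipped, and copy the rev_nat_index so replies can be RevNATed
	 * in from-container. */
	new_ingress_entry->loopback = true;
	new_ingress_entry->rev_nat_index = egress_entry->rev_nat_index;
}
```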