bpf: lxc: defer CT_INGRESS entry creation for loopback connections #27602
Conversation
/test
Force-pushed from ba45517 to 45ce1bd
/test
Can you please add me as a reviewer? I will make some time this week to review and test this.
Surprisingly, I cannot reproduce it either with or without this patch; however, the CT entries are different.
Without the patch:
With this patch:
OK @julianwiedmann, I deciphered the problem I was having reproducing it. I need to start from a clean state each time I try a new version, which means flushing the conntrack table and installing a new pod (I do not know how to force the bpf recompilation :/). With this patch I observe the "wrong" behavior, i.e. asymmetry in the timers (though RxClosing is set correctly): it sets the remaining time of TCP_IN to 5s, but TCP_OUT to 21595s.
With versions v1.14.0 and v1.14.1 the timers look OK, but the RxClosing state is not present.
This is how you can reproduce the issue. The problem is that it is not easy to automate, as it depends on the timers being incorrect and eventually getting one of these entries that has expired in one direction but not in the other; that may take some time, O(min).
Now these are the steps to reproduce. Every time you install a new Cilium version you have to start from here. Install a Pod with a Service that references itself and runs a web server:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: server-deployment
  labels:
    app: MyApp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: MyApp
  template:
    metadata:
      labels:
        app: MyApp
    spec:
      nodeName: kind-worker
      containers:
      - name: agnhost
        image: httpd:2
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: myservice
spec:
  type: ClusterIP
  selector:
    app: MyApp
  clusterIP: 10.96.7.7
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80
```

The pod will always run on the kind-worker node.
In another console, log into the corresponding cilium agent to get the conntrack entries.
I talked with @sypakine and he still sees the issue in v1.14.0. It turns out that the behavior exhibited depends on the flags; with this config, v1.14.0 still shows asymmetric values.
Yep, not trying to fix the timers 😁. That would require fixing up all the path symmetry for loopback traffic. Bigger topic. But if the TCP_IN entry expires & gets GC-collected (while the TCP_OUT stays around), you should now see that a new connection re-creates the required TCP_IN entry.
ack,
Understood, cc @sypakine, I'll fall back to the
I no longer hit the problem using the while loop for the
I always hit it before #23913 (comment)
I verified this fix. It does re-create the appropriate TCP_IN entry. Repro:
Force-pushed from 45ce1bd to 275e351
(pushed a slightly updated version with more comments, and tests for the expected CT entries)
/test
OK, now I have a consistent way to reproduce it and verify that this patch works. I don't know how it can be easily automated, since it depends on lowering the value of a config setting. The steps to repro are exactly the same as #27602 (comment); in this case I just patch the cilium config-map to set a value of 60 seconds and restart the agents to pick up the new config.
So, once you create the container, you do the request against the service and it hairpins, creating the conntrack entries, which by default I think have 15 seconds (I use the same trick as @sypakine so we have a well-known port to grep for).
Now, after 60 seconds, the GC will delete that entry
Without this patch the connection times out.
With the patch a new entry is created:
I wish I knew how to create a test for this. Ideally we should have some way of forcing the garbage collection of the conntrack entries; the alternative is to go to the BPF map and forcefully delete the entry.
/lgtm
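For the "go to the BPF map and forcefully delete the entry" alternative mentioned above, a minimal userspace sketch using libbpf could look like the following. The pin path and the key layout are assumptions for illustration only; the real IPv4 CT tuple is defined in Cilium's bpf headers and must be matched exactly.

```c
/* Sketch: force-remove one entry from a pinned Cilium CT map via libbpf.
 * The pin path and key layout below are assumptions, not Cilium's definitions. */
#include <bpf/bpf.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>

struct ct_key_guess {          /* illustrative only, NOT the real ipv4_ct_tuple */
	uint32_t daddr;
	uint32_t saddr;
	uint16_t sport;
	uint16_t dport;
	uint8_t  nexthdr;
	uint8_t  flags;
};

int main(void)
{
	/* Assumed pin path for the global IPv4 CT map. */
	int fd = bpf_obj_get("/sys/fs/bpf/tc/globals/cilium_ct4_global");
	if (fd < 0) {
		perror("bpf_obj_get");
		return 1;
	}

	struct ct_key_guess key;
	memset(&key, 0, sizeof(key));  /* fill in the tuple you want to drop */

	if (bpf_map_delete_elem(fd, &key) < 0)
		perror("bpf_map_delete_elem");
	return 0;
}
```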
Thanks for the detailed description, and for making the effort to untangle the loopback datapath. I made an initial pass, still mulling over the fix. Curious to know your thoughts about my alternate approach.
In a nutshell, this PR updates the egress logic that previously also created an ingress entry for the loopback case, and instead lets the ingress path create a CT entry if the corresponding egress entry exists. The main issue here is that the datapath can't update the ingress entry's GC timeout when it updates the egress entry, so there is inherently an asymmetry here.
From that perspective, the change makes sense to me, but it's not ideal that ct logic is so tightly coupled with policy enforcement.
If the same loopback connection is subsequently re-opened, the client's from-container path finds the CT_EGRESS entry and thus will not call ct_create4().
Just for the sake of discussion, can it check for an ingress entry, and create one if it doesn't exist? On a related note, we have logic to select a new backend after some pre-configured time has passed (see ct_entry_expired_rebalance). The same logic should apply for loopback connections as well. Wdyt?
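To make that alternative concrete, here is a rough sketch of what the check could look like on the client's from-container side; every name and type below is a hypothetical placeholder, not Cilium's actual datapath code:

```c
/* Hypothetical sketch of the alternative discussed above: when from-container
 * finds an established loopback CT_EGRESS entry, probe for the matching
 * CT_INGRESS entry and re-create it if the GC already removed it. */
#include <stdbool.h>
#include <stdint.h>

struct ct_state_sketch {
	bool     loopback;        /* mirrors the .loopback flag from the text */
	uint16_t rev_nat_index;   /* mirrors the rev_nat_index field from the text */
};

typedef bool (*ingress_lookup_fn)(const struct ct_state_sketch *egress);
typedef void (*ingress_create_fn)(const struct ct_state_sketch *egress);

static void handle_established_loopback(const struct ct_state_sketch *egress,
					ingress_lookup_fn ingress_exists,
					ingress_create_fn create_ingress)
{
	if (!egress->loopback)
		return;

	/* The CT_EGRESS entry survived garbage collection, but the CT_INGRESS
	 * entry may not have: re-create it so the backend keeps skipping
	 * ingress policy and keeps RevNATing replies. */
	if (!ingress_exists(egress))
		create_ingress(egress);
}
```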
l3_local_delivery() contains a special case to bypass ingress policy for loopback traffic. But when redirecting loopback requests / replies in a per-EP config, they don't actually bypass the policy program. The redirect_ep() delivers them to the host-side veth interface, and ingress policy is applied as the packet passes to the pod-side veth peer.

#27602 recently changed the logic so that `hairpin_flow` is only set for replies. A subtle side-effect of that change is that in a per-EP config, l3_local_delivery() will now copy the src sec identity into the skb->mark. cil_to_container() then extracts it, and tail_ipv4_to_endpoint() no longer performs an ipcache lookup for ip4->saddr (which is IPV4_LOOPBACK, and would result in WORLD_ID).

Right now this is mostly a cosmetic improvement, as the identity is not used for policy enforcement of loopback traffic. But there are still users (eg. drop notifications) in tail_ipv4_to_endpoint() that benefit from having the correct security identity in place. Therefore make the same change in the reply direction, so that we also get the correct security identity there (instead of an ipcache lookup for the Service IP, which would return WORLD_ID).

For this, push the loopback check into the tail-call specific path of l3_local_delivery(), while the per-EP path inserts the identity into skb->mark and then just calls its "normal" redirect_ep().

Reported-by: Timo Beckers <timo@isovalent.com>
Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
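As a rough illustration of the identity hand-off this commit message describes (the helper and its parameters are hypothetical; Cilium's real skb->mark encoding carries additional magic bits):

```c
/* Sketch only: prefer the source security identity carried in skb->mark over
 * an ipcache lookup of the packet's saddr. With the change above, the mark
 * carries the identity for both request and reply direction, so users of
 * tail_ipv4_to_endpoint() such as drop notifications see the real identity
 * instead of WORLD_ID. */
#include <stdint.h>

static uint32_t resolve_src_identity(uint32_t identity_from_mark,
				     uint32_t identity_from_ipcache)
{
	/* identity_from_ipcache would be WORLD_ID for IPV4_LOOPBACK or the
	 * Service IP; only fall back to it when the mark carried nothing. */
	return identity_from_mark ? identity_from_mark : identity_from_ipcache;
}
```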
Currently the CT_INGRESS entry for a loopback connection is already created when the first request leaves the client, as from-container creates the CT_EGRESS entry (see the loopback handling in ct_create4() for details).
This is unusual - we would normally just create the CT_INGRESS entry as the first packet passes through to-container into the backend. But for loopback connections it is needed, so that
1.) to-container finds a CT entry with .loopback set, and thus skips
network policy enforcement even for the first packet, and
2.) the CT entry has its rev_nat_index field populated, and thus can
RevNAT replies in from-container.
This approach conflicts with the fact that loopback replies skip the client's to-container path (to avoid network policy enforcement).
Once the loopback connection is closed, the backend's from-container path observes the FIN / RST, and __ct_lookup4() updates the CT_INGRESS entry's lifetime to CT_CLOSE_TIMEOUT. But the client's to-container path will not observe the FIN / RST, and thus the CT_EGRESS entry's lifetime remains as CT_CONNECTION_LIFETIME_TCP. Therefore the CT_INGRESS entry will expire earlier, and potentially gets garbage-collected while the CT_EGRESS entry stays in place.
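A tiny sketch of that asymmetry (constant names mirror the text above; the values and helpers are placeholders, not Cilium's actual definitions):

```c
/* Why the two CT entries of a loopback connection age out differently: only
 * the backend's from-container path observes the FIN/RST, so only the
 * CT_INGRESS entry gets the short close timeout. */
#include <stdint.h>

#define CT_CONNECTION_LIFETIME_TCP 21600 /* placeholder lifetime, seconds */
#define CT_CLOSE_TIMEOUT           10    /* placeholder post-close lifetime, seconds */

struct ct_entry_sketch {
	uint32_t expires;    /* absolute expiry time, seconds */
};

/* Runs for CT_INGRESS when the FIN/RST passes through from-container. */
static void ct_seen_close(struct ct_entry_sketch *ingress_entry, uint32_t now)
{
	ingress_entry->expires = now + CT_CLOSE_TIMEOUT;
}

/* CT_EGRESS never sees the reply, so it keeps the long TCP lifetime and
 * outlives the CT_INGRESS entry by a wide margin. */
static void ct_seen_traffic(struct ct_entry_sketch *egress_entry, uint32_t now)
{
	egress_entry->expires = now + CT_CONNECTION_LIFETIME_TCP;
}
```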
If the same loopback connection is subsequently re-opened, the client's from-container path finds the CT_EGRESS entry and thus will not call ct_create4(). Consequently the CT_INGRESS entry is not created, and the backend will not apply the loopback-specific handling described above. Inbound requests are potentially dropped (due to network policy), and/or replies are not RevNATed.
Fix this up by letting the backend path create its CT_INGRESS entry as usual. It just needs a bit of detection code in its CT_NEW handling to understand that the first packet belongs to a .loopback connection, and populate its own CT_INGRESS entry accordingly.
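For illustration, a rough sketch of that detection step (not the actual PR diff; apart from the .loopback and rev_nat_index names, which appear in the text above, everything here is a hypothetical simplification):

```c
/* Sketch: to-container's CT_NEW handling detects that the first packet of a
 * new connection belongs to a loopback flow by consulting the CT_EGRESS entry
 * that the client's from-container path already created, and mirrors the
 * loopback state into the freshly created CT_INGRESS entry. */
#include <stdbool.h>
#include <stdint.h>

struct ct_entry_sketch {
	bool     loopback;        /* .loopback flag, as named in the text */
	uint16_t rev_nat_index;   /* rev_nat_index field, as named in the text */
};

static void ct_new_handle_loopback(const struct ct_entry_sketch *egress_entry,
				   struct ct_entry_sketch *new_ingress_entry)
{
	/* No matching loopback CT_EGRESS entry: plain (non-loopback) flow. */
	if (!egress_entry || !egress_entry->loopback)
		return;

	/* Mark the new CT_INGRESS entry as loopback so ingress network policy
	 * is skipped, and copy the rev_nat_index so replies can be RevNATed
	 * in from-container. */
	new_ingress_entry->loopback = true;
	new_ingress_entry->rev_nat_index = egress_entry->rev_nat_index;
}
```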