-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bpf: Track reply packets of egress gateway and SNATed host-local connections #25112
Conversation
2b35df4
to
5e81d6d
Compare
5e81d6d
to
9988063
Compare
/test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CLI related changes LGTM
:amaze: awesome stuff! Especially happy to see that we can use the optimized CT lookup in more places, I had been hoping for this. On first thought, I feel this should be split into multiple PRs - the IPv4 fragmentation support is worth its own PR (if only to discuss what testing we have in place for it). Or are you strongly relying on this support for the reply-tracking work? |
My initial idea was to reuse ct_extract_ports4 also for NAT, then I decided that it will look too cumbersome it I tried to combine this switch and ct_extract_ports4. So I kept them separate, but made the effect functionally equivalent. For that, I had to add missing IPv4 fragment handling. I think, there is no strong dependency here, that is, if I close my eyes on the bug with fragments, the rest of the logic in that switch is good enough for the optimized CT lookup. As IPv4 fragmentation fix turned to be not straightforward, I'll follow your advice and split the pull request, so that we could get the egress gateway fix merged first. |
Full-on 🙈 then. It's been like that forever, a few more days likely won't hurt. We can always add a trivial |
9988063
to
948b04e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! A few smaller change requests, mostly to highlight that we're fixing the CT lookup for more than "just" EgressGW replies. Would be good if you could adjust the PR title / description accordingly as well.
I don't quite follow why the first two patches in the series are needed - these are just cleanups, correct? Then I would suggest to split them into a follow-on PR, so this PR is easier to backport.
Note that we should also fix the IPv6 path (ie. have snat_v6_rev_nat_handle_mapping()
revSNAT the tuple before the CT lookup), but that's perfectly fine as follow-on PR.
948b04e
to
edf38dd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! A few smaller change requests, mostly to highlight that we're fixing the CT lookup for more than "just" EgressGW replies. Would be good if you could adjust the PR title / description accordingly as well.
Done.
I don't quite follow why the first two patches in the series are needed - these are just cleanups, correct? Then I would suggest to split them into a follow-on PR, so this PR is easier to backport.
These were first steps in my bigger effort to unify the code. As it didn't happen, and there is indeed no dependency on the first two commits, I pushed them separately:
Note that we should also fix the IPv6 path (ie. have
snat_v6_rev_nat_handle_mapping()
revSNAT the tuple before the CT lookup), but that's perfectly fine as follow-on PR.
Done here, no reason to postpone it.
👍 But then we now also need all the corresponding |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Just need to decide whether we add the full IPv6 fix here, or punt to a separate PR.
I wanted to do it this way first, but it turned out to be tricky too %) If "Use the right tuple to track connections on revSNAT" went first, I would have to say something like that in its commit message:
I.e. I'd need to refer to the "next patch" in any case, but it would be even more verbose and less clear. Besides that, the egress gateway test is blocked on issue #25110, and the issue says nothing about egress gateway, it's about the bug in the tuples. That means, the commit that The two commits are really intertwined, I'm not sure it was a good idea to split them, but now that they are separate, I think the best way to untangle them is "Track egress gateway reply packets" first, then "Use the right tuple to track connections on revSNAT". Otherwise it's just much more cumbersome.
Ah right, I missed that it is not merely an optimization, but an actual necessity to avoid ct_lookup6 from overwriting the port. |
Rename ct_lb_lookup4 to ct_lazy_lookup4, because it's going to be used not only in LB flows, but now also in NAT. Add the action parameter to cover ICMP packets in NAT flows. The actual usage comes in the next commit, this one doesn't contain functional changes. Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Flag the egress gateway NAT entries as needs_ct, so that reply packets could extend the expiration timeout as well. The next commit is also required for the complete fix, otherwise the needs_ct mark won't be matched. Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Fix a bug in connection tracking in the revSNAT flow: reply packets are tracked before revSNAT restores the original tuple, meaning that it won't match the existing CT entry, and a new one will be created. This bug has impact on connections that set needs_ct (host-local and egress gateway connections): packets coming in the reply direction won't get the needs_ct mark and won't extend the expiration timeout. The fix is to convert the tuple before doing the conntrack lookup, so that it matches the tuple after revSNAT. Fixes: cilium#25110 Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
At the moment, both snat_v{4,6}_track_connection and snat_v{4,6}_new_mapping check the same conditions to determine whether a corresponding CT entry should exist for an SNAT entry. This approach has the following drawbacks: 1. Duplicate code: risk of divergence and a minor performance penalty, because this code is actually executed twice for the first packet. 2. The checks in snat_v{4,6}_track_connection are not optimal, because this function is called from both egress and ingress flows, but the checks are only relevant for the egress flow (because the first packet is egress, and the subsequent packets will have the result of the checks cached in the SNAT state). Solve both drawbacks by moving the checks one layer up, to snat_v{4,6}_nat_handle_mapping (also put the checks to a separate function for clarity). This way, the checks are only made in the egress flow when the SNAT state doesn't exist yet, and the result is reused for both snat_v{4,6}_track_connection and snat_v{4,6}_new_mapping, making the code more robust and slightly more optimized. Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com> Suggested-by: Julian Wiedmann <jwi@isovalent.com>
edf38dd
to
5284924
Compare
Ah right, that's awkward. No worries then :). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks all good, thanks a ton @gentoo-root !
/test |
How do you feel about backporting this into |
I'll backport it, but it seems it has to be merged to main first. |
Would MLH add the label if I comment? 🤔 👀 |
Backport: #25424 |
Flag the egress gateway NAT entries as needs_ct, so that reply packets could extend the expiration timeout as well.
Fix the conntrack tuple matching for packets in the reply direction, so that the needs_ct flag could actually be useful. Currently, the tuple before revSNAT is used for conntrack and doesn't match the original CT entry. After the fix, the tuple is adjusted to contain data like after revSNAT, so that it could match the original CT entry and get the needs_ct flag from it.
Fixes: #25110