Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpf: Track reply packets of egress gateway and SNATed host-local connections #25112

Merged
merged 8 commits into from
May 12, 2023

Conversation

gentoo-root
Copy link
Contributor

@gentoo-root gentoo-root commented Apr 25, 2023

Flag the egress gateway NAT entries as needs_ct, so that reply packets could extend the expiration timeout as well.

Fix the conntrack tuple matching for packets in the reply direction, so that the needs_ct flag could actually be useful. Currently, the tuple before revSNAT is used for conntrack and doesn't match the original CT entry. After the fix, the tuple is adjusted to contain data like after revSNAT, so that it could match the original CT entry and get the needs_ct flag from it.

Fixes: #25110

Track reply packets in long-living egress gateway connections and SNATed host-local connections.

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Apr 25, 2023
@gentoo-root gentoo-root added the release-note/bug This PR fixes an issue in a previous release of Cilium. label Apr 25, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Apr 25, 2023
@gentoo-root gentoo-root force-pushed the egressgw-rst-fix-2 branch 2 times, most recently from 2b35df4 to 5e81d6d Compare April 25, 2023 15:21
@gentoo-root
Copy link
Contributor Author

/test

@gentoo-root gentoo-root marked this pull request as ready for review May 2, 2023 19:18
@gentoo-root gentoo-root requested review from a team as code owners May 2, 2023 19:18
Copy link
Member

@asauber asauber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CLI related changes LGTM

@julianwiedmann
Copy link
Member

:amaze: awesome stuff! Especially happy to see that we can use the optimized CT lookup in more places, I had been hoping for this.

On first thought, I feel this should be split into multiple PRs - the IPv4 fragmentation support is worth its own PR (if only to discuss what testing we have in place for it). Or are you strongly relying on this support for the reply-tracking work?

bpf/lib/nat.h Outdated Show resolved Hide resolved
@gentoo-root
Copy link
Contributor Author

On first thought, I feel this should be split into multiple PRs - the IPv4 fragmentation support is worth its own PR (if only to discuss what testing we have in place for it). Or are you strongly relying on this support for the reply-tracking work?

My initial idea was to reuse ct_extract_ports4 also for NAT, then I decided that it will look too cumbersome it I tried to combine this switch and ct_extract_ports4.

So I kept them separate, but made the effect functionally equivalent. For that, I had to add missing IPv4 fragment handling.

I think, there is no strong dependency here, that is, if I close my eyes on the bug with fragments, the rest of the logic in that switch is good enough for the optimized CT lookup. As IPv4 fragmentation fix turned to be not straightforward, I'll follow your advice and split the pull request, so that we could get the egress gateway fix merged first.

@julianwiedmann
Copy link
Member

I think, there is no strong dependency here, that is, if I close my eyes on the bug with fragments, the rest of the logic in that switch is good enough for the optimized CT lookup. As IPv4 fragmentation fix turned to be not straightforward, I'll follow your advice and split the pull request, so that we could get the egress gateway fix merged first.

Full-on 🙈 then. It's been like that forever, a few more days likely won't hurt.

We can always add a trivial if (fragment) return NAT_PUNT_TO_STACK stop-gap if you hate it too much 😄.

Copy link
Member

@julianwiedmann julianwiedmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! A few smaller change requests, mostly to highlight that we're fixing the CT lookup for more than "just" EgressGW replies. Would be good if you could adjust the PR title / description accordingly as well.

I don't quite follow why the first two patches in the series are needed - these are just cleanups, correct? Then I would suggest to split them into a follow-on PR, so this PR is easier to backport.

Note that we should also fix the IPv6 path (ie. have snat_v6_rev_nat_handle_mapping() revSNAT the tuple before the CT lookup), but that's perfectly fine as follow-on PR.

bpf/lib/nat.h Show resolved Hide resolved
bpf/lib/nat.h Show resolved Hide resolved
bpf/lib/nat.h Outdated Show resolved Hide resolved
@julianwiedmann julianwiedmann added the sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. label May 10, 2023
@gentoo-root gentoo-root changed the title bpf: Track egress gateway reply packets bpf: Track reply packets of egress gateway and SNATed host-local connections May 10, 2023
Copy link
Contributor Author

@gentoo-root gentoo-root left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! A few smaller change requests, mostly to highlight that we're fixing the CT lookup for more than "just" EgressGW replies. Would be good if you could adjust the PR title / description accordingly as well.

Done.

I don't quite follow why the first two patches in the series are needed - these are just cleanups, correct? Then I would suggest to split them into a follow-on PR, so this PR is easier to backport.

These were first steps in my bigger effort to unify the code. As it didn't happen, and there is indeed no dependency on the first two commits, I pushed them separately:

#25356

Note that we should also fix the IPv6 path (ie. have snat_v6_rev_nat_handle_mapping() revSNAT the tuple before the CT lookup), but that's perfectly fine as follow-on PR.

Done here, no reason to postpone it.

bpf/lib/nat.h Show resolved Hide resolved
bpf/lib/nat.h Show resolved Hide resolved
bpf/lib/nat.h Outdated Show resolved Hide resolved
@julianwiedmann
Copy link
Member

Note that we should also fix the IPv6 path (ie. have snat_v6_rev_nat_handle_mapping() revSNAT the tuple before the CT lookup), but that's perfectly fine as follow-on PR.

Done here, no reason to postpone it.

👍 But then we now also need all the corresponding ct_lazy_lookup6() prep work for that path, no?

Copy link
Member

@julianwiedmann julianwiedmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Just need to decide whether we add the full IPv6 fix here, or punt to a separate PR.

bpf/lib/nat.h Show resolved Hide resolved
bpf/lib/nat.h Outdated Show resolved Hide resolved
@gentoo-root
Copy link
Contributor Author

Note that it would probably help readability if you flip the patch order around (ie. have bpf: Use the right tuple to track connections on revSNAT first).

I wanted to do it this way first, but it turned out to be tricky too %)

If "Use the right tuple to track connections on revSNAT" went first, I would have to say something like that in its commit message:

This bug has impact on connections that set needs_ct (currently
host-local connections, but after the bugfix in the next patch also
egress gateway connections): packets coming in the reply direction won't
get the needs_ct mark and won't extend the expiration timeout.

I.e. I'd need to refer to the "next patch" in any case, but it would be even more verbose and less clear.

Besides that, the egress gateway test is blocked on issue #25110, and the issue says nothing about egress gateway, it's about the bug in the tuples. That means, the commit that Fixes: #25110 would go first, but it wouldn't unblock the test blocked on #ifdef ISSUE_25110_FIXED that doesn't make any sense, and the next commit "Track egress gateway reply packets" would unblock the test, although it has nothing to do with fixing #25110.

The two commits are really intertwined, I'm not sure it was a good idea to split them, but now that they are separate, I think the best way to untangle them is "Track egress gateway reply packets" first, then "Use the right tuple to track connections on revSNAT". Otherwise it's just much more cumbersome.

But then we now also need all the corresponding ct_lazy_lookup6() prep work for that path, no?

Ah right, I missed that it is not merely an optimization, but an actual necessity to avoid ct_lookup6 from overwriting the port.

Rename ct_lb_lookup4 to ct_lazy_lookup4, because it's going to be used
not only in LB flows, but now also in NAT. Add the action parameter to
cover ICMP packets in NAT flows.

The actual usage comes in the next commit, this one doesn't contain
functional changes.

Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Flag the egress gateway NAT entries as needs_ct, so that reply packets
could extend the expiration timeout as well.

The next commit is also required for the complete fix, otherwise the
needs_ct mark won't be matched.

Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Fix a bug in connection tracking in the revSNAT flow: reply packets are
tracked before revSNAT restores the original tuple, meaning that it
won't match the existing CT entry, and a new one will be created.

This bug has impact on connections that set needs_ct (host-local and
egress gateway connections): packets coming in the reply direction won't
get the needs_ct mark and won't extend the expiration timeout.

The fix is to convert the tuple before doing the conntrack lookup, so
that it matches the tuple after revSNAT.

Fixes: cilium#25110
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
At the moment, both snat_v{4,6}_track_connection and
snat_v{4,6}_new_mapping check the same conditions to determine whether a
corresponding CT entry should exist for an SNAT entry. This approach has
the following drawbacks:

1. Duplicate code: risk of divergence and a minor performance penalty,
   because this code is actually executed twice for the first packet.

2. The checks in snat_v{4,6}_track_connection are not optimal, because
   this function is called from both egress and ingress flows, but the
   checks are only relevant for the egress flow (because the first
   packet is egress, and the subsequent packets will have the result of
   the checks cached in the SNAT state).

Solve both drawbacks by moving the checks one layer up, to
snat_v{4,6}_nat_handle_mapping (also put the checks to a separate
function for clarity). This way, the checks are only made in the egress
flow when the SNAT state doesn't exist yet, and the result is reused for
both snat_v{4,6}_track_connection and snat_v{4,6}_new_mapping, making
the code more robust and slightly more optimized.

Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Suggested-by: Julian Wiedmann <jwi@isovalent.com>
@julianwiedmann
Copy link
Member

Besides that, the egress gateway test is blocked on issue #25110, and the issue says nothing about egress gateway, it's about the bug in the tuples. That means, the commit that Fixes: #25110 would go first, but it wouldn't unblock the test blocked on #ifdef ISSUE_25110_FIXED that doesn't make any sense, and the next commit "Track egress gateway reply packets" would unblock the test, although it has nothing to do with fixing #25110.

Ah right, that's awkward. No worries then :).

Copy link
Member

@julianwiedmann julianwiedmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks all good, thanks a ton @gentoo-root !

@julianwiedmann
Copy link
Member

/test

@julianwiedmann
Copy link
Member

How do you feel about backporting this into v1.13? Infrastructure-wise it should be doable I think, the CT changes landed with #24795.

@julianwiedmann julianwiedmann added kind/bug This is a bug in the Cilium logic. feature/egress-gateway Impacts the egress IP gateway feature. feature/conntrack labels May 12, 2023
@gentoo-root
Copy link
Contributor Author

How do you feel about backporting this into v1.13?

I'll backport it, but it seems it has to be merged to main first.

@pchaigno
Copy link
Member

Would MLH add the label if I comment? 🤔 👀

@pchaigno pchaigno merged commit 86cddf6 into cilium:main May 12, 2023
57 checks passed
@julianwiedmann julianwiedmann added the needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch label May 12, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.13.3 May 12, 2023
@gentoo-root gentoo-root mentioned this pull request May 12, 2023
1 task
@gentoo-root gentoo-root added backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. and removed needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch labels May 12, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from main to Backport pending to v1.13 in 1.13.3 May 12, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from main to Backport pending to v1.13 in 1.13.3 May 12, 2023
@gentoo-root
Copy link
Contributor Author

Backport: #25424

@gentoo-root gentoo-root added backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. and removed backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. labels May 23, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.13 to Backport done to v1.13 in 1.13.3 May 23, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.12.10 May 24, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed this from Needs backport from main in 1.12.10 May 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. feature/conntrack feature/egress-gateway Impacts the egress IP gateway feature. kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
Projects
No open projects
1.13.3
Backport done to v1.13
Development

Successfully merging this pull request may close these issues.

snat_v{4,6}_rev_nat_handle_mapping passes a wrong tuple to conntrack
4 participants