Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpf: nodeport: handle DSR at from-netdev / to-netdev #22978

Merged
merged 7 commits into from
Mar 7, 2023

Conversation

julianwiedmann
Copy link
Member

@julianwiedmann julianwiedmann commented Jan 9, 2023

(see partner PR #22756)

Replies by DSR service backends currently get their revDNAT processing in
from-container. There's two problems with that:

  1. we have cases where the packet doesn't reach this part of the
    from-container program (ie. redirect to host proxy), and
  2. we're not handling replies by hostns backends at all.

This PR tackles the problem as follows:

  1. Add support for RevDNAT of DSR connections in the to-netdev path.
  2. extract the DSR information (from IPv4 Options / IPv6 Extension Header) in the nodeport entry path, and not in to-container. Build a nodeport-level CT entry for connections that have DSR info (similar as for the local-backend case). Use the DSR info to build a SNAT entry for RevDNAT of the backend replies.
  3. reduce the DSR support in bpf_lxc to only cover old connections. The nodeport path can't easily create a CT entry for these connections (as for TCP we won't see another SYN with DSR info). There's possibilities how to move this into to-netdev as well, but let's try that in a follow-on PR.
When using KPR Nodeport with DSR, support backends in hostNetwork or with L7 policies.

Updated nodeport ingress diagram:

flowchart TD
    A[from-netdev] -->|CILIUM_CALL_IPV4_FROM_NETDEV| B{"bpf_skip_nodeport()?"}
    B --> |No| C["nodeport_lb4()"]
    B --> |Yes| B1["// further packet process"]
    C --> D["lb4_lookup_service()"]
    D --> E{"dst is svc ?"}
    E --> |No| F{"has DSR info / CT entry with .dsr set?"}
    E --> |Yes| F1["lb4_local()"]
    F --> |Yes| F2[" "]
    F2 --> |CILIUM_CALL_IPV4_NODEPORT_DSR_INGRESS| X["tail_nodeport_dsr_ingress_ipv4()"]
    X --> X1["create / update CT entry and SNAT entry"]
    X1 --> X2["bpf_skip_nodeport_set()"]
    X2 --> |CILIUM_CALL_IPV4_FROM_NETDEV| B
    F1 --> G1{"backend local?"}
    G1 --> |Yes| H1["CTX_ACT_OK"]
    G1 --> |No| H2[" "]
    H2 --> |CILIUM_CALL_IPV4_NODEPORT_NAT_EGRESS| G2["tail_nodeport_nat_egress_ipv4()"]
    G2 --> J2["snat_v4_nat()"]
    J2 --> K2{"found new SNAT mapping?"}
    K2 --> |No| L2["CTX_ACT_DROP"]
    K2 --> |Yes| M2["// Do SNAT"]
    M2 --> N2["ctx_redirect()"]
    F --> |No| F3[" "]
    F3 --> |CILIUM_CALL_IPV4_NODEPORT_NAT_INGRESS| G["tail_nodeport_nat_ingress_ipv4()"]
    G --> H["snat_v4_rev_nat()"]
    H --> J{"Reverse mapping exist?"}
    J --> |No| K["bpf_skip_nodeport_set()"]
    K --> |CILIUM_CALL_IPV4_FROM_NETDEV| B
    J --> |"Yes\n(This can be rev-SNAT for outside2pod)"| L["tail_rev_nodeport_lb4()"]
    L --> M["rev_nodeport_lb4()"]
    M --> N["ct_lb_lookup4()"]
    N --> O{"NodePort CT reply?"}
    O --> |Yes| P["// do rev-DNAT"]
    P --> R["ctx_redirect()"]
    O --> |No| Q["bpf_skip_nodeport_set()"]
    Q --> |CILIUM_CALL_IPV4_FROM_NETDEV| B

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jan 9, 2023
@julianwiedmann julianwiedmann added the release-note/bug This PR fixes an issue in a previous release of Cilium. label Jan 9, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jan 9, 2023
@julianwiedmann
Copy link
Member Author

/test

@julianwiedmann
Copy link
Member Author

/test

@julianwiedmann
Copy link
Member Author

/test

@julianwiedmann
Copy link
Member Author

/test

@julianwiedmann
Copy link
Member Author

/test

@julianwiedmann
Copy link
Member Author

/test

@julianwiedmann
Copy link
Member Author

/test

@julianwiedmann julianwiedmann force-pushed the nodeport-hostns-dsr branch 2 times, most recently from 3a3d73d to e77c6f8 Compare January 23, 2023 14:50
@julianwiedmann
Copy link
Member Author

/test

@julianwiedmann julianwiedmann changed the title WIP DSR reply from hostns bpf: nodeport: handle revDNAT for DSR backends at to-netdev Jan 23, 2023
@julianwiedmann
Copy link
Member Author

/test

@julianwiedmann julianwiedmann added the sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. label Jan 27, 2023
@julianwiedmann
Copy link
Member Author

/test

@julianwiedmann
Copy link
Member Author

julianwiedmann commented Jan 27, 2023

Some ideas for reviewers to double-check:

  • is there any change in behaviour for old connections (where bpf_lxc has already created the SNAT entry and flagged its CT entry as .dsr = true)
  • in from-container, is there anything in the code path after the call to xlate_dsr_v4() that might be surprised by now seeing replies that haven't been RevDNATed yet
  • does all the CT logic in tail_nodeport_dsr_ingress_ipv4() check out?

julianwiedmann added a commit to julianwiedmann/cilium that referenced this pull request Oct 30, 2023
DSR uses a OUT NAT entry for RevDNAT of backend replies. Prior to the
changes in cilium#22978, this NAT entry was
protected by the CT_INGRESS entry which bpf_lxc creates for the backend
connection.

Test that GC of the NAT entry works when the CT entry is removed.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
julianwiedmann added a commit to julianwiedmann/cilium that referenced this pull request Oct 30, 2023
With cilium#22978 we changed how DSR NAT
entries are managed. Instead of associating the NAT entry's lifetime with
bpf_lxc's CT_INGRESS entry, the nodeport code on the backend now creates
its own CT_EGRESS entry. When such a CT_EGRESS entry is GC'ed, we should
therefore also purge the related DSR NAT entry.

Also add a test for this case.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
julianwiedmann added a commit to julianwiedmann/cilium that referenced this pull request Oct 30, 2023
With cilium#22978 we changed how DSR NAT
entries are managed. Instead of associating the NAT entry's lifetime with
bpf_lxc's CT_INGRESS entry, the nodeport code on the backend now creates
its own CT_EGRESS entry. When such a CT_EGRESS entry is GC'ed, we should
therefore also purge the related DSR NAT entry.

Also add a test for this case.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
dylandreimerink pushed a commit that referenced this pull request Oct 30, 2023
DSR uses a OUT NAT entry for RevDNAT of backend replies. Prior to the
changes in #22978, this NAT entry was
protected by the CT_INGRESS entry which bpf_lxc creates for the backend
connection.

Test that GC of the NAT entry works when the CT entry is removed.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
dylandreimerink pushed a commit that referenced this pull request Oct 30, 2023
With #22978 we changed how DSR NAT
entries are managed. Instead of associating the NAT entry's lifetime with
bpf_lxc's CT_INGRESS entry, the nodeport code on the backend now creates
its own CT_EGRESS entry. When such a CT_EGRESS entry is GC'ed, we should
therefore also purge the related DSR NAT entry.

Also add a test for this case.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
pippolo84 pushed a commit to pippolo84/cilium that referenced this pull request Oct 30, 2023
[ upstream commit 5b22423 ]

DSR uses a OUT NAT entry for RevDNAT of backend replies. Prior to the
changes in cilium#22978, this NAT entry was
protected by the CT_INGRESS entry which bpf_lxc creates for the backend
connection.

Test that GC of the NAT entry works when the CT entry is removed.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
pippolo84 pushed a commit to pippolo84/cilium that referenced this pull request Oct 30, 2023
[ upstream commit 21072cd ]

With cilium#22978 we changed how DSR NAT
entries are managed. Instead of associating the NAT entry's lifetime with
bpf_lxc's CT_INGRESS entry, the nodeport code on the backend now creates
its own CT_EGRESS entry. When such a CT_EGRESS entry is GC'ed, we should
therefore also purge the related DSR NAT entry.

Also add a test for this case.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
pippolo84 pushed a commit to pippolo84/cilium that referenced this pull request Oct 30, 2023
[ upstream commit 5b22423 ]

DSR uses a OUT NAT entry for RevDNAT of backend replies. Prior to the
changes in cilium#22978, this NAT entry was
protected by the CT_INGRESS entry which bpf_lxc creates for the backend
connection.

Test that GC of the NAT entry works when the CT entry is removed.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
pippolo84 pushed a commit to pippolo84/cilium that referenced this pull request Oct 30, 2023
[ upstream commit 21072cd ]

With cilium#22978 we changed how DSR NAT
entries are managed. Instead of associating the NAT entry's lifetime with
bpf_lxc's CT_INGRESS entry, the nodeport code on the backend now creates
its own CT_EGRESS entry. When such a CT_EGRESS entry is GC'ed, we should
therefore also purge the related DSR NAT entry.

Also add a test for this case.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
pippolo84 pushed a commit to pippolo84/cilium that referenced this pull request Oct 30, 2023
[ upstream commit 5b22423 ]

DSR uses a OUT NAT entry for RevDNAT of backend replies. Prior to the
changes in cilium#22978, this NAT entry was
protected by the CT_INGRESS entry which bpf_lxc creates for the backend
connection.

Test that GC of the NAT entry works when the CT entry is removed.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
pippolo84 pushed a commit to pippolo84/cilium that referenced this pull request Oct 30, 2023
[ upstream commit 21072cd ]

With cilium#22978 we changed how DSR NAT
entries are managed. Instead of associating the NAT entry's lifetime with
bpf_lxc's CT_INGRESS entry, the nodeport code on the backend now creates
its own CT_EGRESS entry. When such a CT_EGRESS entry is GC'ed, we should
therefore also purge the related DSR NAT entry.

Also add a test for this case.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
pippolo84 pushed a commit to pippolo84/cilium that referenced this pull request Oct 30, 2023
[ upstream commit 5b22423 ]

DSR uses a OUT NAT entry for RevDNAT of backend replies. Prior to the
changes in cilium#22978, this NAT entry was
protected by the CT_INGRESS entry which bpf_lxc creates for the backend
connection.

Test that GC of the NAT entry works when the CT entry is removed.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
pippolo84 pushed a commit to pippolo84/cilium that referenced this pull request Oct 30, 2023
[ upstream commit 21072cd ]

With cilium#22978 we changed how DSR NAT
entries are managed. Instead of associating the NAT entry's lifetime with
bpf_lxc's CT_INGRESS entry, the nodeport code on the backend now creates
its own CT_EGRESS entry. When such a CT_EGRESS entry is GC'ed, we should
therefore also purge the related DSR NAT entry.

Also add a test for this case.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
pippolo84 pushed a commit to pippolo84/cilium that referenced this pull request Oct 31, 2023
[ upstream commit 5b22423 ]

DSR uses a OUT NAT entry for RevDNAT of backend replies. Prior to the
changes in cilium#22978, this NAT entry was
protected by the CT_INGRESS entry which bpf_lxc creates for the backend
connection.

Test that GC of the NAT entry works when the CT entry is removed.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
pippolo84 pushed a commit to pippolo84/cilium that referenced this pull request Oct 31, 2023
[ upstream commit 21072cd ]

With cilium#22978 we changed how DSR NAT
entries are managed. Instead of associating the NAT entry's lifetime with
bpf_lxc's CT_INGRESS entry, the nodeport code on the backend now creates
its own CT_EGRESS entry. When such a CT_EGRESS entry is GC'ed, we should
therefore also purge the related DSR NAT entry.

Also add a test for this case.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
pippolo84 pushed a commit to pippolo84/cilium that referenced this pull request Oct 31, 2023
[ upstream commit 5b22423 ]

DSR uses a OUT NAT entry for RevDNAT of backend replies. Prior to the
changes in cilium#22978, this NAT entry was
protected by the CT_INGRESS entry which bpf_lxc creates for the backend
connection.

Test that GC of the NAT entry works when the CT entry is removed.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
pippolo84 pushed a commit to pippolo84/cilium that referenced this pull request Oct 31, 2023
[ upstream commit 21072cd ]

With cilium#22978 we changed how DSR NAT
entries are managed. Instead of associating the NAT entry's lifetime with
bpf_lxc's CT_INGRESS entry, the nodeport code on the backend now creates
its own CT_EGRESS entry. When such a CT_EGRESS entry is GC'ed, we should
therefore also purge the related DSR NAT entry.

Also add a test for this case.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
aditighag pushed a commit that referenced this pull request Nov 2, 2023
[ upstream commit 5b22423 ]

DSR uses a OUT NAT entry for RevDNAT of backend replies. Prior to the
changes in #22978, this NAT entry was
protected by the CT_INGRESS entry which bpf_lxc creates for the backend
connection.

Test that GC of the NAT entry works when the CT entry is removed.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
aditighag pushed a commit that referenced this pull request Nov 2, 2023
[ upstream commit 21072cd ]

With #22978 we changed how DSR NAT
entries are managed. Instead of associating the NAT entry's lifetime with
bpf_lxc's CT_INGRESS entry, the nodeport code on the backend now creates
its own CT_EGRESS entry. When such a CT_EGRESS entry is GC'ed, we should
therefore also purge the related DSR NAT entry.

Also add a test for this case.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
pippolo84 pushed a commit to pippolo84/cilium that referenced this pull request Nov 3, 2023
[ upstream commit 5b22423 ]

DSR uses a OUT NAT entry for RevDNAT of backend replies. Prior to the
changes in cilium#22978, this NAT entry was
protected by the CT_INGRESS entry which bpf_lxc creates for the backend
connection.

Test that GC of the NAT entry works when the CT entry is removed.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
pippolo84 pushed a commit to pippolo84/cilium that referenced this pull request Nov 3, 2023
[ upstream commit 21072cd ]

With cilium#22978 we changed how DSR NAT
entries are managed. Instead of associating the NAT entry's lifetime with
bpf_lxc's CT_INGRESS entry, the nodeport code on the backend now creates
its own CT_EGRESS entry. When such a CT_EGRESS entry is GC'ed, we should
therefore also purge the related DSR NAT entry.

Also add a test for this case.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
jibi pushed a commit that referenced this pull request Nov 7, 2023
[ upstream commit 5b22423 ]

DSR uses a OUT NAT entry for RevDNAT of backend replies. Prior to the
changes in #22978, this NAT entry was
protected by the CT_INGRESS entry which bpf_lxc creates for the backend
connection.

Test that GC of the NAT entry works when the CT entry is removed.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
jibi pushed a commit that referenced this pull request Nov 7, 2023
[ upstream commit 21072cd ]

With #22978 we changed how DSR NAT
entries are managed. Instead of associating the NAT entry's lifetime with
bpf_lxc's CT_INGRESS entry, the nodeport code on the backend now creates
its own CT_EGRESS entry. When such a CT_EGRESS entry is GC'ed, we should
therefore also purge the related DSR NAT entry.

Also add a test for this case.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
julianwiedmann added a commit to julianwiedmann/cilium that referenced this pull request May 21, 2024
cilium#22978 added support for DSR RevDNAT
at the native device-level. This was introduced in v1.14, and backported
to v1.13.3. We kept the support in bpf_lxc for old-style DSR around a bit
longer, to not break established connections. But with v1.16 it is about
time to remove this legacy path.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
julianwiedmann added a commit to julianwiedmann/cilium that referenced this pull request May 21, 2024
cilium#22978 added support for DSR RevDNAT
at the native device-level. This was introduced in v1.14, and backported
to v1.13.3. We kept the support in bpf_lxc for old-style DSR around a bit
longer, to not break established connections. But with v1.16 it is about
time to remove this legacy path.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
github-merge-queue bot pushed a commit that referenced this pull request May 21, 2024
#22978 added support for DSR RevDNAT
at the native device-level. This was introduced in v1.14, and backported
to v1.13.3. We kept the support in bpf_lxc for old-style DSR around a bit
longer, to not break established connections. But with v1.16 it is about
time to remove this legacy path.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DSR does not work for service endpoints in host netns
8 participants