Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[v1.13] backport for DSR handling in from/to-netdev #24795

Merged
merged 28 commits into from
Apr 19, 2023

Conversation

julianwiedmann
Copy link
Member

@julianwiedmann julianwiedmann commented Apr 10, 2023

This brings a whole bunch of PRs into v1.13 to fix the DSR support for backends in hostNetwork, or with L7 policy (#21955 (comment)).

Once this PR is merged, you can update the PR labels via:

for pr in 23326 23718 22936 22978 24524 24794; do contrib/backporting/set-labels.py $pr done 1.13; done

or with

make add-labels BRANCH=v1.13 ISSUES=23326,23718,22936,22978,24524,24794

@maintainer-s-little-helper maintainer-s-little-helper bot added backport/1.13 This PR represents a backport for Cilium 1.13.x of a PR that was merged to main. kind/backports This PR provides functionality previously merged into master. labels Apr 10, 2023
@julianwiedmann
Copy link
Member Author

/test-backport-1.13

[ upstream commit 997be00 ]

Currently we first extract the ports from the packet in their actual order,
and then store them into the CT tuple in reverse order.

Just follow the established pattern and immediately store into the CT
tuple in reverse order. While at it also switch to the l4_load_ports()
helper.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit 0d62815 ]

The preceding revalidate_data() always locates the L3 header at
`data + ETH_HLEN`. So there's no need for a more generic approach when
determining the L4 offset a few lines later.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit bab80db ]

Let bpf_lxc provide a validated L3 header to the low-level DSR code. Also
pass this on to snat_*_create_dsr().

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit cff1e49 ]

Introduce a proper struct to get rid of all the complicated masking, and
clearly spell out the .type and .len parts of the option. Use this struct
to fill & extract the DSR info.

The on-wire format stays the same, and this patch doesn't try to solve the
endianness confusion for the .port and .addr fields.

When fetching the IPv4 option words, use a single ctx_load_bytes(). We've
previously checked ip4->ihl, so there's no risk that the packet might be
too short.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit d0a1450 ]

The pre-defined ports are in network byte-order, do the same for the IP
addresses.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit 0991c24 ]

Add tests for forwarding to a remote backend with DSR, where the LB needs
to insert the IPv4 Option.

We're not testing local-backend here (it's identical to non-DSR), and
there's no reply path trough the LB node.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit 18acac5 ]

Remove some dead code.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit ef8d9d9 ]

We currently have IPv4 and IPv6 variants of several helpers that update a
CT entry. These only differ in the CT tuple parameter, which then gets used
as key for a map lookup.

To avoid this duplication, take ct_update_nodeport() as example and turn
the CT tuple into an opaque parameter.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit b4a10f9 ]

As part of the nodeport SVC lookup, lb6_extract_key() currently extracts
the packet's destination port. Refactor this code path to use
lb6_extract_tuple() instead, thus loading _both_ L4 ports. We can use this
in a subsequent patch to slim down the CT lookup code.

lb6_extract_key() turns into a much simpler helper that only fills the
lb6_key.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit 2c80ee5 ]

This helper is now only used for IPv4 packets. Remove the IPv6-specific
parts.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit 28dfbaa ]

We already have a validated IPv4 header, so avoid jumping through
ipv4_ct_extract_l4_ports() for loading the ports.

Stick close to how lb4_extract_l4_port() lays out the code, as we want to
start using lb4_extract_tuple() in more places and the 4.19 verifier is
very picky about it.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit f6862ff ]

Same as for lb6_extract_tuple().

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit 1c73ddf ]

Trust that earlier stages have already checked the L4 protocol, and we're
not suddenly processing a request with non-SVC protocol.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit 05e6c35 ]

The SVC-to-backend DNAT code already updates the .daddr in the CT tuple.
Also update the backend port, so that the nodeport code can use this tuple
as-is to create an CT entry for RevDNAT of non-DSR connections.

Slightly pull down the .daddr update in lb6_local(), to keep the code in
sync with lb4_local() and have the whole tuple update in one place.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit b6c4e57 ]

Enforce proper types.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit eee902f ]

Allow the LB paths to perform a CT lookup without validating the
L3 protocol or extracting the L4 ports. This helps to reduce instruction
count in the nodeport code.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit 284f62f ]

As the LB paths are doing their own L3 protocol checks and also extract
the L4 ports on their own, we can switch lb_local() and all the nodeport
code to use the optimized CT lookup helpers.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit a00b74d ]

DSR replies currently get their RevDNAT processing in from-container. That
doesn't work for hostNetwork backends, or when from-container sends the
packet off to the stack for L7 policy.

Commit 54a8631 ("bpf: nodeport: handle revDNAT for local backends at to-netdev/to-overlay")
recently fixed the same scenario for local backends, by adding a
RevDNAT stage in to-netdev. Extend this code to also handle DSR replies.

Note that right now no traffic should match this case, we'll only create
the necessary CT entries in a subsequent patch.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit 05d084b ]

Let the nodeport code detect the DSR-specific Option/Extension, and create
a corresponding CT entry (same CT tuple as the pod-level CT entry, but with
CT_EGRESS). This is similar to what we do for service traffic to a local
backend. We do this in a separate tail-call for complexity reasons,
nodeport_lb() is just getting too crowded.
The DSR reply path also requires a SNAT entry to handle the RevDNAT, so
create that as well.

Existing DSR connections continue to use their established SNAT entries,
and trigger the RevDNAT from the .dsr flag in their pod-level CT entry.
Their SNAT entry is protected from GC by the pod-level CT entry.

New DSR connections live off the state created by nodeport_lb4(), and only
get revDNATed in to-netdev. Meaning that reply traffic by a pod for such
connections won't be RevDNATed as it passes through ie. netfilter. This
matches the inbound direction.

If a packet belongs to a DSR connection (ie gets handled by
tail_nodeport_dsr_ingress_*()), we know it's not reply traffic. Thus
we don't need to jump through CILIUM_CALL_*_NODEPORT_NAT_INGRESS
afterwards, and can circle straight back to CILIUM_CALL_*_FROM_NETDEV.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit 0d81d23 ]

TCP connections have the DSR info only in their SYN packet. So for any
subsequent packet we don't detect that it belongs to a DSR connection, and
thus don't step through the nodeport CT code.

Manually identify packets for such connections, and send them down the
DSR-CT path so that they can do a proper ct_lookup() to update state and
stats. This requires some changes to that CT code, so that it tolerates
being called without DSR info.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit e01a9bc ]

Responsibility for creating the DSR SNAT entry has moved into the nodeport
code. Its corresponding CT entry looks the same as what would be created
on a pod level (ie. ingressCTKeyFromEgressNatKey()), except that we
intentionally flip the direction (TUPLE_F_OUT instead of TUPLE_F_IN).

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit 93d12ee ]

All callers currently pass ETH_HLEN, but we need a bit more flexibility
when using the helper in unit tests.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit db5f376 ]

We'll need to insert IPv4 options for DSR tests.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit db6f972 ]

Test that an inbound request with DSR IP Option results in the expected
CT and SNAT entry, and gets passed up to the backend.

The reply path then uses the SNAT entry that for RevDNAT of a reply packet.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit d8799e6 ]

When DSR traffic arrives at a backend node, don't bother tracking its L2
source address. Its the LB's MAC address, but the replies need to go
straight to the client.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit 995081d ]

Shuffle the two DSR-INGRESS tail-calls into the ENABLE_DSR-guarded section,
so that we can add calls to DSR-only code to them in a subsequent patch.

Their callers are already guarded in the same fashion.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit 751da7f ]

We need to free up the CB_SRC_LABEL slot, but at least for IPv6 all skb->cb
slots are already in use. So stop using the CB for transporting the DSR
info, but just extract it a second time inside the tail-call.

There's opportunities to clean this up more, and also alternative
approaches (eg. use a per-CPU map entry). But let's keep things
intentationally simple for now.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit 2c05517 ]

All nodeport code paths that tail-call back to IPV*_FROM_NETDEV need to
restore the CB_SRC_LABEL with `src_identity`.

We missed doing so for the DSR-Ingress tail-calls.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
@julianwiedmann
Copy link
Member Author

/test-backport-1.13

@julianwiedmann julianwiedmann changed the title WIP [v1.13] dsr backport [v1.13] backport for DSR handling in from/to-netdev Apr 17, 2023
@julianwiedmann julianwiedmann marked this pull request as ready for review April 18, 2023 06:04
@julianwiedmann julianwiedmann requested a review from a team as a code owner April 18, 2023 06:04
@julianwiedmann
Copy link
Member Author

build-commits timed out, but is ✔️ locally.

@julianwiedmann
Copy link
Member Author

/test-1.16-4.19

@julianwiedmann
Copy link
Member Author

/test-1.17-4.19

@julianwiedmann
Copy link
Member Author

/test-1.18-4.19

@julianwiedmann
Copy link
Member Author

/test-1.19-4.19

@julianwiedmann
Copy link
Member Author

/test-1.20-4.19

@julianwiedmann
Copy link
Member Author

/test-1.21-4.19

@julianwiedmann
Copy link
Member Author

/test-1.22-4.19

@julianwiedmann
Copy link
Member Author

/test-1.23-4.19

@julianwiedmann
Copy link
Member Author

/test-1.24-4.19

@julianwiedmann julianwiedmann added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Apr 19, 2023
@julianwiedmann
Copy link
Member Author

julianwiedmann commented Apr 19, 2023

Tested locally that backend with L7 policy now works in combination with #24807.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/1.13 This PR represents a backport for Cilium 1.13.x of a PR that was merged to main. kind/backports This PR provides functionality previously merged into master. ready-to-merge This PR has passed all tests and received consensus from code owners to merge.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants