-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[v1.13] backport for DSR handling in from/to-netdev #24795
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
maintainer-s-little-helper
bot
added
backport/1.13
This PR represents a backport for Cilium 1.13.x of a PR that was merged to main.
kind/backports
This PR provides functionality previously merged into master.
labels
Apr 10, 2023
/test-backport-1.13 |
[ upstream commit 997be00 ] Currently we first extract the ports from the packet in their actual order, and then store them into the CT tuple in reverse order. Just follow the established pattern and immediately store into the CT tuple in reverse order. While at it also switch to the l4_load_ports() helper. Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit 0d62815 ] The preceding revalidate_data() always locates the L3 header at `data + ETH_HLEN`. So there's no need for a more generic approach when determining the L4 offset a few lines later. Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit bab80db ] Let bpf_lxc provide a validated L3 header to the low-level DSR code. Also pass this on to snat_*_create_dsr(). Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit cff1e49 ] Introduce a proper struct to get rid of all the complicated masking, and clearly spell out the .type and .len parts of the option. Use this struct to fill & extract the DSR info. The on-wire format stays the same, and this patch doesn't try to solve the endianness confusion for the .port and .addr fields. When fetching the IPv4 option words, use a single ctx_load_bytes(). We've previously checked ip4->ihl, so there's no risk that the packet might be too short. Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit d0a1450 ] The pre-defined ports are in network byte-order, do the same for the IP addresses. Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit 0991c24 ] Add tests for forwarding to a remote backend with DSR, where the LB needs to insert the IPv4 Option. We're not testing local-backend here (it's identical to non-DSR), and there's no reply path trough the LB node. Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit 18acac5 ] Remove some dead code. Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit ef8d9d9 ] We currently have IPv4 and IPv6 variants of several helpers that update a CT entry. These only differ in the CT tuple parameter, which then gets used as key for a map lookup. To avoid this duplication, take ct_update_nodeport() as example and turn the CT tuple into an opaque parameter. Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit b4a10f9 ] As part of the nodeport SVC lookup, lb6_extract_key() currently extracts the packet's destination port. Refactor this code path to use lb6_extract_tuple() instead, thus loading _both_ L4 ports. We can use this in a subsequent patch to slim down the CT lookup code. lb6_extract_key() turns into a much simpler helper that only fills the lb6_key. Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit 2c80ee5 ] This helper is now only used for IPv4 packets. Remove the IPv6-specific parts. Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit 28dfbaa ] We already have a validated IPv4 header, so avoid jumping through ipv4_ct_extract_l4_ports() for loading the ports. Stick close to how lb4_extract_l4_port() lays out the code, as we want to start using lb4_extract_tuple() in more places and the 4.19 verifier is very picky about it. Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit f6862ff ] Same as for lb6_extract_tuple(). Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit 1c73ddf ] Trust that earlier stages have already checked the L4 protocol, and we're not suddenly processing a request with non-SVC protocol. Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit 05e6c35 ] The SVC-to-backend DNAT code already updates the .daddr in the CT tuple. Also update the backend port, so that the nodeport code can use this tuple as-is to create an CT entry for RevDNAT of non-DSR connections. Slightly pull down the .daddr update in lb6_local(), to keep the code in sync with lb4_local() and have the whole tuple update in one place. Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit b6c4e57 ] Enforce proper types. Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit eee902f ] Allow the LB paths to perform a CT lookup without validating the L3 protocol or extracting the L4 ports. This helps to reduce instruction count in the nodeport code. Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit 284f62f ] As the LB paths are doing their own L3 protocol checks and also extract the L4 ports on their own, we can switch lb_local() and all the nodeport code to use the optimized CT lookup helpers. Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit a00b74d ] DSR replies currently get their RevDNAT processing in from-container. That doesn't work for hostNetwork backends, or when from-container sends the packet off to the stack for L7 policy. Commit 54a8631 ("bpf: nodeport: handle revDNAT for local backends at to-netdev/to-overlay") recently fixed the same scenario for local backends, by adding a RevDNAT stage in to-netdev. Extend this code to also handle DSR replies. Note that right now no traffic should match this case, we'll only create the necessary CT entries in a subsequent patch. Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit 05d084b ] Let the nodeport code detect the DSR-specific Option/Extension, and create a corresponding CT entry (same CT tuple as the pod-level CT entry, but with CT_EGRESS). This is similar to what we do for service traffic to a local backend. We do this in a separate tail-call for complexity reasons, nodeport_lb() is just getting too crowded. The DSR reply path also requires a SNAT entry to handle the RevDNAT, so create that as well. Existing DSR connections continue to use their established SNAT entries, and trigger the RevDNAT from the .dsr flag in their pod-level CT entry. Their SNAT entry is protected from GC by the pod-level CT entry. New DSR connections live off the state created by nodeport_lb4(), and only get revDNATed in to-netdev. Meaning that reply traffic by a pod for such connections won't be RevDNATed as it passes through ie. netfilter. This matches the inbound direction. If a packet belongs to a DSR connection (ie gets handled by tail_nodeport_dsr_ingress_*()), we know it's not reply traffic. Thus we don't need to jump through CILIUM_CALL_*_NODEPORT_NAT_INGRESS afterwards, and can circle straight back to CILIUM_CALL_*_FROM_NETDEV. Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit 0d81d23 ] TCP connections have the DSR info only in their SYN packet. So for any subsequent packet we don't detect that it belongs to a DSR connection, and thus don't step through the nodeport CT code. Manually identify packets for such connections, and send them down the DSR-CT path so that they can do a proper ct_lookup() to update state and stats. This requires some changes to that CT code, so that it tolerates being called without DSR info. Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit e01a9bc ] Responsibility for creating the DSR SNAT entry has moved into the nodeport code. Its corresponding CT entry looks the same as what would be created on a pod level (ie. ingressCTKeyFromEgressNatKey()), except that we intentionally flip the direction (TUPLE_F_OUT instead of TUPLE_F_IN). Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit 93d12ee ] All callers currently pass ETH_HLEN, but we need a bit more flexibility when using the helper in unit tests. Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit db5f376 ] We'll need to insert IPv4 options for DSR tests. Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit db6f972 ] Test that an inbound request with DSR IP Option results in the expected CT and SNAT entry, and gets passed up to the backend. The reply path then uses the SNAT entry that for RevDNAT of a reply packet. Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit d8799e6 ] When DSR traffic arrives at a backend node, don't bother tracking its L2 source address. Its the LB's MAC address, but the replies need to go straight to the client. Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit 995081d ] Shuffle the two DSR-INGRESS tail-calls into the ENABLE_DSR-guarded section, so that we can add calls to DSR-only code to them in a subsequent patch. Their callers are already guarded in the same fashion. Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit 751da7f ] We need to free up the CB_SRC_LABEL slot, but at least for IPv6 all skb->cb slots are already in use. So stop using the CB for transporting the DSR info, but just extract it a second time inside the tail-call. There's opportunities to clean this up more, and also alternative approaches (eg. use a per-CPU map entry). But let's keep things intentationally simple for now. Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
[ upstream commit 2c05517 ] All nodeport code paths that tail-call back to IPV*_FROM_NETDEV need to restore the CB_SRC_LABEL with `src_identity`. We missed doing so for the DSR-Ingress tail-calls. Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
julianwiedmann
force-pushed
the
v1.13-dsr
branch
from
April 17, 2023 13:32
b32c661
to
7a84ff5
Compare
/test-backport-1.13 |
julianwiedmann
changed the title
WIP [v1.13] dsr backport
[v1.13] backport for DSR handling in from/to-netdev
Apr 17, 2023
|
/test-1.16-4.19 |
/test-1.17-4.19 |
/test-1.18-4.19 |
/test-1.19-4.19 |
/test-1.20-4.19 |
/test-1.21-4.19 |
/test-1.22-4.19 |
/test-1.23-4.19 |
/test-1.24-4.19 |
julianwiedmann
added
the
ready-to-merge
This PR has passed all tests and received consensus from code owners to merge.
label
Apr 19, 2023
Tested locally that backend with L7 policy now works in combination with #24807. |
2 tasks
2 tasks
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
backport/1.13
This PR represents a backport for Cilium 1.13.x of a PR that was merged to main.
kind/backports
This PR provides functionality previously merged into master.
ready-to-merge
This PR has passed all tests and received consensus from code owners to merge.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This brings a whole bunch of PRs into v1.13 to fix the DSR support for backends in
hostNetwork
, or with L7 policy (#21955 (comment)).Once this PR is merged, you can update the PR labels via:
or with