-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
v1.13 backports 2023-06-05 #25897
Merged
dylandreimerink
merged 16 commits into
cilium:v1.13
from
pchaigno:pr/v1.13-backport-2023-06-05
Jun 6, 2023
Merged
v1.13 backports 2023-06-05 #25897
dylandreimerink
merged 16 commits into
cilium:v1.13
from
pchaigno:pr/v1.13-backport-2023-06-05
Jun 6, 2023
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
[ upstream commit 600c7d4 ] The XFRM IN policies and states didn't change so we should never need to remove any stale XFRM IN configs. Let's thus simplify the logic to find stale policies and states accordingly. I would expect this incorrect removal to cause a few drops on agent restart, but after multiple attempts to reproduce on small (3 nodes) and larger (20) clusters (EKS & GKE) with a drop-sensitive application (migrate-svc), I'm not able to see such drops. I'm guessing this is because we reinstall the XFRM IN configs right after we removed them so there isn't really much time for a packet to be received and dropped. Fixes: 688dc9a ("ipsec: Remove stale XFRM states and policies") Signed-off-by: Paul Chaignon <paul@cilium.io> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit bbc50a3 ] This small bit of refactoring will make later changes a bit easier. Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 2ada282 ] No functional changes. Best viewed with git show -b or the equivalent on GitHub to not show space-only changes. Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 64f4c23 ] No functional changes. Best viewed with git show -b or the equivalent on GitHub to not show space-only changes. Same as the previous commit but on a different set of conditions. Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 66c45ac ] This is a partial revert of commit 3e59b68 ("ipsec: Per-node XFRM states & policies for EKS & AKS"). One change that commit 3e59b68 ("ipsec: Per-node XFRM states & policies for EKS & AKS") make on EKS and AKS was to switch from using NodeInternalIPs to using CiliumInternalIPs for outer IPsec (ESN) IP addresses. That made the logic more consistent with the logic we use for other IPAM schemes (e.g., GKE). It however causes serious connectivity issues on upgrades and downgrades. This is mostly because typically not all nodes are updated to the new Cilium version at the same time. If we consider two pods on nodes A and B trying to communicate, then node A may be using the old NodeInternalIPs while node B is already on the new CiliumInternalIPs. When node B sends traffic to node A, node A doesn't have the XFRM state IN necessary to decrypt it. The same happens in the other direction. This commit reintroduces the NodeInternalIPs for EKS and AKS. Subsequent commits will introduce additional changes to enable a proper migration path from NodeInternalIPs to CiliumInternalIPs. Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 6b3b50d ] On EKS and AKS, we currently use NodeInternalIPs for the IPsec tunnels. A subsequent commit will allow us to change that to switch to using CiliumInternalIPs (as done on GKE). For that to be possible without breaking inter-node connectivity for the whole duration of the switch, we need an intermediate mode where both CiliumInternalIPs and NodeInternalIPs are accepted on ingress. The idea is that we will then have a two-steps migration from NodeInternalIP to CiliumInternalIP: 1. All nodes are using NodeInternalIP. 2. Upgrade to the version of Cilium that supports both NodeInternalIP and CiliumInternalIP and encapsulates IPsec traffic with NodeInternalIP. 3. Via an agent flag, tell Cilium to switch to encapsulating IPsec traffic with CiliumInternalIP. 4. All nodes are using CiliumInternalIP. This commit implements the logic for step 2 above. To that end, we will duplicate the XFRM IN states such that we have both: src 0.0.0.0 dst [NodeInternalIP] # existing src 0.0.0.0 dst [CiliumInternalIP] # new thus matching and being able to receive IPsec packets with an outer destination IP of either NodeInternalIP or CiliumInternalIP. Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 963e45b ] On EKS and AKS, IPsec used NodeInternalIPs for the encapsulation. This commit introduces a new flag to allow switching from NodeInternalIPs to CiliumInternalIPs; it defaults to the former. This new flag allows for step 3 of the migration plan defined in the previous commit. Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit e880002 ] reinitializeIPSec only runs the interface detection if EncryptInterface is empty. Since it sets it after detecting interfaces, it will only run the detection once. Let's change that to run the detection even if the EncryptInterface list isn't empty. That will allow us to rerun the detection when new ENI devices are added on EKS. One consequence of this change is that we will now attach to all interfaces even if the user configured --encrypt-interface. That is fine because --encrypt-interface shouldn't actually be used in ENI mode. In ENI mode, we want to attach to all interfaces as we don't have a guarantee on which interface the IPsec traffic will come in. Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Signed-off-by: Jussi Maki <jussi@isovalent.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit bf0940b ] If IPsec is enabled along with the ENI IPAM mode we need to load the bpf_network program onto new ENI devices when they're added at runtime. To fix this, we subscribe to netlink link updates to detect when new (non-veth) devices are added and reinitialize IPsec to load the BPF program onto the devices. The compilation of the bpf_netowrk program has been moved to Reinitialize() to avoid spurious recompilation on reinitialize. Signed-off-by: Jussi Maki <jussi@isovalent.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 3201a5e ] This commit simply refactors some existing code into a new getNodeIDForNode function. This function will be called from elsewhere in a subsequent commit. Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 9cc8a89 ] Our logic to clean up old XFRM configs on node deletion currently relies on the destination IP to identify the configs to remove. That doesn't work with ENI and Azure IPAMs, but until recently, it didn't need to. On ENI and Azure IPAMs we didn't have per-node XFRM configs. That changed in commit 3e59b68 ("ipsec: Per-node XFRM states & policies for EKS & AKS"). We now need to clean up per-node XFRM configs for ENI and Azure IPAM modes as well, and we can't rely on the destination IP for that because the XFRM policies don't match on that destination IP. Instead, since commit 73c36d4 ("ipsec: Match OUT XFRM states & policies using node IDs"), we match the per-node XFRM configs using node IDs encoded in the packet mark. The good news is that this is true for all IPAM modes (whether Azure, ENI, cluster-pool, or something else). So our cleanup logic can now rely on the node ID of the deleted node to clean up its XFRM states and policies. This commit implements that. Fixes: 3e59b68 ("ipsec: Per-node XFRM states & policies for EKS & AKS") Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 3e898f2 ] This commit lowers the priority of the catch-all default-drop XFRM OUT policies, from 1 to 100. For context, 0 is the highest possible priority. This change will allow us to introduce several levels of priorities for XFRM OUT policies in subsequent commits. Diff before/after this patch: src 0.0.0.0/0 dst 0.0.0.0/0 - dir out action block priority 1 + dir out action block priority 100 mark 0xe00/0xf00 Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit a11d088 ] This is a revert, or rather a reimplementation, of commit 688dc9a ("ipsec: Remove stale XFRM states and policies"). In that commit, we would remove the old XFRM OUT policies and states because they conflict with the new ones and prevent the installation to proceed. This removal however causes a short window of packet drops on upgrade, between the time the old XFRM configs are removed and the new ones are added. These drops would show up as XfrmOutPolBlock because packets then match the catch-all default-drop XFRM policy. Instead of removing the old XFRM configs, a better, less-disruptive approach is to deprioritize them and add the new ones in front. To that end, we "lower" the priority of the old XFRM OUT policy from 0 to 50 (0 is the highest-possible priority). By doing this the XFRM OUT state is also indirectly deprioritized because it is selected by the XFRM OUT policy. As with the code from commit 688dc9a ("ipsec: Remove stale XFRM states and policies"), this whole logic can be removed in v1.15, once we are sure that nobody is upgrading with the old XFRM configs in place. At that point, we will be able to completely clean up those old XFRM configs. The priority of 50 was chosen arbitrarily, to be between the priority of new XFRM OUT configs (0) and the priority of the catch-all default-drop policy (100), while leaving space if we need to add additional rules of different priorities. Diff before/after upgrade (v1.13.0 -> this patch, GKE): src 10.24.1.0/24 dst 10.24.2.0/24 - dir out priority 0 + dir out priority 50 mark 0x3e00/0xff00 tmpl src 10.24.1.77 dst 10.24.2.207 proto esp spi 0x00000003 reqid 1 mode tunnel Fixes: 688dc9a ("ipsec: Remove stale XFRM states and policies") Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 0af2303 ] This commit reverts the bpf_host changes of commit 4c7cce1 ("bpf: Remove IP_POOLS IPsec code"). The IP_POOLS IPsec code was a hack to avoid having one XFRM OUT policy and state per remote node. Instead, we had a single XFRM OUT policy and state that would encrypt traffic as usual, but encapsulate it with placeholder IP addresses, such as 0.0.0.0 -> 192.168.0.0. Those outer IP addresses would then be rewritten to the proper IPs in bpf_host. To that end, bpf_lxc would pass the destination IP address, the tunnel endpoint, to bpf_host via a skb->cb slot. The source IP address was hardcoded in the object file. Commit 4c7cce1 ("bpf: Remove IP_POOLS IPsec code") thus got rid of that hack to instead have per-node XFRM OUT policies and states. The kernel therefore directly writes the proper outer IP addresses. Unfortunately, the transition from one implementation to the other isn't so simple. If we simply remove the old IP_POOLS code as done in commit 4c7cce1, then we will have drops on upgrade. We have two cases, depending on which of bpf_lxc or bpf_host is reloaded first: 1. If bpf_host is reloaded before the new bpf_lxc is loaded, then it won't rewrite the outer IP addresses anymore. In that case, we end up with packets of the form 0.0.0.0 -> 192.168.0.0 leaving on the wire. Obviously, they don't go far and end up dropped. 2. If bpf_lxc is reloaded before the new bpf_host, then it will reuse skb->cb for something else and the XFRM layer will handle the outer IP addresses. But because bpf_host is still on the old implementation, it will try to use skb->cb to rewrite the outer IP addresses. We thus end up with gibberish outer destination IP addresses. One way to fix this is to have bpf_host support both implementations. This is what this commit does. The logic to rewrite the outer IP addresses is reintroduced in bpf_host, but it is only executed if the outer source IP address is 0.0.0.0. That way, we will only rewrite the outer IP addresses if bpf_lxc is on the old implementation and the XFRM layer didn't write the proper outer IPs. Fixes: 4c7cce1 ("bpf: Remove IP_POOLS IPsec code") Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit ca9c056 ] As explained in the previous commit, we need to switch our IPsec logic from one implementation to another. This implementation requires some synchronized work between bpf_lxc and bpf_host. To enable this switch without causing drops, the previous commit made bpf_host support both implementations. This is quite enough though. For this to work, we need to ensure that bpf_host is always reloaded before any bpf_lxc is loaded. That is, we need to load the bpf_host program that supports both implementations before we actually start the switch from one implementation to the second. This commit makes that change in the order of BPF program reloads. Instead of regenerating the bpf_host program (i.e., the host endpoint's datapath) in a goroutine like other BPF programs, we will regenerate it first, as a blocking operation. Regenerating the host endpoint's datapath separately like this will delay the agent startup. This regeneration was measured to take around 1 second on an EKS cluster (though it can probably grow to a few seconds depending on the node type and current load). That should stay fairly small compared to the overall duration of the agent startup (around 30 seconds). Nevertheless, this separate regeneration is only performed when we actually need: for IPsec with EKS or AKS IPAM mode. Fixes: 4c7cce1 ("bpf: Remove IP_POOLS IPsec code") Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit c0d9b8c ] Commit 73c36d4 ("ipsec: Match OUT XFRM states & policies using node IDs") changed our XFRM states to match on packet marks of the form 0xXXXXYe00/0xffffff00 where XXXX is the node ID and Y is the SPI. The previous format for the packet mark in XFRM states was 0xYe00/0xff00. According to the Linux kernel these two states conflict (because 0xXXXXYe00/0xffffff00 ∈ 0xYe00/0xff00). That means we can't add the new state while the old one is around. Thus, in commit ddd491b ("ipsec: Custom check for XFRM state existence"), we removed any old conflicting XFRM state before adding the new ones. That however causes packet drops on upgrades because we may remove the old XFRM state before bpf_lxc has been updated to use the new 0xXXXXYe00/0xffffff00 mark. Instead, we would need both XFRM state formats to coexist for the duration of the upgrade. Impossible, you say! Don't despair. Things are actually a bit more complicated (it's IPsec and Linux after all). While Linux doesn't allow us to add 0xXXXXYe00/0xffffff00 when 0xYe00/0xff00 exists, it does allow adding in the reverse order. That seems to be because 0xXXXXYe00/0xffffff00 ∈ 0xYe00/0xff00 but 0xYe00/0xff00 ∉ 0xXXXXYe00/0xffffff00 [1]. Therefore, to have both XFRM states coexist, we can remove the old state, add the new one, then re-add the old state. That is allowed because we never try to add the new state when the old is present. During the short period of time when we have removed the old XFRM state, we can have a packet drops due to the missing state. These drops should be limited to the specific node pair this XFRM state is handling. This will also only happen on upgrades. Finally, this shouldn't happen with ENI and Azure IPAM modes because they don't have such old conflicting states. I tested this upgrade path on a 20-nodes GKE cluster running our drop-sensitive application, migrate-svc, scaled up to 50 clients and 30 backends. I didn't get a single packet drop despite the application consistently sending packets back and forth between nodes. Thus, I think the window for drops to happen is really small. Diff before/after the upgrade (v1.13.0 -> thi patch, GKE): src 10.24.1.77 dst 10.24.2.207 proto esp spi 0x00000003 reqid 1 mode tunnel replay-window 0 mark 0x3e00/0xff00 output-mark 0xe00/0xf00 aead rfc4106(gcm(aes)) 0xfc2d0c4e646b87ff2d0801b57997e3598eab0d6b 128 - anti-replay context: seq 0x0, oseq 0x2c, bitmap 0x00000000 + anti-replay context: seq 0x0, oseq 0x16, bitmap 0x00000000 sel src 0.0.0.0/0 dst 0.0.0.0/0 + src 10.24.1.77 dst 10.24.2.207 + proto esp spi 0x00000003 reqid 1 mode tunnel + replay-window 0 + mark 0x713f3e00/0xffffff00 output-mark 0xe00/0xf00 + aead rfc4106(gcm(aes)) 0xfc2d0c4e646b87ff2d0801b57997e3598eab0d6b 128 + anti-replay context: seq 0x0, oseq 0x0, bitmap 0x00000000 + sel src 0.0.0.0/0 dst 0.0.0.0/0 We can notice that the counters for the existing XFRM state also changed (decreased). That's expected since the state got recreated. 1 - I think this is because XFRM states don't have priorities. So when two XFRM states would match a given packet (in our case a packet with mark XXXXYe00), the oldest XFRM state is taken. Thus, by not allowing to add a more specific match after a more generic one, the kernel ensures that the more specific match is always taken when both match a packet. That likely corresponds to user expectations. That is, if both 0xXXXXYe00/0xffffff00 and 0xYe00/0xff00 match a packet, we would probably expect 0xXXXXYe00/0xffffff00 to be used. Fixes: ddd491b ("ipsec: Custom check for XFRM state existence") Fixes: 73c36d4 ("ipsec: Match OUT XFRM states & policies using node IDs") Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
pchaigno
added
area/encryption
Impacts encryption support such as IPSec, WireGuard, or kTLS.
release-blocker/1.13
This issue will prevent the release of the next version of Cilium.
labels
Jun 5, 2023
maintainer-s-little-helper
bot
added
backport/1.13
This PR represents a backport for Cilium 1.13.x of a PR that was merged to main.
kind/backports
This PR provides functionality previously merged into master.
labels
Jun 5, 2023
/test-backport-1.13 |
I tested my patches on 3-nodes GKE and EKS clusters, by upgrading from v1.13.0 to the images from this pull request. I checked that no packet drops occurred during the upgrade and that connectivity tests passed after the upgrade. |
joamaki
approved these changes
Jun 6, 2023
dylandreimerink
approved these changes
Jun 6, 2023
maintainer-s-little-helper
bot
added
the
ready-to-merge
This PR has passed all tests and received consensus from code owners to merge.
label
Jun 6, 2023
This was referenced Jun 9, 2023
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
area/encryption
Impacts encryption support such as IPSec, WireGuard, or kTLS.
backport/1.13
This PR represents a backport for Cilium 1.13.x of a PR that was merged to main.
kind/backports
This PR provides functionality previously merged into master.
ready-to-merge
This PR has passed all tests and received consensus from code owners to merge.
release-blocker/1.13
This issue will prevent the release of the next version of Cilium.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
XfrmInNoStates
drops on ESK & AKS upgrades #25724 -- ipsec: FixXfrmInNoStates
drops on ESK & AKS upgrades (@pchaigno)XfrmOutPolBlock
drops on upgrades #25735 -- ipsec: FixXfrmOutPolBlock
drops on upgrades (@pchaigno)Once this PR is merged, you can update the PR labels via: