
v1.13 backports 2023-07-12 #26792

Merged
merged 6 commits into from
Jul 14, 2023

Conversation

pchaigno
Member

@pchaigno pchaigno commented Jul 12, 2023

Once this PR is merged, you can update the PR labels via:

    $ for pr in 25440 26708; do contrib/backporting/set-labels.py $pr done 1.13; done

[ upstream commit fd6fa25 ]

TL;DR: this commit removes a bit of dead code that seems to have been
intended for IPsec in native routing mode but is never actually
executed.

These code paths are only executed if going through cilium_host and
coming from the host (see !from_host check above). For remote
destinations, we only go through cilium_host if the destination is part
of a remote pod CIDR and we are running in tunneling mode. In native
routing mode, we go straight to the native device.

Example routing table for tunneling (10.0.0.0/24 is the remote pod CIDR):

    10.0.0.0/24 via 10.0.1.61 dev cilium_host src 10.0.1.61 mtu 1373 <- we follow this
    10.0.1.0/24 via 10.0.1.61 dev cilium_host src 10.0.1.61
    10.0.1.61 dev cilium_host scope link
    192.168.56.0/24 dev enp0s8 proto kernel scope link src 192.168.56.11

Example routing table for native routing:

    10.0.0.0/24 via 192.168.56.12 dev enp0s8 <- we follow this
    10.0.1.0/24 via 10.0.1.61 dev cilium_host src 10.0.1.61
    10.0.1.61 dev cilium_host scope link
    192.168.56.0/24 dev enp0s8 proto kernel scope link src 192.168.56.11

Thus, this code path is only used for tunneling with IPsec. However,
IPsec in tunneling mode should already be handled by the
encap_and_redirect_with_nodeid call above in the same functions (see
info->key argument).

So why was this added? It was added in commit b76e6eb ("cilium:
ipsec, support direct routing modes") to support "direct routing modes".
I found that very suspicious because, per the above, in native routing
mode, traffic from the hostns shouldn't even go through cilium_host.

I thus tested it out. I've checked IPsec with native routing mode, with
and without endpoint routes. I can confirm that, in all those cases,
traffic from the hostns is not encrypted when going to a remote pod.

Therefore, this code is dead. I'm unsure when it died.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 5fe2b2d ]

In pod-to-pod encryption with IPsec and tunneling, Cilium currently
encrypts traffic on the path hostns -> remote pod even though traffic
is in plain-text on the path remote pod -> hostns. When using native
routing, neither of those paths is encrypted because traffic from the
hostns doesn't go through the bpf_host BPF program.

Cilium's Transparent Encryption with IPsec aims at encrypting pod-to-pod
traffic. It is therefore unclear why we are encrypting traffic from the
hostns. The simple fact that only one direction of the connection is
encrypted raises the question of its usefulness. It's possible that this
traffic was encrypted by mistake: some of this logic is necessary for
node-to-node encryption with IPsec (not supported anymore) and
pod-to-pod encryption may have been somewhat simplified to encrypt
*-to-pod traffic.

Encrypting traffic from the hostns nevertheless creates several issues.
First, this situation creates a path asymmetry between the forward and
reply paths of hostns<>remote pod connections. Path asymmetry issues are
well known to be a source of bugs, from '--ctstate INVALID -j DROP'
iptables rules to NAT issues.

Second, Gray recently uncovered a separate bug which, when combined with
this encryption from hostns, can prevent Cilium from starting. That
separate bug is still being investigated but it seems to cause the
reload of bpf_host to depend on Cilium connecting to etcd in a
clustermesh context. If this etcd is a remote pod, Cilium connects to it
on the hostns -> remote pod path. Because the bpf_host program is
unloaded[1], that connection fails. We end up in a cyclic dependency:
bpf_host requires connectivity to etcd, connectivity to etcd requires
bpf_host.

This commit therefore removes encryption with IPsec for the path hostns
-> remote pod when using tunneling (already unencrypted when using
native routing).

1 - More specifically, in Gray's case, the bpf_host program is already
    loaded, but it needs to be reloaded because the IPsec XFRM config
    changed. Without this reload, encryption fails.
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 3b3e8d0 ]

For similar reasons as in the previous commit, we don't want to
encrypt traffic going from a pod to the CiliumInternalIP. This is
currently the only node IP address type that is associated with an
encryption key.

Since we don't encrypt traffic from the hostns to remote pods anymore
(see previous commit), encrypting traffic going to a CiliumInternalIP
(remote node) would result in a path asymmetry: traffic going to the
CiliumInternalIP would be encrypted, whereas reply traffic coming from
the CiliumInternalIP wouldn't.

This commit removes that case and therefore ensures we never encrypt
traffic going to a node IP address.

Reported-by: Gray Lian <gray.liang@isovalent.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit ebd02f1 ]

On IPAM modes with one pod CIDR per node, the XFRM OUT policies look like below:

    src 10.0.1.0/24 dst 10.0.0.0/24
        dir out priority 0 ptype main
        mark 0x66d11e00/0xffffff00
        tmpl src 10.0.1.13 dst 10.0.0.214
            proto esp spi 0x00000001 reqid 1 mode tunnel

When sending traffic from the hostns, however, it may not match the
source CIDR above. Traffic from the hostns may indeed leave the node
with the NodeInternalIP as the source IP (vs. CiliumInternalIP which
would match).

In such cases, we don't match the XFRM OUT policy and fall back to the
catch-all default-drop rule, ending up with XfrmOutPolBlock packet
drops.

Why wasn't this an issue before? It was. Traffic would simply go in
plain-text (which is okay given we never intended to encrypt hostns
traffic in the first place). What changes is that we now have a
catch-all default-drop XFRM OUT policy to avoid leaking plain-text
traffic. So it now results in XfrmOutPolBlock errors.

In commit 5fe2b2d ("bpf: Don't encrypt on path hostns -> remote pod")
we removed encryption for the path hostns -> remote pod. Unfortunately,
that doesn't mean the issue is completely gone. On a new Cilium install,
we won't see this issue of XfrmOutPolBlock drops for hostns traffic
anymore. But on existing clusters, we will still see those drops during
the upgrade, after the default-drop rule is installed but before hostns
traffic encryption is removed.

None of this is an issue on AKS and ENI IPAM modes because there, the
XFRM OUT policies look like:

    src 0.0.0.0/0 dst 10.0.0.0/16
        dir out priority 0 ptype main
        mark 0x66d11e00/0xffffff00
        tmpl src 10.0.1.13 dst 10.0.0.214
            proto esp spi 0x00000001 reqid 1 mode tunnel

Thus, hostns -> remote pod traffic is matched regardless of the source
IP being selected and packets are not dropped by the default-drop rule.

We can therefore avoid the upgrade drops by changing the XFRM OUT
policies to never match on the source IPs, as on AKS and ENI IPAM modes.

Fixes: 7d44f37 ("ipsec: Catch-default default drop policy for encryption")
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 0a8f2c4 ]

Commits 3b3e8d0 ("node: Don't encrypt traffic to CiliumInternalIP")
and 5fe2b2d ("bpf: Don't encrypt on path hostns -> remote pod")
removed a path asymmetry on the paths hostns <> remote pod. They however
failed to remove workarounds that we have for this path asymmetry.

In particular, we would encrypt packets on the path pod -> remote node
(set the SPI in the node manager) to try to avoid the path asymmetry by
also encrypting the replies. And, as a result of this first change, we
would also need to handle a corner case in the datapath to apply the
correct reverse DNAT for reply traffic.

All of that is unnecessary now that we fixed the path asymmetry.

Fixes: 3b3e8d0 ("node: Don't encrypt traffic to CiliumInternalIP")
Fixes: 5fe2b2d ("bpf: Don't encrypt on path hostns -> remote pod")
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 420d7fa ]

Commits 4c7cce1 ("bpf: Remove IP_POOLS IPsec code") and 19a62da
("bpf: Lookup tunnel endpoint for IPsec rewrite") changed the way we
pass the tunnel endpoint. We used to pass it via skb->cb[4] and read it
in bpf_host to build the encapsulation. We changed that in the above
commits to pass the identity via skb->cb[4] instead.

Therefore, on upgrades, for a short while, we may end up with bpf_lxc
writing the identity into skb->cb[4] (new datapath version) and
bpf_host interpreting it as the tunnel endpoint (old version).

Reloading bpf_host before bpf_lxc is not enough to fix it because then,
for a short while, bpf_lxc would write the tunnel endpoint in skb->cb[4]
(old version) and bpf_host would interpret it as the security identity
(new version).

In addition to reloading bpf_host first, we also need to make sure that
it can handle both cases (skb->cb[4] has tunnel endpoint or identity).
To distinguish between those two cases and interpret skb->cb[4]
correctly, bpf_host will rely on the first byte: in the case of the
tunnel endpoint, it can't be zero because that would mean the IP address
is within the special-purpose range 0.0.0.0/8; in the case of the
identity, it has to be zero because identities are on 24 bits.

This commit implements those changes. Commit ca9c056 ("daemon:
Reload bpf_host first in case of IPsec upgrade") had already made the
agent reload bpf_host first for ENI and Azure IPAM modes, so we just
need to extend it to all IPAM modes.

Note that the above bug on upgrades doesn't cause an immediate packet
drop at the sender. Instead, it seems the packet is encrypted twice. The
(unverified) assumption here is that the encapsulation is skipped
because the tunnel endpoint IP address is invalid (being a security
identity on 24-bits). The encrypted packet is then sent again to
cilium_host where the encryption bit is reapplied (given the destination
IP address is a CiliumInternalIP). And it goes through the XFRM
encryption again.

On the receiver's side, the packet is decrypted once as expected. It is
then recirculated to bpf_overlay which removes the packet mark. Given it
is still an ESP (encrypted) packet, it goes back through the XFRM
decryption layer. But since the packet mark is now zero, it fails to
match any XFRM IN state. The packet is dropped with XfrmInNoStates. This
can be seen in the following trace:

    <- overlay encrypted  flow 0x6fc46fc4 , identity unknown->unknown state new ifindex cilium_vxlan orig-ip 0.0.0.0: 10.0.9.91 -> 10.0.8.32
    -> stack encrypted  flow 0x6fc46fc4 , identity 134400->44 state new ifindex cilium_vxlan orig-ip 0.0.0.0: 10.0.9.91 -> 10.0.8.32
    <- overlay encrypted  flow 0x6fc46fc4 , identity unknown->unknown state new ifindex cilium_vxlan orig-ip 0.0.0.0: 10.0.9.91 -> 10.0.8.32
    -> host from flow 0x6fc46fc4 , identity unknown->43 state unknown ifindex cilium_vxlan orig-ip 0.0.0.0: 10.0.9.91 -> 10.0.8.32
    -> stack flow 0x6fc46fc4 , identity unknown->unknown state unknown ifindex cilium_host orig-ip 0.0.0.0: 10.0.9.91 -> 10.0.8.32

The packet comes from the overlay encrypted, is sent to the stack to be
decrypted, and comes back still encrypted.

Fixes: 4c7cce1 ("bpf: Remove IP_POOLS IPsec code")
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
@pchaigno pchaigno added upgrade-impact This PR has potential upgrade or downgrade impact. area/encryption Impacts encryption support such as IPSec, WireGuard, or kTLS. labels Jul 12, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added backport/1.13 This PR represents a backport for Cilium 1.13.x of a PR that was merged to main. kind/backports This PR provides functionality previously merged into master. labels Jul 12, 2023
@pchaigno
Member Author

/test-backport-1.13

@pchaigno
Member Author

pchaigno commented Jul 12, 2023

I manually tested:

  • Scale up and down
  • Upgrade from v1.12.7
  • Upgrade from v1.13.0
  • Key rotation

On:

  • EKS + overlay
  • EKS + ENI

Overlay:
I had XfrmOutPolBlock errors during scale up/down, but those are expected given John's fixes are not in yet.
Also XfrmOutPolBlock errors during the key rotation. Unrelated as I can reproduce starting with v1.13.2.

ENI:
XfrmOutPolBlock errors during the key rotation. Starting with v1.13.2, same as in overlay case.

Comment on lines -1269 to -1271
    if ((magic & MARK_MAGIC_HOST_MASK) == MARK_MAGIC_ENCRYPT) {
        ctx->mark = magic; /* CB_ENCRYPT_MAGIC */
        src_id = ctx_load_meta(ctx, CB_ENCRYPT_IDENTITY);
Member Author


For reviewers: On a previous backport of this change, Julian commented:

Just to note that in v1.13 the wireguard implementation is different, and from-container uses set_encrypt_mark(). So I hope our reasoning why we were able to remove this section in main also applies to v1.13...

Martynas confirmed that this code isn't used in the case of WireGuard on v1.13 and older. Instead, traffic marked for WireGuard encryption is sent to cilium_wg0 and then traverses to-host without a mark.

Member


I must not have been paying sufficient attention during that review. In the from-container path, WireGuard uses set_encrypt_mark() - so MARK_MAGIC_ENCRYPT ends up in skb->mark. But here we obtained the magic value from the skb->cb (note the ctx_load_meta()).

So unless there's some magic at play that transfers the skb->mark to the skb->cb, there's no way this could have even conflicted. Sorry for burning your cycles on this :(.

Member Author


No worries at all. Better have unnecessary checks than missing something.

@pchaigno pchaigno marked this pull request as ready for review July 13, 2023 15:37
@pchaigno pchaigno requested a review from a team as a code owner July 13, 2023 15:37
Member

@jschwinger233 jschwinger233 left a comment


👍

Member

@julianwiedmann julianwiedmann left a comment


lgtm, thank you!

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jul 14, 2023
@julianwiedmann julianwiedmann merged commit 982eb76 into cilium:v1.13 Jul 14, 2023
62 checks passed
@pchaigno pchaigno deleted the pr/v1.13-backport-2023-07-12 branch July 14, 2023 12:58