Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ipsec: Clear node ID and SPI with output-mark #28258

Merged

Conversation

pchaigno
Copy link
Member

@pchaigno pchaigno commented Sep 25, 2023

TL;DR. This pull request works around a Cilium iptables bug that can affect IPsec. The root of the issue for IPsec is a conflict between the packet mark we use and some iptables-nft rules. The workaround changes the packet mark to avoid the conflict.

Cilium's IPsec implementation relies on the packet mark to match packets against XFRM rules. In particular, the packet mark holds the node ID (0xffff0000), the SPI (0xf000), and whether to encrypt or decrypt (0xf00). On egress, the packet mark is written in bpf_lxc before the packet is sent to the stack for encryption. The encryption layer retains the packet mark, except for the encryption bit (0xf00) in some cases. bpf_host then processes the encrypted packet and clears the mark.

In some corner cases (what causes this isn't clear yet), some OSes can end up with kube-proxy rules in both iptables-nft and iptables-legacy. Cilium is currently unable to handle that situation properly: it should install iptables rules to skip some of kube-proxy's rules, but only does that for iptables-legacy. We can therefore end up with the following rules matching outgoing packets:

[31061:1736670] -A POSTROUTING -m comment --comment "kubernetes postrouting rules" -j KUBE-POSTROUTING
[31059:1736390] -A KUBE-POSTROUTING -m mark ! --mark 0x4000/0x4000 -j RETURN
[2:280] -A KUBE-POSTROUTING -j MARK --set-xmark 0x4000/0x0
[2:280] -A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -j MASQUERADE --random-fully

We can see that kube-proxy install a KUBE-POSTROUTING chain in the POSTROUTING netfilter chain. That chain skips all packets not marked with 0x4XXX. Other packets see the mark removed and are then masqueraded. In the above example, 2 packets had the 0x4XXX mark and were masqueraded.

These rules can end up matching the mark of our encrypted packets, if the SPI is 4. Encrypted packets are then incorrectly masqueraded. Depending on the node and network configuration, they may end up dropped.

The SPI is incremented on key rotation (from 1 to 15, then going back to 1 again). Users performing key rotations are therefore highly likely to hit this bug if they have iptables-nft kube-proxy rules installed.

We can work around this bug by clearing all packet mark bits we don't need anymore after the encryption. Once the packet is encrypted, bpf_host only needs the 0xf00 part of the mark to determine if a packet is encrypted or not. The node ID and SPI can be cleared from the mark. This pull request therefore clears those bits right after encryption, with output-mark, on egress and ingress.

Note this only avoids the IPsec impact of this bug, but doesn't fix the underlying issue with iptables-nft. More work is needed to have the agent correctly handle this situation.

Fix a bug that causes pod-to-pod traffic between nodes to be dropped when IPsec is enabled and kube-proxy installed rules in both iptables-nft and iptables-legacy.

@pchaigno pchaigno added kind/bug This is a bug in the Cilium logic. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. release-note/bug This PR fixes an issue in a previous release of Cilium. area/encryption Impacts encryption support such as IPSec, WireGuard, or kTLS. needs-backport/1.12 needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch labels Sep 25, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.14.3 Sep 25, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.13.8 Sep 25, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.12.15 Sep 25, 2023
@pchaigno pchaigno force-pushed the fix-ipsec-mark-conflict-iptables-nft branch from fd42c29 to 21f5ac5 Compare September 28, 2023 12:31
TL;DR. This commit works around a Cilium iptables bug that can affect
IPsec. The root of the issue for IPsec is a conflict between the packet
mark we use and some iptables-nft rules. The workaround changes the
packet mark to avoid the conflict.

Cilium's IPsec implementation relies on the packet mark to match
packets against XFRM rules. In particular, the packet mark holds the
node ID (0xffff0000), the SPI (0xf000), and whether to encrypt or
decrypt (0xf00). On egress, the packet mark is written in bpf_lxc before
the packet is sent to the stack for encryption. The encryption layer
retains the packet mark, except for the encryption bit (0xf00) in some
cases. bpf_host then processes the encrypted packet and clears the mark.

In some corner cases (what causes this isn't clear yet), some OSes can
end up with kube-proxy rules in both iptables-nft and iptables-legacy.
Cilium is currently unable to handle that situation properly: it should
install iptables rules to skip some of kube-proxy's rules, but only does
that for iptables-legacy. We can therefore end up with the following
rules matching outgoing packets:

    [31061:1736670] -A POSTROUTING -m comment --comment "kubernetes postrouting rules" -j KUBE-POSTROUTING
    [31059:1736390] -A KUBE-POSTROUTING -m mark ! --mark 0x4000/0x4000 -j RETURN
    [2:280] -A KUBE-POSTROUTING -j MARK --set-xmark 0x4000/0x0
    [2:280] -A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -j MASQUERADE --random-fully

We can see that kube-proxy install a KUBE-POSTROUTING chain in the
POSTROUTING netfilter chain. That chain skips all packets not marked
with 0x4XXX. Other packets see the mark removed and are then
masqueraded. In the above example, 2 packets had the 0x4XXX mark and
were masqueraded.

These rules can end up matching the mark of our encrypted packets, if
the SPI is 4. Encrypted packets are then incorrectly masqueraded.
Depending on the node and network configuration, they may end up
dropped.

The SPI is incremented on key rotation (from 1 to 15, then going back to
1 again). Users performing key rotations are therefore highly likely to
hit this bug if they have iptables-nft kube-proxy rules installed.

We can work around this bug by clearing all packet mark bits we don't
need anymore after the encryption. Once the packet is encrypted,
bpf_host only needs the 0xf00 part of the mark to determine if a packet
is encrypted or not. The node ID and SPI can be cleared from the mark.
This commit therefore clears those bits right after encryption, with
output-mark, on egress and ingress.

Note this only avoids the IPsec impact of this bug, but doesn't fix the
underlying issue with iptables-nft. More work is needed to have the
agent correctly handle this situation.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
@pchaigno pchaigno force-pushed the fix-ipsec-mark-conflict-iptables-nft branch from 21f5ac5 to 80fb65d Compare September 29, 2023 10:08
@pchaigno pchaigno changed the title ipsec: Cover node ID and SPI with output-mark ipsec: Clear node ID and SPI with output-mark Sep 29, 2023
@pchaigno
Copy link
Member Author

/test

@pchaigno pchaigno marked this pull request as ready for review September 29, 2023 15:12
@pchaigno pchaigno requested review from a team as code owners September 29, 2023 15:12
@jschwinger233 jschwinger233 self-requested a review October 3, 2023 08:10
Copy link
Member

@jschwinger233 jschwinger233 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@pchaigno pchaigno merged commit def1770 into cilium:main Oct 3, 2023
61 checks passed
@pchaigno pchaigno deleted the fix-ipsec-mark-conflict-iptables-nft branch October 3, 2023 09:26
@sayboras sayboras mentioned this pull request Oct 6, 2023
9 tasks
@sayboras sayboras added backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. and removed needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch labels Oct 6, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from main to Backport pending to v1.14 in 1.14.3 Oct 6, 2023
@julianwiedmann julianwiedmann added release-blocker/1.12 This issue will prevent the release of the next version of Cilium. release-blocker/1.13 This issue will prevent the release of the next version of Cilium. release-blocker/1.14 This issue will prevent the release of the next version of Cilium. labels Oct 6, 2023
@sayboras sayboras mentioned this pull request Oct 6, 2023
5 tasks
@sayboras sayboras added backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. and removed needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch labels Oct 6, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from main to Backport pending to v1.13 in 1.13.8 Oct 6, 2023
@sayboras sayboras mentioned this pull request Oct 6, 2023
5 tasks
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from main to Backport pending to v1.12 in 1.12.15 Oct 6, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed this from Backport pending to v1.12 in 1.12.15 Oct 6, 2023
@github-actions github-actions bot added backport-done/1.12 The backport for Cilium 1.12.x for this PR is done. backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. and removed backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. labels Oct 6, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed this from Backport pending to v1.13 in 1.13.8 Oct 6, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Backport done to v1.12 in 1.12.15 Oct 6, 2023
@github-actions github-actions bot added backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. and removed backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. labels Oct 9, 2023
@jrajahalme jrajahalme moved this from Backport pending to v1.14 to Backport done to v1.14 in 1.14.3 Oct 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/encryption Impacts encryption support such as IPSec, WireGuard, or kTLS. backport-done/1.12 The backport for Cilium 1.12.x for this PR is done. backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. kind/bug This is a bug in the Cilium logic. release-blocker/1.12 This issue will prevent the release of the next version of Cilium. release-blocker/1.13 This issue will prevent the release of the next version of Cilium. release-blocker/1.14 This issue will prevent the release of the next version of Cilium. release-note/bug This PR fixes an issue in a previous release of Cilium. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
Projects
No open projects
1.12.15
Backport done to v1.12
1.14.3
Backport done to v1.14
Development

Successfully merging this pull request may close these issues.

None yet

4 participants