-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ipsec: Clear node ID and SPI with output-mark #28258
Merged
pchaigno
merged 1 commit into
cilium:main
from
pchaigno:fix-ipsec-mark-conflict-iptables-nft
Oct 3, 2023
Merged
ipsec: Clear node ID and SPI with output-mark #28258
pchaigno
merged 1 commit into
cilium:main
from
pchaigno:fix-ipsec-mark-conflict-iptables-nft
Oct 3, 2023
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
pchaigno
added
kind/bug
This is a bug in the Cilium logic.
sig/datapath
Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
release-note/bug
This PR fixes an issue in a previous release of Cilium.
area/encryption
Impacts encryption support such as IPSec, WireGuard, or kTLS.
needs-backport/1.12
needs-backport/1.13
This PR / issue needs backporting to the v1.13 branch
needs-backport/1.14
This PR / issue needs backporting to the v1.14 branch
labels
Sep 25, 2023
pchaigno
force-pushed
the
fix-ipsec-mark-conflict-iptables-nft
branch
from
September 28, 2023 12:31
fd42c29
to
21f5ac5
Compare
TL;DR. This commit works around a Cilium iptables bug that can affect IPsec. The root of the issue for IPsec is a conflict between the packet mark we use and some iptables-nft rules. The workaround changes the packet mark to avoid the conflict. Cilium's IPsec implementation relies on the packet mark to match packets against XFRM rules. In particular, the packet mark holds the node ID (0xffff0000), the SPI (0xf000), and whether to encrypt or decrypt (0xf00). On egress, the packet mark is written in bpf_lxc before the packet is sent to the stack for encryption. The encryption layer retains the packet mark, except for the encryption bit (0xf00) in some cases. bpf_host then processes the encrypted packet and clears the mark. In some corner cases (what causes this isn't clear yet), some OSes can end up with kube-proxy rules in both iptables-nft and iptables-legacy. Cilium is currently unable to handle that situation properly: it should install iptables rules to skip some of kube-proxy's rules, but only does that for iptables-legacy. We can therefore end up with the following rules matching outgoing packets: [31061:1736670] -A POSTROUTING -m comment --comment "kubernetes postrouting rules" -j KUBE-POSTROUTING [31059:1736390] -A KUBE-POSTROUTING -m mark ! --mark 0x4000/0x4000 -j RETURN [2:280] -A KUBE-POSTROUTING -j MARK --set-xmark 0x4000/0x0 [2:280] -A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -j MASQUERADE --random-fully We can see that kube-proxy install a KUBE-POSTROUTING chain in the POSTROUTING netfilter chain. That chain skips all packets not marked with 0x4XXX. Other packets see the mark removed and are then masqueraded. In the above example, 2 packets had the 0x4XXX mark and were masqueraded. These rules can end up matching the mark of our encrypted packets, if the SPI is 4. Encrypted packets are then incorrectly masqueraded. Depending on the node and network configuration, they may end up dropped. The SPI is incremented on key rotation (from 1 to 15, then going back to 1 again). Users performing key rotations are therefore highly likely to hit this bug if they have iptables-nft kube-proxy rules installed. We can work around this bug by clearing all packet mark bits we don't need anymore after the encryption. Once the packet is encrypted, bpf_host only needs the 0xf00 part of the mark to determine if a packet is encrypted or not. The node ID and SPI can be cleared from the mark. This commit therefore clears those bits right after encryption, with output-mark, on egress and ingress. Note this only avoids the IPsec impact of this bug, but doesn't fix the underlying issue with iptables-nft. More work is needed to have the agent correctly handle this situation. Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
pchaigno
force-pushed
the
fix-ipsec-mark-conflict-iptables-nft
branch
from
September 29, 2023 10:08
21f5ac5
to
80fb65d
Compare
pchaigno
changed the title
ipsec: Cover node ID and SPI with output-mark
ipsec: Clear node ID and SPI with output-mark
Sep 29, 2023
/test |
jschwinger233
approved these changes
Oct 3, 2023
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
sayboras
added
backport-pending/1.14
The backport for Cilium 1.14.x for this PR is in progress.
and removed
needs-backport/1.14
This PR / issue needs backporting to the v1.14 branch
labels
Oct 6, 2023
maintainer-s-little-helper
bot
moved this from Needs backport from main
to Backport pending to v1.14
in 1.14.3
Oct 6, 2023
julianwiedmann
added
release-blocker/1.12
This issue will prevent the release of the next version of Cilium.
release-blocker/1.13
This issue will prevent the release of the next version of Cilium.
release-blocker/1.14
This issue will prevent the release of the next version of Cilium.
labels
Oct 6, 2023
sayboras
added
backport-pending/1.13
The backport for Cilium 1.13.x for this PR is in progress.
and removed
needs-backport/1.13
This PR / issue needs backporting to the v1.13 branch
labels
Oct 6, 2023
maintainer-s-little-helper
bot
moved this from Needs backport from main
to Backport pending to v1.13
in 1.13.8
Oct 6, 2023
maintainer-s-little-helper
bot
moved this from Needs backport from main
to Backport pending to v1.12
in 1.12.15
Oct 6, 2023
github-actions
bot
added
backport-done/1.12
The backport for Cilium 1.12.x for this PR is done.
backport-done/1.13
The backport for Cilium 1.13.x for this PR is done.
and removed
backport-pending/1.13
The backport for Cilium 1.13.x for this PR is in progress.
labels
Oct 6, 2023
github-actions
bot
added
backport-done/1.14
The backport for Cilium 1.14.x for this PR is done.
and removed
backport-pending/1.14
The backport for Cilium 1.14.x for this PR is in progress.
labels
Oct 9, 2023
This was referenced Oct 17, 2023
jrajahalme
moved this from Backport pending to v1.14
to Backport done to v1.14
in 1.14.3
Oct 18, 2023
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
area/encryption
Impacts encryption support such as IPSec, WireGuard, or kTLS.
backport-done/1.12
The backport for Cilium 1.12.x for this PR is done.
backport-done/1.13
The backport for Cilium 1.13.x for this PR is done.
backport-done/1.14
The backport for Cilium 1.14.x for this PR is done.
kind/bug
This is a bug in the Cilium logic.
release-blocker/1.12
This issue will prevent the release of the next version of Cilium.
release-blocker/1.13
This issue will prevent the release of the next version of Cilium.
release-blocker/1.14
This issue will prevent the release of the next version of Cilium.
release-note/bug
This PR fixes an issue in a previous release of Cilium.
sig/datapath
Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
TL;DR. This pull request works around a Cilium iptables bug that can affect IPsec. The root of the issue for IPsec is a conflict between the packet mark we use and some iptables-nft rules. The workaround changes the packet mark to avoid the conflict.
Cilium's IPsec implementation relies on the packet mark to match packets against XFRM rules. In particular, the packet mark holds the node ID (
0xffff0000
), the SPI (0xf000
), and whether to encrypt or decrypt (0xf00
). On egress, the packet mark is written in bpf_lxc before the packet is sent to the stack for encryption. The encryption layer retains the packet mark, except for the encryption bit (0xf00
) in some cases. bpf_host then processes the encrypted packet and clears the mark.In some corner cases (what causes this isn't clear yet), some OSes can end up with kube-proxy rules in both iptables-nft and iptables-legacy. Cilium is currently unable to handle that situation properly: it should install iptables rules to skip some of kube-proxy's rules, but only does that for iptables-legacy. We can therefore end up with the following rules matching outgoing packets:
We can see that kube-proxy install a KUBE-POSTROUTING chain in the POSTROUTING netfilter chain. That chain skips all packets not marked with
0x4XXX
. Other packets see the mark removed and are then masqueraded. In the above example, 2 packets had the0x4XXX
mark and were masqueraded.These rules can end up matching the mark of our encrypted packets, if the SPI is 4. Encrypted packets are then incorrectly masqueraded. Depending on the node and network configuration, they may end up dropped.
The SPI is incremented on key rotation (from 1 to 15, then going back to 1 again). Users performing key rotations are therefore highly likely to hit this bug if they have iptables-nft kube-proxy rules installed.
We can work around this bug by clearing all packet mark bits we don't need anymore after the encryption. Once the packet is encrypted, bpf_host only needs the
0xf00
part of the mark to determine if a packet is encrypted or not. The node ID and SPI can be cleared from the mark. This pull request therefore clears those bits right after encryption, with output-mark, on egress and ingress.Note this only avoids the IPsec impact of this bug, but doesn't fix the underlying issue with iptables-nft. More work is needed to have the agent correctly handle this situation.