Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ipsec: Fix XfrmOutPolBlock drops on upgrades #25735

Merged
merged 5 commits into from
Jun 2, 2023

Conversation

pchaigno
Copy link
Member

@pchaigno pchaigno commented May 28, 2023

This pull request fixes IPsec drops of type XfrmOutPolBlock that would happen with all IPAM modes during upgrades. The root of the issue is that we would delete the old XFRM OUT policies and/or states instead of leaving them around for the duration of the upgrade.

This pull request makes various changes to ensure that we smoothly transition from the old IPsec implementation to the new. After this pull request, we shouldn't have any packet drops on upgrades when using ENI or Azure IPAM modes. Otherwise (for e.g. cluster-pool IPAM mode), it is theoretically possible to have a few drops of type XfrmOutNoStates; those however haven't been noticed in any of the upgrade tests. There is more information on those XfrmOutNoStates drops in the last commit of this pull request.

How was this tested?

The usual upgrade path from v1.13.0 to this patchset was tested on two clusters of 20 nodes each, on EKS and GKE. Before the upgrade, our drop-sensitive migrate-svc application was deployed with 50 clients and 30 backends. That application sends a continuous stream of messages back and forth and fails whenever a message is dropped.

After the upgrade to this patchset, I verified that:

  • All IPsec error counters were at zero.
  • None of the pods from the migrate-svc application had crashed.
  • The connectivity tests passed on the upgraded cluster.
  • No related errors or warnings were reported in the agent logs.

Fixes: #24773.
Fixes: #24030.

Fix a bug that would cause connectivity drops of type XfrmOutPolBlock on upgrade when IPsec is enabled.

@pchaigno pchaigno added kind/bug This is a bug in the Cilium logic. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. release-note/bug This PR fixes an issue in a previous release of Cilium. upgrade-impact This PR has potential upgrade or downgrade impact. area/encryption Impacts encryption support such as IPSec, WireGuard, or kTLS. integration/cloud Related to integration with cloud environments such as AKS, EKS, GKE, etc. needs-backport/1.11 needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch labels May 28, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.11.18 May 28, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.13.3 May 28, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.12.10 May 28, 2023
@pchaigno pchaigno force-pushed the ipsec-ippools-upgrade branch 5 times, most recently from ff21f54 to 1db1553 Compare May 30, 2023 17:00
@pchaigno pchaigno changed the title ipsec: Fix XfrmOutPolBlock drops on ESK & AKS upgrades ipsec: Fix XfrmOutPolBlock drops upgrades May 30, 2023
@pchaigno pchaigno marked this pull request as ready for review May 30, 2023 19:19
@pchaigno pchaigno requested review from a team as code owners May 30, 2023 19:19
Copy link
Member

@jschwinger233 jschwinger233 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way to make xfrm states coexist is marvelous!
Just a newbee question: we coexist xfrm states, but update xfrm policies directly, and upgrade test still passes, so where do we use xfrm policy?

@pchaigno
Copy link
Member Author

Just a newbee question: we coexist xfrm states, but update xfrm policies directly, and upgrade test still passes, so where do we use xfrm policy?

Actually, we don't update the XFRM policies directly. In the second commit, we deprioritize the old XFRM policies such that they can coexist with the new ones added as part of the upgrade. I've updated the commit descriptions to include example of how the XFRM configs will change.

If a packet comes into the XFRM subsystem, it will first be matched against the new XFRM policies with marks of the form 0xXXXXYe00/0xffffff00. If that doesn't match because the packet is still using the old mark format (because bpf_lxc hasn't been reloaded yet), then it will match against the deprioritized XFRM policies with marks of the form 0xYe00/0xff00.

Once we've matched against a policy, it will search for the matching XFRM state using the IP addresses from the policy's template. For example, with the below policies, it will look for an XFRM state with src 10.24.1.77 dst 10.24.2.207 that also matches the packet mark.

     src 10.24.1.0/24 dst 10.24.2.0/24
          dir out priority 50
          mark 0x3e00/0xff00
          tmpl src 10.24.1.77 dst 10.24.2.207
              proto esp spi 0x00000003 reqid 1 mode tunnel

@pchaigno pchaigno changed the title ipsec: Fix XfrmOutPolBlock drops upgrades ipsec: Fix XfrmOutPolBlock drops on upgrades May 31, 2023
@tommyp1ckles
Copy link
Contributor

Question: are the prioritized policies just left like this forever - could we not clean them up after some time?

I guess we'd have to know that all nodes have had their host endpoints reloaded before doing that...

(I'm wondering if leaving old policies around might provide other places for issues during future version upgrades

pkg/datapath/linux/ipsec/ipsec_linux.go Show resolved Hide resolved
daemon/cmd/state.go Show resolved Hide resolved
daemon/cmd/state.go Show resolved Hide resolved
pkg/datapath/linux/ipsec/ipsec_linux.go Show resolved Hide resolved
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from main to Backport pending to v1.13 in 1.13.4 Jun 5, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from main to Backport pending to v1.13 in 1.13.4 Jun 5, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from main to Backport pending to v1.13 in 1.13.4 Jun 5, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from main to Backport pending to v1.12 in 1.12.11 Jun 5, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from main to Backport pending to v1.11 in 1.11.18 Jun 5, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from main to Backport pending to v1.11 in 1.11.18 Jun 5, 2023
@qmonnet qmonnet removed this from Backport pending to v1.13 in 1.13.4 Jun 9, 2023
@qmonnet qmonnet removed this from Needs backport from main in 1.13.3 Jun 9, 2023
@qmonnet qmonnet added this to Needs backport from main in 1.13.5 Jun 9, 2023
@qmonnet qmonnet moved this from Needs backport from main to Backport pending to v1.13 in 1.13.5 Jun 9, 2023
@qmonnet qmonnet moved this from Backport pending to v1.13 to Backport done to v1.13 in 1.13.5 Jun 9, 2023
@qmonnet qmonnet removed this from Backport done to v1.13 in 1.13.5 Jun 9, 2023
@qmonnet qmonnet added this to Backport done to v1.13 in 1.13.4 Jun 9, 2023
@pchaigno pchaigno added backport-done/1.12 The backport for Cilium 1.12.x for this PR is done. backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. and removed backport-pending/1.12 backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. labels Jun 9, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.12 to Backport done to v1.12 in 1.12.11 Jun 9, 2023
@qmonnet qmonnet added backport-done/1.11 The backport for Cilium 1.11.x for this PR is done. and removed backport-pending/1.11 labels Jun 14, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.11 to Backport done to v1.11 in 1.11.18 Jun 14, 2023
@qmonnet qmonnet removed this from Needs backport from main in 1.12.10 Jun 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/encryption Impacts encryption support such as IPSec, WireGuard, or kTLS. backport/author The backport will be carried out by the author of the PR. backport-done/1.11 The backport for Cilium 1.11.x for this PR is done. backport-done/1.12 The backport for Cilium 1.12.x for this PR is done. backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. integration/cloud Related to integration with cloud environments such as AKS, EKS, GKE, etc. kind/bug This is a bug in the Cilium logic. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-blocker/1.11 This issue will prevent the release of the next version of Cilium. release-blocker/1.12 This issue will prevent the release of the next version of Cilium. release-note/bug This PR fixes an issue in a previous release of Cilium. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. upgrade-impact This PR has potential upgrade or downgrade impact.
Projects
No open projects
1.11.18
Backport done to v1.11
1.12.11
Backport done to v1.12
1.13.4
Backport done to v1.13
Development

Successfully merging this pull request may close these issues.

None yet

5 participants