Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compare annotations before discarding CiliumNode updates. #25465

Merged
merged 1 commit into from
May 18, 2023

Conversation

LynneD
Copy link

@LynneD LynneD commented May 15, 2023

Currently we discard CiliumNode updates based on DeepEqual and labels. However DeepEqual is set to ignore Annotations, and the wg-pub-key annotation is used to exchange rotated Wireguard keys.

@LynneD LynneD requested a review from a team as a code owner May 15, 2023 21:21
@LynneD LynneD requested a review from tommyp1ckles May 15, 2023 21:21
@maintainer-s-little-helper
Copy link

Commit 0fc487f01be34f75bf5ae6ad24e2f4b8d6753821 does not contain "Signed-off-by".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@maintainer-s-little-helper maintainer-s-little-helper bot added dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels May 15, 2023
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label May 15, 2023
@LynneD
Copy link
Author

LynneD commented May 15, 2023

Hi @aojea, this is the fix for Wiregurad tunnel drop issue.

The wireguard tunnel is broken when any node restart happens. After restart, the restarted node's public key got refreshed but not propagated to other nodes.

The issue is caused by cilium dropping CiliumNode update events when the spec, status and labels between the old and new nodes are the same. Wireguard public key updates are transmitted through annotations.

We want to confirm if this is a known issue since it can be reproduced by restarting any node easily.

Thanks!

@@ -69,7 +69,8 @@ func (k *K8sWatcher) ciliumNodeInit(ciliumNPClient client.Clientset, asyncContro
valid = true
isLocal := k8s.IsLocalCiliumNode(ciliumNode)
if oldCN.DeepEqual(ciliumNode) &&
comparator.MapStringEquals(oldCN.ObjectMeta.Labels, ciliumNode.ObjectMeta.Labels) {
comparator.MapStringEquals(oldCN.ObjectMeta.Labels, ciliumNode.ObjectMeta.Labels) &&
comparator.MapStringEquals(oldCN.ObjectMeta.Annotations, ciliumNode.ObjectMeta.Annotations) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This causes issue in case the wireguard key changes but the rest of the node object not

if pk := n.localNode.WireguardPubKey; pk != "" {
if nodeResource.ObjectMeta.Annotations == nil {
nodeResource.ObjectMeta.Annotations = make(map[string]string)
}
nodeResource.ObjectMeta.Annotations[annotation.WireguardPubKey] = pk
}

since, we don't compare Annotations the event is dropped and not processed

@aojea
Copy link
Contributor

aojea commented May 15, 2023

/assign @aanm @squeed

@LynneD LynneD force-pushed the pr/wireguard-publickey-propagation branch from 0fc487f to d7db202 Compare May 15, 2023 22:56
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label May 15, 2023
@LynneD LynneD force-pushed the pr/wireguard-publickey-propagation branch from d7db202 to b495ea8 Compare May 15, 2023 23:10
@tommyp1ckles
Copy link
Contributor

Thanks for the PR 😄

I wonder if instead of checking all annotation changes, it might make more sense to target the specific wireguard key annotation and just see if that's changed.

It seems like the intention is to avoid unnecessary node updates.

@aojea
Copy link
Contributor

aojea commented May 15, 2023

I wonder if instead of checking all annotation changes, it might make more sense to target the specific wireguard key annotation and just see if that's changed.

in the long term my experience is that new annotations got added but you never remember to update this comparison, it is also coherent with existing comparison with labels

@LynneD LynneD force-pushed the pr/wireguard-publickey-propagation branch from b495ea8 to 52942f6 Compare May 16, 2023 01:04
@LynneD
Copy link
Author

LynneD commented May 16, 2023

I wonder if instead of checking all annotation changes, it might make more sense to target the specific wireguard key annotation and just see if that's changed.

in the long term my experience is that new annotations got added but you never remember to update this comparison, it is also coherent with existing comparison with labels

Hi @tommyp1ckles, I agree with @aojea's point here. Is there a specific concern about checking all annotation changes?

@christarazi
Copy link
Member

/test

@aanm aanm requested a review from gandro May 16, 2023 07:31
@aanm aanm added sig/k8s Impacts the kubernetes API, or kubernetes -> cilium internals translation layers. area/encryption Impacts encryption support such as IPSec, WireGuard, or kTLS. sig/agent Cilium agent related. release-note/bug This PR fixes an issue in a previous release of Cilium. needs-backport/1.12 needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch labels May 16, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label May 16, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.12.10 May 16, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.13.3 May 16, 2023
@aanm aanm added dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. and removed needs-backport/1.12 labels May 16, 2023
@christarazi
Copy link
Member

christarazi commented May 17, 2023

/test

Job 'Cilium-PR-K8s-1.26-kernel-net-next' failed:

Click to show.

Test Name

K8sDatapathServicesTest Checks N/S loadbalancing With host policy Tests NodePort

Failure Output

FAIL: Can not connect to service "tftp://192.168.56.12:31104/hello" from outside cluster (1/10)

Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.26-kernel-net-next/2380/

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.26-kernel-net-next so I can create one.

Then please upload the Jenkins artifacts to that issue.

@LynneD
Copy link
Author

LynneD commented May 17, 2023

Good finds! Could you add #25465 (comment) to the commit message?

Added the comment. Is the the PR ready to merge? We are waiting for the CL to be merged to cherry-pick it to our internal branch.

A PR is ready to merge when all codeowners have approved the PR and the CI is all green. Right now, Martynas is required to approve (I've re-requested him) and I will trigger the CI.

Thanks @christarazi

Copy link
Contributor

@tommyp1ckles tommyp1ckles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @aojea's point here. Is there a specific concern about checking all annotation changes?

Nothing specific - mostly just wondering if it would lead to unnecessary updates. This seems reasonable to me.

@LynneD
Copy link
Author

LynneD commented May 18, 2023

Good finds! Could you add #25465 (comment) to the commit message?

Hi @brb, I've updated the commit message, can you please take another look?

Copy link
Member

@brb brb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@LynneD
Copy link
Author

LynneD commented May 18, 2023

/mlh new-flake Cilium-PR-K8s-1.26-kernel-net-next

@LynneD
Copy link
Author

LynneD commented May 18, 2023

Thanks @christarazi @brb @tommyp1ckles @aojea, I've got enough approvals, and the "k8s-1.26-kernel-net-next" test seems not related to this PR. Can you please take another look and see how I can unblock the merging?

@LynneD LynneD force-pushed the pr/wireguard-publickey-propagation branch from 1d614f3 to d38b6b6 Compare May 18, 2023 06:35
Currently we discard CiliumNode updates based on DeepEqual and labels. However DeepEqual is set to ignore Annotations, and the wg-pub-key annotation is used to exchange rotated Wireguard keys.

The wireguard tunnel is broken when any node restart happens. After restart, the restarted node's public key got refreshed but not propagated to other nodes.

The issue is caused by cilium dropping CiliumNode update events when the spec, status and labels between the old and new nodes are the same. Wireguard public key updates are transmitted through annotations.

Signed-off-by: Lin Dong <lindongchn@gmail.com>
@christarazi christarazi force-pushed the pr/wireguard-publickey-propagation branch from d38b6b6 to 975a0d4 Compare May 18, 2023 18:54
@christarazi
Copy link
Member

/test

@christarazi
Copy link
Member

I've rebased to re-run the tests. Please don't push unless the CI points out a failure and you need to push to address it.

@LynneD
Copy link
Author

LynneD commented May 18, 2023

I've rebased to re-run the tests. Please don't push unless the CI points out a failure and you need to push to address it.

Got it. Thanks @christarazi!

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label May 18, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.13.3 May 18, 2023
@christarazi christarazi merged commit 76eca78 into cilium:main May 18, 2023
58 checks passed
@christarazi
Copy link
Member

Thanks for the PR and your patience!

@LynneD
Copy link
Author

LynneD commented May 18, 2023

Thanks so much @christarazi!

@tklauser tklauser mentioned this pull request May 22, 2023
4 tasks
@tklauser tklauser added backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. and removed needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch labels May 22, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from main to Backport pending to v1.13 in 1.13.3 May 22, 2023
@tklauser tklauser added backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. and removed backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. labels May 26, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.13 to Backport done to v1.13 in 1.13.3 May 26, 2023
@qmonnet qmonnet moved this from Needs backport from main to Backport done to v1.13 in 1.13.4 Jun 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects/v1.10 This issue affects v1.10 branch affects/v1.11 This issue affects v1.11 branch affects/v1.12 This issue affects v1.12 branch affects/v1.13 This issue affects v1.13 branch area/encryption Impacts encryption support such as IPSec, WireGuard, or kTLS. backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. kind/community-contribution This was a contribution made by a community member. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium. sig/agent Cilium agent related. sig/k8s Impacts the kubernetes API, or kubernetes -> cilium internals translation layers.
Projects
No open projects
1.13.3
Backport done to v1.13
1.13.4
Backport done to v1.13
Development

Successfully merging this pull request may close these issues.

None yet

9 participants