-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[backport-v1.13] ipcache: don't short-circuit InjectLabels if source differs #25077
[backport-v1.13] ipcache: don't short-circuit InjectLabels if source differs #25077
Conversation
[ upstream commit f52b4c9 ] This makes the mock allocator work similarly to the "real" one: if an ID is requested and it is not in use, then accept it. Signed-off-by: Casey Callendrello <cdc@isovalent.com>
This code has been tested by a few end-users via a hotfix branch, and they reported that it fixed the bug in question. |
/test-backport-1.13 Job 'Cilium-PR-K8s-1.24-kernel-5.4' failed: Click to show.Test Name
Failure Output
Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.24-kernel-5.4/1837/ If it is a flake and a GitHub issue doesn't already exist to track it, comment Job 'Cilium-PR-K8s-1.25-kernel-4.19' failed: Click to show.Test Name
Failure Output
Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.25-kernel-4.19/1851/ If it is a flake and a GitHub issue doesn't already exist to track it, comment Job 'Cilium-PR-K8s-1.16-kernel-4.19' failed: Click to show.Test Name
Failure Output
Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.16-kernel-4.19/1087/ If it is a flake and a GitHub issue doesn't already exist to track it, comment Job 'Cilium-PR-K8s-1.21-kernel-4.19' failed: Click to show.Test Name
Failure Output
Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.21-kernel-4.19/13/ If it is a flake and a GitHub issue doesn't already exist to track it, comment Job 'Cilium-PR-K8s-1.23-kernel-4.19' failed: Click to show.Test Name
Failure Output
Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.23-kernel-4.19/737/ If it is a flake and a GitHub issue doesn't already exist to track it, comment Job 'Cilium-PR-K8s-1.22-kernel-4.19' failed: Click to show.Test Name
Failure Output
Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.22-kernel-4.19/1632/ If it is a flake and a GitHub issue doesn't already exist to track it, comment Job 'Cilium-PR-K8s-1.18-kernel-4.19' failed: Click to show.Test Name
Failure Output
Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.18-kernel-4.19/13/ If it is a flake and a GitHub issue doesn't already exist to track it, comment Job 'Cilium-PR-K8s-1.20-kernel-4.19' failed: Click to show.Test Name
Failure Output
Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.20-kernel-4.19/1906/ If it is a flake and a GitHub issue doesn't already exist to track it, comment Job 'Cilium-PR-K8s-1.19-kernel-4.19' failed: Click to show.Test Name
Failure Output
Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.19-kernel-4.19/12/ If it is a flake and a GitHub issue doesn't already exist to track it, comment Job 'Cilium-PR-K8s-1.24-kernel-4.19' has 1 failure but they might be new flake since it also hit 1 known flake: #20723 (87.00% similarity) Job 'Cilium-PR-K8s-1.26-kernel-net-next' failed: Click to show.Test Name
Failure Output
Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.26-kernel-net-next/1914/ If it is a flake and a GitHub issue doesn't already exist to track it, comment |
Please use the standard scripts for backporting. The PR description is missing the |
Backport commits look identical to upstream which is good, but CI is failing in etcd cases due to this concerning log message:
It looks like there was some other internal change between |
I mostly did, except for forgetting to do |
This is outputting warning messages in the etcd check, I'm investigating. We had this issue in |
[ upstream commit 8d3a498 ] InjectLabels is one of the functions responsible for synchronizing the ipcache metadata store and ip store. As such, it shouldn't shortcut when the numeric identity is the same, but the source is different; this means that an update to the ipcache isn't complete. This can happen, for example, when there are two identities for the same IP, which can happen on daemon restart whenever a CIDR is referenced. Fixes: cilium#24502 Signed-off-by: Casey Callendrello <cdc@isovalent.com>
Aha, found the issue - the semantics of
|
03d3e30
to
95aaca8
Compare
It should be enough to just update the PR description as per the guide here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Latest diff LGTM.
/test-backport-1.13 Job 'Cilium-PR-K8s-1.26-kernel-net-next' failed: Click to show.Test Name
Failure Output
Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.26-kernel-net-next/1921/ If it is a flake and a GitHub issue doesn't already exist to track it, comment Then please upload the Jenkins artifacts to that issue. |
PR description fixed, sorry for the oversight. |
|
/test-1.17-4.19 |
This code doesn't directly touch egress gateways, though I make no promises that they're unrelated. I'm investigating. |
marking as do-not-merge and triggering test again, I want to run the test again and get some more signal |
/test-1.26-net-next Result: vagrant failure to boot. |
/test-1.26-net-next Result: Failed: |
/test-only --focus="K8sDatapathEgressGatewayTest" --kernel_version=net-next --k8s_version=1.26 Result: Passed. That's a good sign. |
Dug in to the failed sysdump with @jibi, and we found that, for some unknown reason, a pod was missing from the set of pods to which egress policy applied:
should have 6 lines, including two for 10.0.1.140. It doesn't. There are no error messages, and the datapath is directly from k8s -> egressgateway/manager.go; no detours through the ipcache or identity allocator. At this point, I'm fairly convinced my PR didn't cause this:
|
Doing a quick scan of the Jenkins logs, it seems like this test has been failing somewhat frequently. I don't see a flake issue, but I we can safely say this is a flake. |
At this point, I think this PR is OK to merge. The failing tests:
|
Backporting
InjectLabels is one of the functions responsible for synchronizing the ipcache metadata store and ip store. As such, it shouldn't shortcut when the numeric identity is the same, but the source is different; this means that an update to the ipcache isn't complete.
This can happen, for example, when there are two identities for the same IP, which can happen on daemon restart whenever a CIDR is referenced.
Signed-off-by: Casey Callendrello cdc@isovalent.com
Once this PR is merged, you can update the PR labels via: