v1.13 Backports 2023-05-02 #25223
Conversation
[ upstream commit 32473b9 ] Log the number of backends identified as leaked, and hence skipped and deleted, during restore. This is useful for debugging when debug logs are not enabled. Signed-off-by: Aditi Ghag <aditi@cilium.io> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 662c96b ] Log orphan backends that are not associated with any services, and cleaned up post agent restart. This can be useful for debugging when debug logs are not enabled. Signed-off-by: Aditi Ghag <aditi@cilium.io> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit c32b001 ] During backend restore, the agent previously released a backend ID in case of conflict, even though the ID being released had already been restored for another backend. This released conflicted IDs from the ID allocator while still leaving the backend entries in the backends map, so the allocator could assign the released ID to yet another backend once the restore completed. This commit removes releasing conflicted backend IDs. The proper clean-up happens in the cases where backends are deleted as duplicates or orphans, i.e., when they are not associated with any service. Fixes: 51b62f5 ("service: Add methods to acquire backend ID for svc v2") Signed-off-by: Aditi Ghag <aditi@cilium.io> Signed-off-by: Jussi Maki <jussi@isovalent.com>
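The allocator bug described above can be modeled with a short Go sketch. This is a hypothetical illustration, not Cilium's actual allocator code: the idAllocator type and its methods are invented for the example. It shows why releasing a conflicted ID lets the allocator hand the same ID out twice.

```go
// Hypothetical model of the backend-ID conflict: releasing an ID that
// another restored backend already owns frees it for re-allocation.
package main

import "fmt"

// idAllocator tracks which backend IDs are in use.
type idAllocator struct{ acquired map[int]bool }

func newIDAllocator() *idAllocator { return &idAllocator{acquired: map[int]bool{}} }

// restore marks an ID as in use; returns false on conflict.
func (a *idAllocator) restore(id int) bool {
	if a.acquired[id] {
		return false // conflict: ID already restored for another backend
	}
	a.acquired[id] = true
	return true
}

func (a *idAllocator) release(id int) { delete(a.acquired, id) }

// allocate hands out the lowest free ID.
func (a *idAllocator) allocate() int {
	for id := 1; ; id++ {
		if !a.acquired[id] {
			a.acquired[id] = true
			return id
		}
	}
}

func main() {
	a := newIDAllocator()
	a.restore(1)       // backend A restores ID 1
	if !a.restore(1) { // backend B conflicts on ID 1
		a.release(1) // old (buggy) behavior: release the conflicted ID
	}
	// The allocator now considers ID 1 free and assigns it again, even
	// though backend A still occupies it in the backends map.
	fmt.Println("next allocated ID:", a.allocate()) // prints 1: a duplicate
}
```

With the fix, the conflicted ID is simply not released, so a later allocate() would return 2.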
[ upstream commit ec8d3ce ] This commit revises the fix made in commit 5311f81 that handles leaked backends. Consider the following leaked backend scenarios, where (1) and (2a) were previously handled:

(1) Backend entries leaked, no duplicates: previously, such backends would be deleted as orphan backends after sync with the kubernetes api server.
(2) Backend entries leaked, with duplicates:
    (a) A backend with an overlapping L3nL4Addr hash is associated with service(s). Sequence of events: backends were leaked prior to agent restart, but there was at least one service that the backend hash is associated with; s.backendByHash has a non-zero reference count for the overlapping L3nL4Addr hash.
    (b) None of the backends are associated with services. Sequence of events: all the services these backends were associated with were deleted prior to agent restart; s.backendByHash has no entry for the backends' hash.

This commit updates the logic to handle all the scenarios in one place. Fixes: b79a4a5 ("pkg/service: Gracefully terminate service backends") Signed-off-by: Aditi Ghag <aditi@cilium.io> Signed-off-by: Jussi Maki <jussi@isovalent.com>
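A minimal Go sketch of the unified clean-up described above. This is a hypothetical model, not the actual Cilium implementation: the backend struct, cleanupRestored function, and refCount map are invented stand-ins for s.backendByHash and friends. It keeps one backend per referenced hash, deletes leaked duplicates, and drops backends no service references.

```go
// Hypothetical sketch of the restore clean-up: duplicates of a kept
// hash and backends with zero service references are both deleted.
package main

import (
	"fmt"
	"sort"
)

type backend struct {
	id   int
	hash string // stands in for the L3nL4Addr hash
}

// cleanupRestored returns the IDs of backends to delete: duplicates of
// an already-kept hash, plus backends whose hash has no service refs.
func cleanupRestored(restored []backend, refCount map[string]int) []int {
	kept := map[string]bool{}
	var toDelete []int
	for _, b := range restored {
		switch {
		case kept[b.hash]:
			toDelete = append(toDelete, b.id) // scenario 2: leaked duplicate
		case refCount[b.hash] == 0:
			toDelete = append(toDelete, b.id) // scenario 1/2b: orphan backend
		default:
			kept[b.hash] = true // scenario 2a: referenced, keep one copy
		}
	}
	sort.Ints(toDelete)
	return toDelete
}

func main() {
	restored := []backend{
		{1, "10.0.0.1:80"}, {2, "10.0.0.1:80"}, // leaked duplicate, referenced
		{3, "10.0.0.2:80"}, // orphan: its services were deleted before restart
	}
	refs := map[string]int{"10.0.0.1:80": 1}
	fmt.Println("delete:", cleanupRestored(restored, refs)) // delete: [2 3]
}
```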
[ upstream commit e1a48bb ] The commit adds a unit test case to validate various leaked backends scenarios handled in restore services. Signed-off-by: Aditi Ghag <aditi@cilium.io> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit b0ada08 ] The commit documents the known service backends leak issue along with the cilium versions that have the fixes. Signed-off-by: Aditi Ghag <aditi@cilium.io> Signed-off-by: Jussi Maki <jussi@isovalent.com>
Looks good for my changes ✅
@gentoo-root skipped your PR as the egress SNAT tests didn't apply cleanly due to #24849 being only partially backported. Might be trivial to fix, but wasn't sure. Could you take a look at backporting it separately?
My changes LGTM, thanks.
Thanks @joamaki my patch LGTM.
@@ -889,7 +889,7 @@ hubble:
      - sourceLabels:
          - __meta_kubernetes_pod_node_name
        targetLabel: node
        replacement: ${1}
Something is probably broken in my local setup, as this differs from what CI comes up with. Will try to figure out what's up.
`envsubst -no-digit` fixes it. Seems like something changed in newer versions of envsubst, and CI was probably using a very old version. I'll look into this issue in a separate PR and see what we can do about it.
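To illustrate why a Prometheus relabel value like `${1}` gets mangled, here is a small Go model of the suspected behavior: expanding digit-named variables such as `${1}` (usually to the empty string, since no such environment variable exists) versus a `-no-digit` mode that leaves them verbatim. The regex and the substitute function are assumptions for illustration, not envsubst's real parser.

```go
// Toy model of envsubst's ${VAR} expansion, with an optional
// "no-digit" mode that leaves digit-named references like ${1} alone.
package main

import (
	"fmt"
	"os"
	"regexp"
)

var varRef = regexp.MustCompile(`\$\{([A-Za-z0-9_]+)\}`)
var digitsOnly = regexp.MustCompile(`^[0-9]+$`)

func substitute(s string, noDigit bool) string {
	return varRef.ReplaceAllStringFunc(s, func(m string) string {
		name := varRef.FindStringSubmatch(m)[1]
		if noDigit && digitsOnly.MatchString(name) {
			return m // no-digit mode: keep ${1} verbatim
		}
		return os.Getenv(name) // unset variables expand to ""
	})
}

func main() {
	line := "replacement: ${1}"
	fmt.Printf("default:  %q\n", substitute(line, false)) // ${1} expands to ""
	fmt.Printf("no-digit: %q\n", substitute(line, true))  // ${1} is preserved
}
```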
[ upstream commit 9e83a6f ] Cilium is currently affected by a known bug (cilium#24692) when NodePorts are handled by the KPR implementation, which occurs when the same NodePort is used both in the local and the remote cluster. This causes all traffic targeting that NodePort to be redirected to a local backend, regardless of whether the destination node belongs to the local or the remote cluster. This affects also the clustermesh-apiserver NodePort service, which is configured by default with a fixed port. Hence, let's add a warning message to the corresponding values file setting. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 11e1bcc ] This adds a small doc for the version matrix between Cilium and Cilium Envoy versions, which is useful with the upcoming work to move the Envoy proxy out of the Cilium agent container. Co-authored-by: ZSC <zacharysarah@users.noreply.github.com> Signed-off-by: Tam Mach <sayboras@yahoo.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 371a0e0 ] 86aa873 ("controller: add new ControllerParam CancelDoFuncOnUpdate") introduced the possibility of canceling the context of the controller's DoFunc when the controller is updated, which is then reset. There's an extremely rare chance that this context is accessed while being overwritten. Hence, let's grab a local copy while holding the mutex. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 6473311 ] The Controller implementation currently supports the possibility of specifying a context among the parameters, which is propagated to the DoFunc. Though, once the context is canceled, the DoFunc is likely to never succeed again (unless it ignores it). Hence, let's pause the controller until it is either updated (refreshing the context) or properly stopped. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 1ca6185 ] f487647 ("operator: Wait for informers to shut down when stopping") introduced a WaitGroup to wait for the shutdown of all goroutines started by `legacy.onStart`. When running in kvstore mode, the operator spawns one goroutine to synchronize global services to the kvstore, continuously reading from a channel. Since that channel is never closed, the goroutine does not terminate, preventing the correct shutdown of the operator. This commit fixes it by propagating the context and stopping the goroutine when it is canceled. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit db3e015 ] Before this patch, we would hit a controller-gen[1] bug when the temporary file would be of the form tmp.0oXXXXXX. This patch uses a custom mktemp template that will not trigger the bug. [1]: kubernetes-sigs/controller-tools#734 Signed-off-by: Alexandre Perrin <alex@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 9a38aec ] We've been distributing ARM architecture images for Cilium for almost two years, but neglected to mention this up front in the system requirements or the main docs page. Add this to the docs. Signed-off-by: Joe Stringer <joe@cilium.io> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 4cd06e4 ] Cilium follows a major.minor.patch version scheme. The section in the README about the end of support is incorrect; this PR fixes it. Signed-off-by: ayesha khaliq <ayeshakhaliqrana@gmail.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit eff26cf ] We have observed that the same packet can be handled multiple times by bpf_host's to-netdev. This can happen when to-netdev is attached both to a bridge and to an outgoing netdev that is attached to the bridge. This can result, e.g., in multiple unnecessary SNATs for the same packet, which can break the host-firewall (the host-firewall for a reply to such a packet is invoked after only the first rev-SNAT). To fix that particular case, set the SNAT-done flag (an SKB mark).

Tested manually. On the kind-worker node (172.18.0.3):

    ip link add br0 type bridge
    ip addr add 172.18.0.3/16 dev br0
    ip addr del 172.18.0.3/16 dev eth0
    ip link set dev br0 up
    ip link set dev eth0 master br0
    ip route replace 0.0.0.0/0 via 172.18.0.1

Make sure that Cilium attached to both:

    cilium status | grep KubeProxy
    KubeProxyReplacement:   Strict   [br0 172.18.0.3, eth0 172.18.0.3]

Then start any pod on kind-worker, run tcpdump on kind-worker, and curl 1.1.1.1 from that pod:

    tcpdump -i any -n 'dst port 80'
    lxc2663158aafe1 In  IP 10.244.1.161.43710 > 1.1.1.1.80: Flags [S]
    br0             Out IP 172.18.0.3.43710 > 1.1.1.1.80: Flags [S]
    eth0            Out IP 172.18.0.3.43710 > 1.1.1.1.80: Flags [S]

Redo the same test w/o the fix. The relevant tcpdump output:

    tcpdump -i any -n 'dst host 1.1.1.1'
    lxcbce3c2f349e1 In  IP 10.244.1.120.57408 > 1.1.1.1.80: Flags [S]
    br0             Out IP 172.18.0.3.57408 > 1.1.1.1.80: Flags [S]
    eth0            Out IP 172.18.0.3.56763 > 1.1.1.1.80: Flags [S]   <-- 2nd SNAT!

Suggested-by: Julian Wiedmann <jwi@isovalent.com> Signed-off-by: Martynas Pumputis <m@lambda.lt> Signed-off-by: Jussi Maki <jussi@isovalent.com>
Force-pushed from d3cdcb5 to 57091a3.
/test-backport-1.13

Job 'Cilium-PR-K8s-1.20-kernel-4.19' failed:
Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.20-kernel-4.19/1920/
If it is a flake and a GitHub issue doesn't already exist to track it, comment, then please upload the Jenkins artifacts to that issue.

Job 'Cilium-PR-K8s-1.26-kernel-net-next' failed:
Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.26-kernel-net-next/2029/
If it is a flake and a GitHub issue doesn't already exist to track it, comment, then please upload the Jenkins artifacts to that issue.
My commits look good. Thanks!
For #25174: 👍
Looks good for my commits. Thanks!
/test-1.20-4.19
/test-1.26-net-next
PRs skipped due to conflicts:
Once this PR is merged, you can update the PR labels via:
or with