
Remove IP filters from initial GC #29696

Merged
merged 2 commits into cilium:main on Feb 20, 2024

Conversation

rsafonseca
Contributor

@rsafonseca rsafonseca commented Dec 7, 2023

Please ensure your pull request adheres to the following guidelines:

  • For first time contributors, read Submitting a pull request
  • All code is covered by unit and/or runtime tests where feasible.
  • All commits contain a well written commit description including a title,
    description and a Fixes: #XXX line if the commit addresses a particular
    GitHub issue.
  • If your commit description contains a Fixes: <commit-id> tag, then
    please add the commit author[s] as reviewer[s] to this issue.
  • All commits are signed off. See the section Developer’s Certificate of Origin
  • Provide a title or release-note blurb suitable for the release notes.
  • Are you a user of Cilium? Please add yourself to the Users doc
  • Thanks for contributing!

Fixes: #29667

Fixes some valid conntrack entries being removed by the initial GC at agent restart.

@rsafonseca rsafonseca requested review from a team as code owners December 7, 2023 13:47
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Dec 7, 2023
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Dec 7, 2023
@ti-mo ti-mo added the release-note/bug This PR fixes an issue in a previous release of Cilium. label Dec 7, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Dec 7, 2023
@ti-mo ti-mo added feature/conntrack needs-backport/1.12 needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch labels Dec 7, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.14.5 Dec 7, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.13.10 Dec 7, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.12.17 Dec 7, 2023
@joestringer
Member

@rsafonseca Thanks for the PR! Could you describe in more specific terms the types of entries that were deleted before and how this proposed change addresses that issue?

How do we ensure that entries for Pods that have been descheduled during cilium-agent outage are removed?

@ti-mo I don't think that we should be backporting this directly to all branches per backport criteria. For the latest release I think it makes sense, but we should also be super careful here because there is a risk of regression if we fail to correctly garbage collect the entries. In the worst case, IPs for stale conntrack entries may be reused and this could have unexpected policy impacts.

@joestringer joestringer removed needs-backport/1.12 needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch labels Dec 7, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed this from Needs backport from main in 1.13.10 Dec 7, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed this from Needs backport from main in 1.12.17 Dec 7, 2023
@rsafonseca
Contributor Author

rsafonseca commented Dec 7, 2023

Hi @joestringer, thanks for taking a look!

I've noted a short description of some affected flows in the linked issue. The reality is that the cleanup is unsafe and has been affecting some of our critical production flows since forever, whenever we need to restart the agent (which can happen for various reasons, like a config change or simply adjusting cpu/mem requests/limits).

How do we ensure that entries for Pods that have been descheduled during cilium-agent outage are removed?

Do we really need to ensure that? Any stale entries shouldn't really cause any problems, and will eventually be removed when they expire. It's much worse to actively delete conntrack entries that are in active use, which is what currently happens

FWIW, I've already built a custom version on top of v1.14.4 and deployed it to all of our prod clusters without observing any problems, and I can confirm that this resolves the problem with active connections caused by agent restarts.

@maintainer-s-little-helper

Commit 2b46c19 does not match "(?m)^Signed-off-by:".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Dec 7, 2023

@joestringer
Member

@rsafonseca I appreciate that this has been a pain point for you, and I'm glad that you're investigating these issues and proposing fixes. I agree with the motivation of your change.

The difficult part here is that changing anything in conntrack can have various unintended impacts, so we need to make these changes thoughtfully and with a comprehensive understanding of the scope of the changes. If we do not discuss how and why the proposed change targets the undesirable behaviour, then this can cause regressions in other areas. Making such a change without documenting the design clearly can also lead to regressions if someone attempts to bring back some of the behaviour that is being removed here.

In the worst case, IPs for stale conntrack entries may be reused and this could have unexpected policy impacts.

I reviewed this possibility and I don't think it can happen: newly created endpoints set e.ctCleaned to false, and runPreCompilationSteps() will call e.scrubIPsInConntrackTable to scrub any entries for the same IP that may still exist in the conntrack table during endpoint creation.
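
For illustration, a minimal sketch of that flow, using the names from this comment (the real types and signatures in Cilium differ):

```go
// Trimmed-down stand-ins, only for illustration.
type Endpoint struct {
	ctCleaned bool
}

func (e *Endpoint) scrubIPsInConntrackTable() {
	// In Cilium this walks the conntrack map and removes entries that
	// reference the endpoint's IPs; omitted here.
}

// runPreCompilationSteps sketches the ordering described above: a freshly
// created endpoint (ctCleaned == false) has its IPs scrubbed from conntrack
// before its datapath is built, so a reused IP cannot inherit stale entries.
func (e *Endpoint) runPreCompilationSteps() {
	if !e.ctCleaned {
		e.scrubIPsInConntrackTable()
		e.ctCleaned = true
	}
	// ... remaining pre-compilation steps ...
}
```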

How do we ensure that entries for Pods that have been descheduled during cilium-agent outage are removed?

Do we really need to ensure that? Any stale entries shouldn't really cause any problems, and will eventually be removed when they expire. It's much worse to actively delete conntrack entries that are in active use, which is what currently happens

The intent behind createGCFilter() is to determine which entries correspond to active IPs and only delete entries that are for endpoints that no longer exist. If the logic inside that filter was working correctly, then you should not experience this bug. Evidently, the logic is not accurate. I agree that the current problem of being overzealous about deleting entries is creating a higher impact than keeping the entries, but that doesn't mean that the solution must ignore which IPs are active or not. I think the solution should consider exactly which entries should be kept and which should be removed, as that keeps the content of the conntrack table as small as possible following startup. This can have impacts on bpf map pressure, CPU utilization to iterate conntrack, and potentially even endpoint creation latency.
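
To make the discussion concrete, here is a rough sketch of the kind of filter being described. This is purely illustrative; the actual createGCFilter() in Cilium has a different signature and handles more cases:

```go
import "net/netip"

// ctTuple is a simplified stand-in for a conntrack map key.
type ctTuple struct {
	srcAddr netip.Addr
	dstAddr netip.Addr
}

// shouldDelete mimics the intent described above: delete an entry during the
// initial scan only when neither side of the tuple maps to an IP considered
// alive on this node (restored endpoint IPs, host IPs, ...).
func shouldDelete(t ctTuple, knownIPs map[netip.Addr]struct{}) bool {
	_, srcKnown := knownIPs[t.srcAddr]
	_, dstKnown := knownIPs[t.dstAddr]
	return !srcKnown && !dstKnown
}
```

Under a filter like this, any entry whose tuple contains no locally-known IP gets deleted on the initial scan.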

At a high level, I think that the problem you are facing is that there are active entries being deleted. What exactly do those entries look like? Are they service entries (pod->svc tuple) or in/out entries (pod->pod tuple)? Have you considered whether to identify those entries and skip deletion for them during the initial scan?

@rsafonseca
Contributor Author

The intent behind createGCFilter() is to determine which entries correspond to active IPs and only delete entries that are for endpoints that no longer exist. If the logic inside that filter was working correctly, then you should not experience this bug.

The logic in that function seems to be working correctly, but the info passed to it is incomplete, as it doesn't take into account all sources of potentially active connections.

I agree that the current problem of being overzealous about deleting entries is creating a higher impact than keeping the entries, but that doesn't mean that the solution must ignore which IPs are active or not. I think the solution should consider exactly which entries should be kept and which should be removed, as that keeps the content of the conntrack table as small as possible following startup.

While this would in theory be possible to implement, it would also be hugely expensive, as it would need to look up every address of every pod in the cluster, which would take a lot of time and resources, especially in large clusters.

This can have impacts on bpf map pressure, CPU utilization to iterate conntrack, and potentially even endpoint creation latency.

Doubtful, since the time for which the pod is down during a restart/crash/update is fairly small, so any stale entries are highly unlikely to have any relevant impact on the points above.

At a high level, I think that the problem you are facing is that there are active entries being deleted. What exactly do those entries look like?

They look like:
Public IP -> Endpoint IP of a pod in another node
or
Host IP -> Endpoint IP of a pod in another node

I don't really feel that this mechanism is worth it, since there's rarely anything to clean after a restart; but even if there is, stale entries would eventually be GC'd anyway once they expire, or the LRU kicks them out.

If you do feel strongly that this mechanism is currently useful, one inexpensive way to keep a similar behavior while avoiding accidentally GC'ing external flows would be to simply look at the CIDR used by the node: only collect entries whose src/dst falls within the node CIDR, and delete only those that match the CIDR but do not match the list of restored endpoints. That would GC entries from any pod that might have been descheduled during the short downtime, without touching anything that doesn't pertain to pods on the node.
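
A rough sketch of that idea, reusing the illustrative ctTuple type and net/netip import from the earlier sketch (the names here are hypothetical and not part of this PR):

```go
// shouldDeleteWithinNodeCIDR restricts the initial GC to entries that involve
// this node's pod CIDR; among those, it deletes only entries whose IPs were
// not restored as local endpoints. Flows that never touch the node CIDR
// (e.g. Public IP -> endpoint IP of a pod on another node) are left alone.
func shouldDeleteWithinNodeCIDR(t ctTuple, nodeCIDR netip.Prefix, restored map[netip.Addr]struct{}) bool {
	if !nodeCIDR.Contains(t.srcAddr) && !nodeCIDR.Contains(t.dstAddr) {
		return false // does not pertain to any pod on this node
	}
	_, srcRestored := restored[t.srcAddr]
	_, dstRestored := restored[t.dstAddr]
	return !srcRestored && !dstRestored
}
```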

@joestringer
Member

They look like:
Public IP -> Endpoint IP of a pod in another node

In which case is this conntrack entry created? Why is the traffic bouncing through a node that doesn't host the target endpoint IP? The most common case for this would be NodePort, but those are exposed on host IPs, so I'd expect the conntrack entries to reflect Public IP -> host IP; in that case the host IP filtering in the existing function should handle it. So presumably there's a different scenario in your environment.

Host IP -> Endpoint IP of a pod in another node

This case should already be handled by the current logic, unless the host IP detection logic is failing to detect a relevant host IP.

If you do feel strongly that this mechanism is currently useful, one inexpensive way to keep a similar behavior while avoiding accidentally GC'ing external flows would be to simply look at the CIDR used by the node: only collect entries whose src/dst falls within the node CIDR, and delete only those that match the CIDR but do not match the list of restored endpoints. That would GC entries from any pod that might have been descheduled during the short downtime, without touching anything that doesn't pertain to pods on the node.

This is an interesting idea, but it relies on each node having a dedicated CIDR. Not all IPAM modes provide this guarantee, so for other IPAM modes there could still be an issue.

@joestringer
Member

/test

@joestringer
Member

Requesting @xmulligan to review for Community since this PR touches the USERS.md file. Not sure why GitHub didn't autoselect someone to review on behalf of that team.

@michi-covalent michi-covalent added this to Needs backport from main in 1.14.8 Feb 13, 2024
@michi-covalent michi-covalent removed this from Needs backport from main in 1.14.7 Feb 13, 2024
@michi-covalent michi-covalent added this to Needs backport from main in 1.15.2 Feb 14, 2024
@michi-covalent michi-covalent removed this from Needs backport from main in 1.15.1 Feb 14, 2024
@joestringer joestringer removed the dont-merge/discussion A discussion is ongoing and should be resolved before merging, regardless of reviews & tests status. label Feb 17, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.14.7 Feb 17, 2024
@borkmann borkmann removed the request for review from ldelossa February 20, 2024 09:57
@joestringer joestringer added this pull request to the merge queue Feb 20, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Feb 20, 2024
Merged via the queue into cilium:main with commit 5276ce4 Feb 20, 2024
62 checks passed
@tklauser tklauser mentioned this pull request Feb 20, 2024
@tklauser tklauser added backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. and removed needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch labels Feb 20, 2024
@tklauser tklauser mentioned this pull request Feb 20, 2024
@tklauser
Member

tklauser commented Feb 20, 2024

FYI, I've hit major merge conflicts backporting this to v1.14 in #30864, so I've skipped the 1.14 backport for now.

@tklauser tklauser removed the needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch label Feb 21, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot removed this from Needs backport from main in 1.14.7 Feb 21, 2024
@github-actions github-actions bot added backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. and removed backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. labels Feb 21, 2024
@tklauser tklauser moved this from Needs backport from main to Backport done to v1.15 in 1.15.2 Feb 21, 2024
@julianwiedmann julianwiedmann added the affects/v1.14 This issue affects v1.14 branch label Feb 26, 2024