
Remove IP filters from initial GC #29696

Merged
merged 2 commits into cilium:main on Feb 20, 2024

Conversation

rsafonseca
Contributor

@rsafonseca rsafonseca commented Dec 7, 2023

Please ensure your pull request adheres to the following guidelines:

  • For first time contributors, read Submitting a pull request
  • All code is covered by unit and/or runtime tests where feasible.
  • All commits contain a well written commit description including a title,
    description and a Fixes: #XXX line if the commit addresses a particular
    GitHub issue.
  • If your commit description contains a Fixes: <commit-id> tag, then
    please add the commit author[s] as reviewer[s] to this issue.
  • All commits are signed off. See the section Developer’s Certificate of Origin
  • Provide a title or release-note blurb suitable for the release notes.
  • Are you a user of Cilium? Please add yourself to the Users doc
  • Thanks for contributing!

Fixes: #29667

Fixes some valid conntrack entries being removed by the initial GC at agent restart.

@rsafonseca rsafonseca requested review from a team as code owners December 7, 2023 13:47
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Dec 7, 2023
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Dec 7, 2023
@ti-mo ti-mo added the release-note/bug This PR fixes an issue in a previous release of Cilium. label Dec 7, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Dec 7, 2023
@ti-mo ti-mo added feature/conntrack needs-backport/1.12 needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch labels Dec 7, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.14.5 Dec 7, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.13.10 Dec 7, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.12.17 Dec 7, 2023
@joestringer
Member

@rsafonseca Thanks for the PR! Could you describe in more specific terms the types of entries that were deleted before and how this proposed change addresses that issue?

How do we ensure that entries for Pods that have been descheduled during cilium-agent outage are removed?

@ti-mo I don't think that we should be backporting this directly to all branches per backport criteria. For the latest release I think it makes sense, but we should also be super careful here because there is a risk of regression if we fail to correctly garbage collect the entries. In the worst case, IPs for stale conntrack entries may be reused and this could have unexpected policy impacts.

@joestringer joestringer removed needs-backport/1.12 needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch labels Dec 7, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed this from Needs backport from main in 1.13.10 Dec 7, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed this from Needs backport from main in 1.12.17 Dec 7, 2023
@rsafonseca
Contributor Author

rsafonseca commented Dec 7, 2023

Hi @joestringer, thanks for taking a look!

I've noted a short description of some affected flows in the linked issue. The reality is that the cleanup is unsafe and has been affecting some of our critical production flows since forever, whenever we need to restart the agent (which can happen for various reasons, like a config change or simply adjusting cpu/mem requests/limits).

How do we ensure that entries for Pods that have been descheduled during cilium-agent outage are removed?

Do we really need to ensure that? Any stale entries shouldn't really cause any problems, and will eventually be removed when they expire. It's much worse to actively delete conntrack entries that are in active use, which is what currently happens

FWIW, I've already built a custom version on top of v1.14.4 and deployed it to all of our prod clusters without observing any problems, and I can confirm that this resolves the problem with active connections caused by agent restarts.

@maintainer-s-little-helper

Commit 2b46c19 does not match "(?m)^Signed-off-by:".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Dec 7, 2023

@joestringer
Member

@rsafonseca I appreciate that this has been a pain point for you, and I'm glad that you're investigating these issues and proposing fixes. I agree with the motivation of your change.

The difficult part here is that changing anything in conntrack can have various unintended impacts, so we need to make these changes thoughtfully and with a comprehensive understanding of the scope of the changes. If we do not discuss how and why the proposed change targets the undesirable behaviour, then this can cause regressions in other areas. Making such a change without documenting the design clearly can also lead to regressions if someone attempts to bring back some of the behaviour that is being removed here.

In the worst case, IPs for stale conntrack entries may be reused and this could have unexpected policy impacts.

I reviewed this possibility and I don't think it can happen: newly created endpoints set e.ctCleaned to false, and runPreCompilationSteps() will call e.scrubIPsInConntrackTable to scrub any entries for the same IP that may still exist in the conntrack table during endpoint creation.
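
For illustration, a minimal sketch of that flow, using the names from this comment (the real types and signatures in Cilium differ):

```go
// Trimmed-down stand-ins, only for illustration.
type Endpoint struct {
	ctCleaned bool
}

func (e *Endpoint) scrubIPsInConntrackTable() {
	// In Cilium this walks the conntrack map and removes entries that
	// reference the endpoint's IPs; omitted here.
}

// runPreCompilationSteps sketches the ordering described above: a freshly
// created endpoint (ctCleaned == false) has its IPs scrubbed from conntrack
// before its datapath is built, so a reused IP cannot inherit stale entries.
func (e *Endpoint) runPreCompilationSteps() {
	if !e.ctCleaned {
		e.scrubIPsInConntrackTable()
		e.ctCleaned = true
	}
	// ... remaining pre-compilation steps ...
}
```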

How do we ensure that entries for Pods that have been descheduled during cilium-agent outage are removed?

Do we really need to ensure that? Any stale entries shouldn't really cause any problems, and will eventually be removed when they expire. It's much worse to actively delete conntrack entries that are in active use, which is what currently happens

The intent behind createGCFilter() is to determine which entries correspond to active IPs and only delete entries that are for endpoints that no longer exist. If the logic inside that filter was working correctly, then you should not experience this bug. Evidently, the logic is not accurate. I agree that the current problem of being overzealous about deleting entries is creating a higher impact than keeping the entries, but that doesn't mean that the solution must ignore which IPs are active or not. I think the solution should consider exactly which entries should be kept and which should be removed, as that keeps the content of the conntrack table as small as possible following startup. This can have impacts on bpf map pressure, CPU utilization to iterate conntrack, and potentially even endpoint creation latency.
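
To make the discussion concrete, here is a rough sketch of the kind of filter being described. This is purely illustrative; the actual createGCFilter() in Cilium has a different signature and handles more cases:

```go
import "net/netip"

// ctTuple is a simplified stand-in for a conntrack map key.
type ctTuple struct {
	srcAddr netip.Addr
	dstAddr netip.Addr
}

// shouldDelete mimics the intent described above: delete an entry during the
// initial scan only when neither side of the tuple maps to an IP considered
// alive on this node (restored endpoint IPs, host IPs, ...).
func shouldDelete(t ctTuple, knownIPs map[netip.Addr]struct{}) bool {
	_, srcKnown := knownIPs[t.srcAddr]
	_, dstKnown := knownIPs[t.dstAddr]
	return !srcKnown && !dstKnown
}
```

Under a filter like this, any entry whose tuple contains no locally-known IP gets deleted on the initial scan.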

At a high level, I think that the problem you are facing is that there are active entries being deleted. What exactly do those entries look like? Are they service entries (pod->svc tuple) or in/out entries (pod->pod tuple)? Have you considered whether to identify those entries and skip deletion for them during the initial scan?

@rsafonseca
Contributor Author

The intent behind createGCFilter() is to determine which entries correspond to active IPs and only delete entries that are for endpoints that no longer exist. If the logic inside that filter was working correctly, then you should not experience this bug.

The logic in that function seems to be working correctly, but the info passed to it is incomplete, as it doesn't take into account all sources of potentially active connections.

I agree that the current problem of being overzealous about deleting entries is creating a higher impact than keeping the entries, but that doesn't mean that the solution must ignore which IPs are active or not. I think the solution should consider exactly which entries should be kept and which should be removed, as that keeps the content of the conntrack table as small as possible following startup.

While this would in theory be possible to implement, it would also be hugely expensive, as it would need to look up every address of every pod in the cluster, which would take a lot of time and resources, especially in large clusters.

This can have impacts on bpf map pressure, CPU utilization to iterate conntrack, and potentially even endpoint creation latency.

Doubtful, since the time for which the pod is down during a restart/crash/update is fairly small, so any stale entries are highly unlikely to have any relevant impact on the points above.

At a high level, I think that the problem you are facing is that there are active entries being deleted. What exactly do those entries look like?

They look like:
Public IP -> Endpoint IP of a pod in another node
or
Host IP -> Endpoint IP of a pod in another node

I don't really feel that this mechanism is worth it, since there's rarely anything to clean after a restart; but even if there is, stale entries would eventually be GC'd anyway once they expire, or the LRU kicks them out.

If you do feel strongly that this mechanism is currently useful, one inexpensive way to keep a similar behavior while avoiding accidentally GC'ing external flows would be to simply look at the CIDR used by the node: only collect entries whose src/dst falls within the node CIDR, and delete only those that match the CIDR but do not match the list of restored endpoints. That would GC entries from any pod that might have been descheduled during the short downtime, without touching anything that doesn't pertain to pods on the node.
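
A rough sketch of that idea, reusing the illustrative ctTuple type and net/netip import from the earlier sketch (the names here are hypothetical and not part of this PR):

```go
// shouldDeleteWithinNodeCIDR restricts the initial GC to entries that involve
// this node's pod CIDR; among those, it deletes only entries whose IPs were
// not restored as local endpoints. Flows that never touch the node CIDR
// (e.g. Public IP -> endpoint IP of a pod on another node) are left alone.
func shouldDeleteWithinNodeCIDR(t ctTuple, nodeCIDR netip.Prefix, restored map[netip.Addr]struct{}) bool {
	if !nodeCIDR.Contains(t.srcAddr) && !nodeCIDR.Contains(t.dstAddr) {
		return false // does not pertain to any pod on this node
	}
	_, srcRestored := restored[t.srcAddr]
	_, dstRestored := restored[t.dstAddr]
	return !srcRestored && !dstRestored
}
```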

@joestringer
Member

They look like:
Public IP -> Endpoint IP of a pod in another node

In which case is this conntrack entry created? Why is the traffic bouncing through a node that doesn't host the target endpoint IP? The most common case for this would be NodePort, but those are exposed on host IPs, so I'd expect the conntrack entries to reflect Public IP -> host IP; in that case the host IP filtering in the existing function should handle it. So presumably there's a different scenario in your environment.

Host IP -> Endpoint IP of a pod in another node

This case should already be handled by the current logic, unless the host IP detection logic is failing to detect a relevant host IP.

If you do feel strongly that this mechanism is currently useful, one inexpensive way to keep a similar behavior while avoiding accidentally GC'ing external flows would be to simply look at the CIDR used by the node: only collect entries whose src/dst falls within the node CIDR, and delete only those that match the CIDR but do not match the list of restored endpoints. That would GC entries from any pod that might have been descheduled during the short downtime, without touching anything that doesn't pertain to pods on the node.

This is an interesting idea, but it relies on each node having a dedicated CIDR. Not all IPAM modes provide this guarantee, so for other IPAM modes there could still be an issue.

@joestringer
Member

/test

@joestringer
Member

Requesting @xmulligan to review for Community since this PR touches the USERS.md file. Not sure why GitHub didn't autoselect someone to review on behalf of that team.

@michi-covalent michi-covalent added this to Needs backport from main in 1.14.8 Feb 13, 2024
@michi-covalent michi-covalent removed this from Needs backport from main in 1.14.7 Feb 13, 2024
@michi-covalent michi-covalent added this to Needs backport from main in 1.15.2 Feb 14, 2024
@michi-covalent michi-covalent removed this from Needs backport from main in 1.15.1 Feb 14, 2024
@joestringer joestringer removed the dont-merge/discussion A discussion is ongoing and should be resolved before merging, regardless of reviews & tests status. label Feb 17, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.14.7 Feb 17, 2024
@borkmann borkmann removed the request for review from ldelossa February 20, 2024 09:57
@joestringer joestringer added this pull request to the merge queue Feb 20, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Feb 20, 2024
Merged via the queue into cilium:main with commit 5276ce4 Feb 20, 2024
62 checks passed
@tklauser tklauser mentioned this pull request Feb 20, 2024
@tklauser tklauser added backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. and removed needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch labels Feb 20, 2024
@tklauser tklauser mentioned this pull request Feb 20, 2024
@tklauser
Member

tklauser commented Feb 20, 2024

FYI, I've hit major merge conflicts backporting this to v1.14 in #30864, so I've skipped the 1.14 backport for now.

@tklauser tklauser removed the needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch label Feb 21, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot removed this from Needs backport from main in 1.14.7 Feb 21, 2024
@github-actions github-actions bot added backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. and removed backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. labels Feb 21, 2024
@tklauser tklauser moved this from Needs backport from main to Backport done to v1.15 in 1.15.2 Feb 21, 2024
@julianwiedmann julianwiedmann added the affects/v1.14 This issue affects v1.14 branch label Feb 26, 2024