Remove IP filters from initial GC #29696
Conversation
Force-pushed from fd9bffd to 3e2420e
@rsafonseca Thanks for the PR! Could you describe in more specific terms the types of entries that were deleted before and how this proposed change addresses that issue? How do we ensure that entries for Pods that have been descheduled during cilium-agent outage are removed?

@ti-mo I don't think that we should be backporting this directly to all branches per backport criteria. For the latest release I think it makes sense, but we should also be super careful here because there is a risk of regression if we fail to correctly garbage collect the entries. In the worst case, IPs for stale conntrack entries may be reused and this could have unexpected policy impacts.
Hi @joestringer, thanks for taking a look! I've noted a short description of some affected flows in the linked issue. The reality is that the cleanup is unsafe and has been affecting some of our critical production flows since forever, whenever we need to restart the agent (which could be for various reasons, like a config change or simply adjusting CPU/memory requests/limits).
Do we really need to ensure that? Any stale entries shouldn't really cause any problems, and will eventually be removed when they expire. It's much worse to actively delete conntrack entries that are in active use, which is what currently happens.

FWIW, I've already built a custom version on top of v1.14.4 and deployed it to all of our prod clusters, have not observed any problems, and can confirm that this resolves the problem with active connections caused by agent restarts.
Force-pushed from 585e4ba to d81b2bf
Commit 2b46c19 does not match "(?m)^Signed-off-by:". Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin
Force-pushed from 2b46c19 to e35206f
Commit 2b46c19 does not match "(?m)^Signed-off-by:". Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin
@rsafonseca I appreciate that this has been a pain point for you, and I'm glad that you're investigating these issues and proposing fixes. I agree with the motivation of your change. The difficult part here is that changing anything with conntrack can have various unintended impacts, so we need to make these changes thoughtfully and with a comprehensive understanding of the scope of the changes. If we do not discuss how and why the proposed change targets the undesirable behaviour, then this can cause regressions in other areas. Making such a change without documenting the design clearly can also lead to regressions if someone attempts to bring back some of the behaviour that is being removed here.
I reviewed this possibility and I don't think this is possible, as newly created endpoints set
The intent behind …

At a high level, I think that the problem you are facing is that there are active entries being deleted. What exactly do those entries look like? Are they service entries (pod->svc tuple) or in/out entries (pod->pod tuple)? Have you considered whether to identify those entries and skip deletion for them during the initial scan?
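For concreteness, here is a rough Go sketch of the kind of "skip entries for live endpoints" filter being discussed here. This is not the actual cilium-agent code; the names `ctTuple`, `shouldPurge`, and `restored` are hypothetical placeholders:

```go
package main

import (
	"fmt"
	"net/netip"
)

// ctTuple is a hypothetical flattened view of a conntrack key: the
// source/destination pair seen during the initial GC scan.
type ctTuple struct {
	src, dst netip.Addr
}

// shouldPurge sketches the idea of skipping entries for endpoints that
// survived the restart: an entry is only purged when neither side of the
// tuple belongs to a restored (still-present) local endpoint IP.
func shouldPurge(t ctTuple, restored map[netip.Addr]struct{}) bool {
	_, srcLive := restored[t.src]
	_, dstLive := restored[t.dst]
	return !srcLive && !dstLive
}

func main() {
	restored := map[netip.Addr]struct{}{
		netip.MustParseAddr("10.0.1.5"): {}, // a pod that survived the restart
	}
	entry := ctTuple{
		src: netip.MustParseAddr("10.0.1.5"),
		dst: netip.MustParseAddr("10.0.2.9"),
	}
	fmt.Println("purge:", shouldPurge(entry, restored)) // purge: false
}
```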
The logic in that function seems to be working correctly, but the information passed to it is incomplete, as it doesn't take into account all sources of potentially active connections.
While this would in theory be possible to implement, it would also be hugely expensive, as it would need to look up every address of every pod in the cluster, which would take significant time and resources, especially in large clusters.
Doubtful, since the time for which a pod is down during a restart/crash/update is fairly small, so any stale entries are highly unlikely to have any meaningful impact on the points above.
They look like: …

I don't really feel that this mechanism is worth it, since there's rarely anything to clean after a restart, but even if there is, the entries would eventually be GC'd anyway once they expire, or the LRU kicks them out.

If you do feel strongly that this mechanism is currently useful, one inexpensive way we could implement it, keeping similar behavior while avoiding accidentally GC'ing external flows, would be to look at the CIDR used by the node and only GC entries whose src/dst falls within the node CIDR, rather than matching against the list of restored endpoints. That would GC entries from any pod that might have been de-scheduled during the short downtime, without touching anything that doesn't pertain to pods on this node.
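A minimal sketch of that node-CIDR idea, assuming the node has a dedicated pod CIDR available from IPAM; the names (`localPodCIDR`, `eligibleForInitialGC`) and the example addresses are illustrative only, not the agent's real configuration:

```go
package main

import (
	"fmt"
	"net/netip"
)

// localPodCIDR is an illustrative stand-in for the pod CIDR allocated to
// this node; in a real agent it would come from IPAM/node discovery.
var localPodCIDR = netip.MustParsePrefix("10.0.1.0/24")

// isLocalPodIP reports whether an address falls inside the node's pod CIDR.
func isLocalPodIP(ip netip.Addr) bool {
	return localPodCIDR.Contains(ip)
}

// eligibleForInitialGC sketches the proposal above: only entries where at
// least one side of the tuple is a local pod IP are candidates for the
// initial GC; external flows are left untouched.
func eligibleForInitialGC(src, dst netip.Addr) bool {
	return isLocalPodIP(src) || isLocalPodIP(dst)
}

func main() {
	client := netip.MustParseAddr("203.0.113.7") // external client
	pod := netip.MustParseAddr("10.0.1.42")      // local pod
	node := netip.MustParseAddr("192.0.2.10")    // node/host IP, outside the pod CIDR
	fmt.Println(eligibleForInitialGC(client, pod))  // true: flow touches a local pod
	fmt.Println(eligibleForInitialGC(client, node)) // false: external/host flow, left alone
}
```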
In which case is this conntrack entry created? Why is the traffic bouncing through a node that doesn't host the target endpoint IP? The most common case for this would be NodePort, but those are exposed on host IPs, so I'd expect the conntrack entries to reflect public IP -> host IP, in which case the existing host IP filtering should handle it. So presumably there's a different scenario in your environment.
This case should already be handled by the current logic, unless the host IP detection logic is failing to detect a relevant host IP.
This is an interesting idea, but it relies on each node having a dedicated CIDR. Not all IPAM modes provide this guarantee, so for other IPAM modes there could still be an issue.
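If that route were pursued, one way to cope with IPAM modes that lack a per-node CIDR would be to fall back to the current restored-endpoint filter when no CIDR is known. The sketch below is an assumption-laden illustration combining the two approaches discussed above, not a concrete proposal from this thread; all names are hypothetical:

```go
package main

import (
	"fmt"
	"net/netip"
)

// initialGCFilter sketches a possible compromise: use the node's pod CIDR
// when one is available, otherwise fall back to matching against restored
// endpoint IPs. Both the optional podCIDR parameter and the function name
// are assumptions made for illustration only.
func initialGCFilter(src, dst netip.Addr, podCIDR *netip.Prefix, restored map[netip.Addr]struct{}) bool {
	if podCIDR != nil {
		// Node has a dedicated pod CIDR: GC only flows touching it.
		return podCIDR.Contains(src) || podCIDR.Contains(dst)
	}
	// No per-node CIDR (e.g. some IPAM modes): keep the current behaviour
	// of purging entries that match no restored endpoint IP.
	_, srcLive := restored[src]
	_, dstLive := restored[dst]
	return !srcLive && !dstLive
}

func main() {
	cidr := netip.MustParsePrefix("10.0.1.0/24")
	src := netip.MustParseAddr("10.0.1.42")
	dst := netip.MustParseAddr("203.0.113.7")
	fmt.Println(initialGCFilter(src, dst, &cidr, nil)) // true: flow touches the pod CIDR
}
```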
Force-pushed from d81b2bf to cbe0571
/test
Requesting @xmulligan to review for Community since this PR touches the …
FYI I've hit major merge conflicts backporting this to …
Fixes: #29667