Disable caching for Helm Operator when watching all namespaces to avoid OOMKilled in big clusters #6255
Comments
So, we are aware of some memory issues in the Helm operator. We actually just merged a pull request that might help with this problem. We should be cutting a release this week or so; could you try rebuilding your operator with the new version and see if that helps? If not, we can try adding something like this.
@jberkhahn @joelanford Thanks for the follow-up. Will try the latest PR. Any info on the possibility of disabling the manager cache with the helm-operator?
Another thought: when we create dynamic watches for the operand objects, do we use a label selector to ensure we only watch and cache the subset of those objects that are actually managed by the operator? I'm guessing we don't, and that if we did, it would be a significant improvement, especially on larger clusters.
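To illustrate the idea, here is a minimal, stdlib-only sketch of a watch cache that drops non-matching objects at event time. The types and names here are hypothetical, not the operator-sdk or controller-runtime API; the point is that the cache then scales with the number of managed objects rather than the size of the cluster:

```go
package main

import "fmt"

// object is a stand-in for a watched Kubernetes resource.
type object struct {
	name   string
	labels map[string]string
}

// filteredCache only stores objects whose labels match its selector,
// mimicking a label-selector-filtered informer cache.
type filteredCache struct {
	selector map[string]string // required label key/value pairs
	store    map[string]object
}

func (c *filteredCache) matches(o object) bool {
	for k, v := range c.selector {
		if o.labels[k] != v {
			return false
		}
	}
	return true
}

// onEvent is called for every watch event; non-matching objects are
// discarded instead of being cached.
func (c *filteredCache) onEvent(o object) {
	if c.matches(o) {
		c.store[o.name] = o
	}
}

func main() {
	c := &filteredCache{
		selector: map[string]string{"app.kubernetes.io/managed-by": "Helm"},
		store:    map[string]object{},
	}
	c.onEvent(object{name: "managed", labels: map[string]string{"app.kubernetes.io/managed-by": "Helm"}})
	c.onEvent(object{name: "unrelated", labels: map[string]string{}})
	fmt.Println(len(c.store)) // only the managed object is cached
}
```

In a cluster with thousands of unrelated objects, this filtering is the difference between caching everything and caching only what the operator owns.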
@jberkhahn Adding some links here in case it helps during investigation:
Place where we set up the reconciler for each API defined in watches.yaml:
This is where we add watches for resources in released Helm charts:
So, it looks like you can make a predicate based on a label selector and apply it in WatchDependentResources(). But how would you make all the dependent resources have the same label if they're created just by rendering the default Helm chart?
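One answer, sketched below with hypothetical names rather than the actual helm-operator code, is to inject a common owner label into every rendered object before it is applied, so that a selector-based predicate can match exactly the chart's resources. The label key used here is an assumption for illustration:

```go
package main

import "fmt"

// manifest is a stand-in for one object rendered from a Helm chart.
type manifest struct {
	kind   string
	labels map[string]string
}

// injectOwnerLabel adds a common label (hypothetical key) identifying the
// owning custom resource to every rendered object, preserving any labels
// the chart already set.
func injectOwnerLabel(objs []manifest, crName string) []manifest {
	for i := range objs {
		if objs[i].labels == nil {
			objs[i].labels = map[string]string{}
		}
		objs[i].labels["helm.sdk.operatorframework.io/owner"] = crName
	}
	return objs
}

func main() {
	rendered := []manifest{{kind: "Deployment"}, {kind: "Service"}}
	for _, m := range injectOwnerLabel(rendered, "my-release") {
		fmt.Println(m.kind, m.labels["helm.sdk.operatorframework.io/owner"])
	}
}
```

With this in place, the watch predicate and the label injection agree on one selector, so dependent resources are matched regardless of what the chart templates themselves emit.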
I've been taking a look at this. There are two things to do:
I'll try to get a PR up here that uses my controller-runtime fork, so that folks can test it.
WIP PR here: #6377 @haywoodsh This does not completely eliminate caching, but I expect it to significantly improve the situation. It's the best of both worlds: you still get caching, which reduces load on the apiserver and enables near-immediate reconciliation when operand resources are changed or deleted, but now the cache contains just the objects that are actually being managed by the operator. So the memory usage scales according to the number of CRs, not the size of the cluster. If possible, please try out a build of the helm-operator from my branch and let me know if this solves your problem.
I referred to https://sdk.operatorframework.io/docs/building-operators/helm/quickstart/ and installed per the suggestion from @madorn, increasing the memory limit of
btw, I noticed the steps in https://sdk.operatorframework.io/docs/building-operators/helm/quickstart/ are incorrect; let me know if I can submit a PR to correct them.
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. If this issue is safe to close now, please do so with /close.

/lifecycle stale
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle rotten. If this issue is safe to close now, please do so with /close.

/lifecycle rotten
Rotten issues close after 30d of inactivity. Reopen the issue by commenting /reopen.

/close
@openshift-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@joelanford we have a partner asking about this, as a mutual customer is still facing issues. Is your WIP PR in a polished enough state to take to main?
Just rebased it. I was looking for feedback (which I don't think I ever got) to make sure it solved the problem.
Feature Request
Describe the problem you need a feature to resolve.
When a Helm operator is installed in a cluster with many namespaces and Kubernetes objects deployed, the controller caches all of those objects, consumes a lot of memory, and causes OOMKilled events. Watching only a single namespace could reduce memory usage, but our use case requires watching multiple namespaces.
Describe the solution you'd like.
Disable caching for the helm-operator when watching all namespaces.
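To make the trade-off in this request concrete: a cache-backed reader holds a copy of every watched object in memory (fast reads, high memory), while an uncached reader fetches from the apiserver on demand (no memory cost, more apiserver load). A stdlib-only simulation with hypothetical types, not the controller-runtime client API:

```go
package main

import "fmt"

// reader abstracts how an operator reads cluster state.
type reader interface {
	get(name string) string
	cached() int // number of objects held in memory
}

// cachedReader keeps a local copy of every watched object.
type cachedReader struct{ store map[string]string }

func (r cachedReader) get(name string) string { return r.store[name] }
func (r cachedReader) cached() int            { return len(r.store) }

// liveReader goes to the (simulated) apiserver on every read and holds nothing.
type liveReader struct{ api map[string]string }

func (r liveReader) get(name string) string { return r.api[name] }
func (r liveReader) cached() int            { return 0 }

func main() {
	api := map[string]string{"cm-1": "v1", "cm-2": "v2"}
	readers := []reader{
		cachedReader{store: api}, // in a real operator, filled by watches
		liveReader{api: api},
	}
	for _, r := range readers {
		fmt.Println(r.get("cm-1"), r.cached())
	}
}
```

Disabling the cache entirely trades memory for apiserver traffic and slower reconciles; the selector-filtered cache discussed above is a middle ground between these two readers.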
/language helm