
Disable caching for Helm Operator when watching all namespaces to avoid OOMKilled in big clusters #6255

Closed
haywoodsh opened this issue Jan 20, 2023 · 14 comments · Fixed by #6377
Labels: language/helm (Issue is related to a Helm operator project), lifecycle/frozen (Indicates that an issue or PR should not be auto-closed due to staleness.)

@haywoodsh

Feature Request

Describe the problem you need a feature to resolve.

When a Helm operator is installed in a cluster with many namespaces and many Kubernetes objects deployed, the controller caches all of those objects, consuming a lot of memory and causing OOMKilled events. Watching only one namespace could reduce memory usage, but our use case requires watching multiple namespaces.

Describe the solution you'd like.

Disable caching for helm-operator when watching all namespaces.

/language helm

@openshift-ci openshift-ci bot added the language/helm Issue is related to a Helm operator project label Jan 20, 2023
@haywoodsh haywoodsh changed the title Disable caching for Helm Operator to avoid OOMKilled when watching all namespaces in big clusters Disable caching for Helm Operator when watching all namespaces to avoid OOMKilled in big clusters Jan 20, 2023
@jberkhahn
Contributor

So, we are aware of some memory issues in the Helm operator. We actually just merged a pull request that might help with this problem. We should be cutting a release this week or so; could you try rebuilding your operator with the new version and see if that helps? If not, we can try adding something like this.

@jberkhahn jberkhahn added this to the Backlog milestone Jan 23, 2023
@jberkhahn jberkhahn added the triage/needs-information Indicates an issue needs more information in order to work on it. label Jan 23, 2023
@jberkhahn jberkhahn self-assigned this Jan 23, 2023
@madorn
Member

madorn commented Jan 24, 2023

@jberkhahn @joelanford Thanks for the follow up. Will try the latest PR.

Any info on the possibility of disabling manager cache with helm-operator?

@jberkhahn jberkhahn removed their assignment Mar 15, 2023
@jberkhahn jberkhahn removed this from the Backlog milestone Mar 15, 2023
@joelanford
Member

Another thought: when we create dynamic watches for the operand objects, do we use a label selector to ensure we only watch and cache the subset of those objects that are actually managed by the operator? I'm guessing we don't and that if we did, it would be a significant improvement, especially on larger clusters.

@theishshah theishshah added this to the Backlog milestone Mar 20, 2023
@varshaprasad96
Member

@jberkhahn Adding some links here in case it helps during investigation:

Use of label selectors when starting the manager to selectively cache objects: https://github.com/kubernetes-sigs/controller-runtime/blob/main/designs/use-selectors-at-cache.md

Where we set up the controller to watch the Helm CR:

type HelmOperatorReconciler struct {

Where we set up the reconciler for each API defined in watches.yaml:

err := controller.Add(mgr, controller.WatchOptions{

Where we add watches for resources in released Helm charts:

func watchDependentResources(mgr manager.Manager, r *HelmOperatorReconciler, c controller.Controller) {

Since we use the upstream Helm logic to install/uninstall/upgrade the release, I think adding label selectors to cache objects would belong in watchDependentResources.
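
To make the linked design doc concrete, here is a minimal sketch (not the operator-sdk code) of restricting the manager's cache with a label selector, assuming controller-runtime v0.15 or newer where manager.Options exposes a Cache field; the label key and value are hypothetical:

package main

import (
	"k8s.io/apimachinery/pkg/labels"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
)

func main() {
	// Only objects carrying this (hypothetical) label are listed, watched, and cached.
	sel := labels.SelectorFromSet(labels.Set{
		"helm.sdk.operatorframework.io/managed-by": "my-helm-operator",
	})

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Cache: cache.Options{
			// DefaultLabelSelector applies to every informer the cache starts.
			DefaultLabelSelector: sel,
		},
	})
	if err != nil {
		panic(err)
	}
	_ = mgr
}

With a selector like this, the informers never store unlabeled objects, which is what keeps memory from scaling with the size of the cluster.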

@jberkhahn
Contributor

So, it looks like you can make a predicate based on a label selector and apply it in watchDependentResources(), but how would you make all the dependent resources carry the same label if they're created by just chunking the default Helm chart?
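
For reference, controller-runtime already ships a label-selector predicate, so the filtering half of this idea could look roughly like the sketch below. The chart label and its value are hypothetical, and this is only an illustration, not the helm-operator implementation:

package main

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// chartLabelPredicate filters events down to objects carrying a hypothetical
// chart label, so unlabeled cluster objects never trigger reconciliation.
func chartLabelPredicate() (predicate.Predicate, error) {
	return predicate.LabelSelectorPredicate(metav1.LabelSelector{
		MatchLabels: map[string]string{
			"helm.sdk.operatorframework.io/chart": "nginx-1.2.3", // illustrative value
		},
	})
}

func main() {}

The open question above still stands: the predicate only helps if the dependent resources actually carry the label in the first place.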

@joelanford
Member

I've been taking a look at this. There are two things to do:

  1. For the cache configuration, we need to configure a custom NewCache function on the manager options prior to calling manager.New. This is actually not possible right now due to a bug in controller-runtime. I've got a PR up that fixes an issue helm-operator runs into because the primary CR is unstructured.Unstructured: "🐛 Preserve unstructured object GVKs in cache.Options.ByObject" (kubernetes-sigs/controller-runtime#2246).

  2. As @jberkhahn says, we need some label to key off of. My suggestion would be to add a new label, something along the lines of helm.sdk.operatorframework.io/chart: <chartName>-<chartVersion>. The easiest spot to add that is in the ownerRefInjectingClient, but it feels a bit dirty to inject a label via a client whose name says it injects owner refs. Perhaps we rename that client to something more generic so we feel better about injecting a few different things (see the sketch after this comment).

I'll try to get a PR up here, that uses my controller-runtime fork so that folks can test it.
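
As a rough illustration of point 2, a label-injecting client wrapper (by analogy with the existing ownerRefInjectingClient) might look like the sketch below; the type name and label map are hypothetical, not the eventual implementation:

package client

import (
	"context"

	crclient "sigs.k8s.io/controller-runtime/pkg/client"
)

// labelInjectingClient wraps a controller-runtime client and stamps every
// object it creates with a fixed set of labels (hypothetical name).
type labelInjectingClient struct {
	crclient.Client
	labels map[string]string
}

// Create adds the configured labels before delegating to the wrapped client,
// so dependent resources can later be matched by cache and watch selectors.
func (c labelInjectingClient) Create(ctx context.Context, obj crclient.Object, opts ...crclient.CreateOption) error {
	lbls := obj.GetLabels()
	if lbls == nil {
		lbls = map[string]string{}
	}
	for k, v := range c.labels {
		lbls[k] = v
	}
	obj.SetLabels(lbls)
	return c.Client.Create(ctx, obj, opts...)
}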

@joelanford
Member

WIP PR here: #6377

@haywoodsh This does not completely eliminate caching, but I expect it to significantly improve the situation. It's the best of both worlds: you still get caching, which reduces load on the apiserver and enables near-immediate reconciliation when operand resources are changed or deleted, but now the cache contains just the objects that are actually managed by the operator. So the memory usage scales with the number of CRs, not the size of the cluster.

If possible, please try out a build of the helm-operator from my branch and let me know if this will solve your problem.

@vhosakot

I followed https://sdk.operatorframework.io/docs/building-operators/helm/quickstart/, installed the nginx-operator, and saw it get OOMKilled continuously on OpenShift 4.11.

Per a suggestion from @madorn, increasing the memory limit of the nginx-operator-controller-manager deployment to 4Gi fixed the OOM crash of the nginx-operator-controller-manager pod. Thanks @madorn:

% kubectl get deploy nginx-operator-controller-manager -o yaml
...
        image: gcr.io/kubebuilder/kube-rbac-proxy:v0.13.1
        resources:
          limits:
            memory: 4Gi  # increased
...
        image: vhosakot/nginx-operator:v0.0.1
        resources:
          limits:
            memory: 4Gi  # increased

$ kubectl get pods
...
nginx-operator-controller-manager-56fff5db64-q9rrg    2/2     Running     0          15m
...

By the way, I noticed the steps in https://sdk.operatorframework.io/docs/building-operators/helm/quickstart/ are incorrect; let me know if I can submit a PR to correct them.

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 30, 2023
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 30, 2023
@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this as completed Aug 30, 2023
@openshift-ci

openshift-ci bot commented Aug 30, 2023

@openshift-bot: Closing this issue.


@joelanford joelanford added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed triage/needs-information Indicates an issue needs more information in order to work on it. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Sep 19, 2023
@joelanford joelanford reopened this Sep 19, 2023
@acornett21
Contributor

@joelanford we have a partner asking about this, as a mutual customer is still facing issues. Is your WIP PR in a polished enough state to take to main?

@joelanford
Member

Just rebased it. I was looking for feedback (that I don't think I ever got) to make sure it solved the problem.
