
Disable caching for Helm Operator when watching all namespaces to avoid OOMKilled in big clusters #6255

Closed
haywoodsh opened this issue Jan 20, 2023 · 14 comments · Fixed by #6377
Labels: language/helm (Issue is related to a Helm operator project), lifecycle/frozen (Indicates that an issue or PR should not be auto-closed due to staleness.)

@haywoodsh

Feature Request

Describe the problem you need a feature to resolve.

When a Helm operator is installed in a cluster with many namespaces and many Kubernetes objects deployed, the controller caches all of those objects, consuming a lot of memory and causing OOMKilled events. Watching only one namespace could reduce memory usage, but our use case requires watching multiple namespaces.

Describe the solution you'd like.

Disable caching for helm-operator when watching all namespaces.

/language helm

@openshift-ci openshift-ci bot added the language/helm Issue is related to a Helm operator project label Jan 20, 2023
@haywoodsh haywoodsh changed the title Disable caching for Helm Operator to avoid OOMKilled when watching all namespaces in big clusters Disable caching for Helm Operator when watching all namespaces to avoid OOMKilled in big clusters Jan 20, 2023
@jberkhahn
Contributor

So, we are aware of some memory issues in the Helm operator. We actually just merged a pull request that might help with this problem. We should be cutting a release this week or so; could you try rebuilding your operator with the new version and see if that helps? If not, we can try adding something like this.

@jberkhahn jberkhahn added this to the Backlog milestone Jan 23, 2023
@jberkhahn jberkhahn added the triage/needs-information Indicates an issue needs more information in order to work on it. label Jan 23, 2023
@jberkhahn jberkhahn self-assigned this Jan 23, 2023
@madorn
Member

madorn commented Jan 24, 2023

@jberkhahn @joelanford Thanks for the follow up. Will try the latest PR.

Any info on the possibility of disabling manager cache with helm-operator?

@jberkhahn jberkhahn removed their assignment Mar 15, 2023
@jberkhahn jberkhahn removed this from the Backlog milestone Mar 15, 2023
@joelanford
Member

Another thought: when we create dynamic watches for the operand objects, do we use a label selector to ensure we only watch and cache the subset of those objects that are actually managed by the operator? I'm guessing we don't and that if we did, it would be a significant improvement, especially on larger clusters.

@theishshah theishshah added this to the Backlog milestone Mar 20, 2023
@varshaprasad96
Member

@jberkhahn Adding some links here in case it helps during investigation:

Use of label selectors when starting the manager to selectively cache objects: https://github.com/kubernetes-sigs/controller-runtime/blob/main/designs/use-selectors-at-cache.md

Where we set up the controller to watch the Helm CR:

type HelmOperatorReconciler struct {

Where we set up the reconciler for each API defined in watches.yaml:

err := controller.Add(mgr, controller.WatchOptions{

Where we add watches for resources in released Helm charts:

func watchDependentResources(mgr manager.Manager, r *HelmOperatorReconciler, c controller.Controller) {

Since we use the upstream Helm logic to install/uninstall/upgrade the release, I think adding label selectors to cache objects would belong in watchDependentResources.
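
To make the linked design doc concrete, here is a minimal sketch (not the operator-sdk code) of restricting the manager's cache with a label selector, assuming controller-runtime v0.15 or newer where manager.Options exposes a Cache field; the label key and value are hypothetical:

package main

import (
	"k8s.io/apimachinery/pkg/labels"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
)

func main() {
	// Only objects carrying this (hypothetical) label are listed, watched, and cached.
	sel := labels.SelectorFromSet(labels.Set{
		"helm.sdk.operatorframework.io/managed-by": "my-helm-operator",
	})

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Cache: cache.Options{
			// DefaultLabelSelector applies to every informer the cache starts.
			DefaultLabelSelector: sel,
		},
	})
	if err != nil {
		panic(err)
	}
	_ = mgr
}

With a selector like this, the informers never store unlabeled objects, which is what keeps memory from scaling with the size of the cluster.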

@jberkhahn
Contributor

So, it looks like you can make a predicate based on a label selector and apply it in watchDependentResources(), but how would you make all the dependent resources carry the same label if they're created by just chunking the default Helm chart?
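
For reference, controller-runtime already ships a label-selector predicate, so the filtering half of this idea could look roughly like the sketch below. The chart label and its value are hypothetical, and this is only an illustration, not the helm-operator implementation:

package main

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// chartLabelPredicate filters events down to objects carrying a hypothetical
// chart label, so unlabeled cluster objects never trigger reconciliation.
func chartLabelPredicate() (predicate.Predicate, error) {
	return predicate.LabelSelectorPredicate(metav1.LabelSelector{
		MatchLabels: map[string]string{
			"helm.sdk.operatorframework.io/chart": "nginx-1.2.3", // illustrative value
		},
	})
}

func main() {}

The open question above still stands: the predicate only helps if the dependent resources actually carry the label in the first place.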

@joelanford
Member

I've been taking a look at this. There are two things to do:

  1. For the cache configuration, we need to configure a custom NewCache function on the manager options prior to calling manager.New. This is actually not possible right now due to a bug in controller-runtime. I've got a PR up that fixes an issue helm-operator runs into because the primary CR is unstructured.Unstructured: "🐛 Preserve unstructured object GVKs in cache.Options.ByObject" (kubernetes-sigs/controller-runtime#2246).

  2. As @jberkhahn says, we need some label to key off of. My suggestion would be to add a new label, something along the lines of helm.sdk.operatorframework.io/chart: <chartName>-<chartVersion>. The easiest spot to add that is in the ownerRefInjectingClient, but it feels a bit dirty to inject a label via a client whose name says it injects owner refs. Perhaps we rename that client to something more generic so we feel better about injecting a few different things (see the sketch after this comment).

I'll try to get a PR up here, that uses my controller-runtime fork so that folks can test it.
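
As a rough illustration of point 2, a label-injecting client wrapper (by analogy with the existing ownerRefInjectingClient) might look like the sketch below; the type name and label map are hypothetical, not the eventual implementation:

package client

import (
	"context"

	crclient "sigs.k8s.io/controller-runtime/pkg/client"
)

// labelInjectingClient wraps a controller-runtime client and stamps every
// object it creates with a fixed set of labels (hypothetical name).
type labelInjectingClient struct {
	crclient.Client
	labels map[string]string
}

// Create adds the configured labels before delegating to the wrapped client,
// so dependent resources can later be matched by cache and watch selectors.
func (c labelInjectingClient) Create(ctx context.Context, obj crclient.Object, opts ...crclient.CreateOption) error {
	lbls := obj.GetLabels()
	if lbls == nil {
		lbls = map[string]string{}
	}
	for k, v := range c.labels {
		lbls[k] = v
	}
	obj.SetLabels(lbls)
	return c.Client.Create(ctx, obj, opts...)
}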

@joelanford
Member

WIP PR here: #6377

@haywoodsh This does not completely eliminate caching, but I expect it to significantly improve the situation. It's the best of both worlds: you still get caching, which reduces load on the apiserver and enables near-immediate reconciliation when operand resources are changed or deleted, but now the cache contains just the objects that are actually managed by the operator. So the memory usage scales with the number of CRs, not the size of the cluster.

If possible, please try out a build of the helm-operator from my branch and let me know if this will solve your problem.

@vhosakot

I followed https://sdk.operatorframework.io/docs/building-operators/helm/quickstart/, installed the nginx-operator, and saw it get OOMKilled continuously on OpenShift 4.11.

Per a suggestion from @madorn, increasing the memory limit of the nginx-operator-controller-manager deployment to 4Gi fixed the OOM crash of the nginx-operator-controller-manager pod. Thanks @madorn:

% kubectl get deploy nginx-operator-controller-manager -o yaml
...
        image: gcr.io/kubebuilder/kube-rbac-proxy:v0.13.1
        resources:
          limits:
            memory: 4Gi  # increased
...
        image: vhosakot/nginx-operator:v0.0.1
        resources:
          limits:
            memory: 4Gi  # increased

$ kubectl get pods
...
nginx-operator-controller-manager-56fff5db64-q9rrg    2/2     Running     0          15m
...

By the way, I noticed the steps in https://sdk.operatorframework.io/docs/building-operators/helm/quickstart/ are incorrect; let me know if I can submit a PR to correct them.

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 30, 2023
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 30, 2023
@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this as completed Aug 30, 2023
@openshift-ci

openshift-ci bot commented Aug 30, 2023

@openshift-bot: Closing this issue.


@joelanford joelanford added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed triage/needs-information Indicates an issue needs more information in order to work on it. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Sep 19, 2023
@joelanford joelanford reopened this Sep 19, 2023
@acornett21
Contributor

@joelanford we have a partner asking about this, as a mutual customer is still facing issues. Is your WIP PR in a polished enough state to take to main?

@joelanford
Member

Just rebased it. I was looking for feedback (that I don't think I ever got) to make sure it solved the problem.
