
Suspected memory leak in the Rook operator #14051

Closed
cupnes opened this issue Apr 10, 2024 · 22 comments
@cupnes
Contributor

cupnes commented Apr 10, 2024

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior:
Rook operator memory usage continues to increase. The graph below shows the container_memory_working_set_bytes metric for the Rook operator. The metric was reset because the node was restarted around 10:00 on 8 April.

rook operator memory used
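The graph was produced with a query along these lines (the label values here are illustrative, not our exact ones):

    # Working-set memory of the Rook operator container, as reported by cAdvisor
    container_memory_working_set_bytes{namespace="rook-ceph", pod=~"rook-ceph-operator-.*", container="rook-ceph-operator"}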

Expected behavior:
Memory usage does not continue to increase.

How to reproduce it (minimal and precise):

File(s) to submit:

  • Cluster CR (custom resource), typically called cluster.yaml, if necessary

Logs to submit:

Cluster Status to submit:

  • Result of ceph status command:
      cluster:
        id:     71c19227-1fd2-44ba-a883-00f8248ad926
        health: HEALTH_OK
     
      services:
        mon: 3 daemons, quorum a,b,c (age 4d)
        mgr: a(active, since 4d), standbys: b
        osd: 3 osds: 3 up (since 5m), 3 in (since 4d)
     
      data:
        pools:   2 pools, 64 pgs
        objects: 3 objects, 577 KiB
        usage:   24 MiB used, 3.0 TiB / 3 TiB avail
        pgs:     64 active+clean
    

Environment:

  • OS (e.g. from /etc/os-release): Ubuntu 22.04.1
  • Kernel (e.g. uname -a): 5.15.0-58-generic
  • Rook version (use rook version inside of a Rook Pod): 1.13.1
  • Storage backend version (e.g. for ceph do ceph -v): 17.2.6
  • Kubernetes version (use kubectl version): 1.27.10
@cupnes cupnes added the bug label Apr 10, 2024
@travisn
Member

travisn commented Apr 10, 2024

Your operator log is full of error messages such as these:

2024-04-10 04:26:26.731566 E | ceph-cluster-controller: failed to reconcile CephCluster "ceph-canary-object-store/ceph-canary-object-store". failed to get cephCluster: unable to get: ceph-canary-object-store/ceph-canary-object-store because of unknown namespace for the cache
2024-04-10 04:26:26.731593 E | ceph-cluster-controller: failed to reconcile CephCluster "ceph-dotcom-blob-0/ceph-dotcom-blob-0". failed to get cephCluster: unable to get: ceph-dotcom-blob-0/ceph-dotcom-blob-0 because of unknown namespace for the cache
2024-04-10 04:26:26.731616 E | ceph-cluster-controller: failed to reconcile CephCluster "ceph-object-store/ceph-object-store". failed to get cephCluster: unable to get: ceph-object-store/ceph-object-store because of unknown namespace for the cache

Did you delete these clusters, or perform some other unexpected operation? If you clean up these errors, the operator will likely stop growing in memory like that.

@cupnes
Contributor Author

cupnes commented Apr 11, 2024

@travisn

Did you delete these clusters, or perform some other unexpected operation? If you clean up these errors, the operator will likely stop growing in memory like that.

These clusters still exist.

As additional information, currentNamespaceOnly is set to true for this Rook operator, so it is strange that it tries to get CephCluster CRs from other namespaces.

I will look into the details. If you know of any good ways to investigate this, please let me know.
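For reference, the setting is applied in our operator Deployment roughly like this (a trimmed sketch; everything other than ROOK_CURRENT_NAMESPACE_ONLY is abbreviated or illustrative):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: rook-ceph-operator
      namespace: ceph-poc              # the operator's own namespace
    spec:
      template:
        spec:
          containers:
            - name: rook-ceph-operator
              image: rook/ceph:v1.13.1
              env:
                # Restrict the operator to watching CRs in its own namespace only
                - name: ROOK_CURRENT_NAMESPACE_ONLY
                  value: "true"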

@travisn
Member

travisn commented Apr 11, 2024

Do you have a separate operator for each namespace, each with currentNamespaceOnly set? Then it seems like a bug that it is getting events for these other namespaces.

@cupnes
Contributor Author

cupnes commented Apr 12, 2024

@travisn
Yes, it is set up that way.
Should I rewrite the title and details of this issue to describe a bug where the operator receives events from other namespaces even though currentNamespaceOnly is set? Or should I create a new issue?

@cupnes
Contributor Author

cupnes commented Apr 12, 2024

@travisn

Then it seems like a bug that it is getting events for these other namespaces.

I'll try to investigate this issue, thanks.

BTW, I found that these log messages come from controller-runtime; more specifically, they are emitted by the MultiNamespaceCache.

https://github.com/kubernetes-sigs/controller-runtime/blob/a7508decdfef48c1c022925df120df386bc2381e/pkg/cache/multi_namespace_cache.go#L242

However, I'm not sure whether a MultiNamespaceCache-related problem is what triggers the high memory consumption. I'd be glad to know if you have encountered the same or a similar issue.
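To illustrate where that message comes from, here is a minimal controller-runtime sketch (assuming controller-runtime v0.16+ options; the namespace value is illustrative). When the manager's cache is limited to specific namespaces, a Get or List for an object in any other namespace fails with the "unknown namespace for the cache" error seen in the logs above:

    package main

    import (
        ctrl "sigs.k8s.io/controller-runtime"
        "sigs.k8s.io/controller-runtime/pkg/cache"
    )

    func main() {
        // Build a manager whose cache only covers the listed namespaces.
        // Internally this is a multi-namespace cache; requests for objects in
        // any other namespace are rejected with
        // "unable to get: ... because of unknown namespace for the cache".
        mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
            Cache: cache.Options{
                DefaultNamespaces: map[string]cache.Config{
                    "rook-ceph": {}, // illustrative namespace
                },
            },
        })
        if err != nil {
            panic(err)
        }
        _ = mgr
    }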

@travisn
Member

travisn commented Apr 12, 2024

@BlaineEXE Any ideas on this multi-namespace issue with controller runtime?

@BlaineEXE
Member

Based on my reading, I think this is controller-runtime behaving as expected. This thread (from 2022) [1] talks about how c-r can filter what objects get cached. But because Rook watches all namespaces and because of the number of different types of resources Rook manages, I think this tends to be a lot of stuff. It seems like there are probably some improvements Rook could make to help keep the cache size down, but I think that may be a bit of a tall order.

For example, the dapr project did so here (big PR): dapr/dapr#5944
Elastic did so here: https://github.com/elastic/cloud-on-k8s/pull/5187/files
There are others, but these 2 seem like the most relevant to Rook.

Each project I look at has done something slightly different, so we need to make sure the fix is tailored to Rook and considers all configurations that users might use (like multi-namespace CephObjectStoreUsers, OBCs, etc.). This isn't something that we can implement without detailed discussions and good proof-of-concept investigations.

[1] https://groups.google.com/g/operator-framework/c/AIiDgRPJc00
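As a rough illustration of the kind of cache trimming those PRs do (this is only a sketch of the controller-runtime v0.16+ options, not a proposal for Rook's actual implementation; the label selector is an assumption):

    package main

    import (
        corev1 "k8s.io/api/core/v1"
        "k8s.io/apimachinery/pkg/labels"
        ctrl "sigs.k8s.io/controller-runtime"
        "sigs.k8s.io/controller-runtime/pkg/cache"
        "sigs.k8s.io/controller-runtime/pkg/client"
    )

    func main() {
        mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
            Cache: cache.Options{
                ByObject: map[client.Object]cache.ByObject{
                    // Cache only the Secrets and ConfigMaps that carry an
                    // operator-specific label, instead of every one in the
                    // watched namespaces.
                    &corev1.Secret{}:    {Label: labels.SelectorFromSet(labels.Set{"app": "rook-ceph"})},
                    &corev1.ConfigMap{}: {Label: labels.SelectorFromSet(labels.Set{"app": "rook-ceph"})},
                },
            },
        })
        if err != nil {
            panic(err)
        }
        _ = mgr
    }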

@BlaineEXE
Member

BlaineEXE commented Apr 12, 2024

I thought about this a little more also.

First thought:

This is somewhat related to the discussions we've been having about separating Rook into a main operator and per-cluster sub-operators. At least, my thought is that it might make the most sense to do that work first and then this work, because the sub-operator work is going to have to consider some of the same questions as this issue anyway.
Link: #13848

Second thought:

While Rook uses controller-runtime today, Rook was created before c-r existed. Rook primarily uses the Kubernetes API directly to get/update/delete resources, rather than using the controller-runtime client API. What this means is that a lot (most?) of the API calls Rook makes go directly to Kubernetes rather than getting fast access via controller-runtime's cache. This is really non-ideal, but we haven't made moves to fix it because changing all these API uses is a huge undertaking. While changing the API usage itself wouldn't be too hard, the different API methods require substantially different unit test setups, and the downsides are generally fairly minor (like unnecessary k8s API calls and occasional reconcile errors requiring retry).
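As a sketch of the difference (the ConfigMap name here is just for illustration):

    package example

    import (
        "context"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/types"
        "k8s.io/client-go/kubernetes"
        "sigs.k8s.io/controller-runtime/pkg/client"
    )

    // readConfigMapBothWays shows the two access paths side by side.
    func readConfigMapBothWays(ctx context.Context, cs kubernetes.Interface, c client.Client, ns string) error {
        // How most of Rook reads resources today: a direct call to the
        // Kubernetes API server via client-go.
        if _, err := cs.CoreV1().ConfigMaps(ns).Get(ctx, "example-config", metav1.GetOptions{}); err != nil {
            return err
        }

        // The same read through the controller-runtime client, which is served
        // from the informer cache when the manager's cache is running.
        var cm corev1.ConfigMap
        return c.Get(ctx, types.NamespacedName{Namespace: ns, Name: "example-config"}, &cm)
    }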

@cupnes
Contributor Author

cupnes commented Apr 16, 2024

@travisn @BlaineEXE
Thank you for your comments. I understand that various considerations need to be taken into account when fixing this error.
On the other hand, I don't know whether this error is actually the cause of the memory leak. Why do you think it is the cause? Is it possible that it is unrelated?

@BlaineEXE
Member

As you said, I'm not certain that there is a memory "leak" per se. That implies that memory is being allocated but not de-allocated, and I'm not certain that's what is happening. Golang has good memory management and garbage collection features.

I'm also not certain that the controller-runtime cache is the issue. It's a guess (I think a good guess) but I can't confirm it.

A good way to dig into this further and confirm whether the controller-runtime cache is an issue is to look at the metrics exported by controller-runtime in the Rook binary. I would hope that cache size is an available metric. If the cache size is constant while the pod's memory consumption increases, then we can say that it isn't the cache, contrary to what we have been predicting/guessing up to now. If the cache size is ever-increasing, then our guess may be on the right track.

Even with those metrics, it may require some tracing/profiling of the Rook go binary to really determine and confirm what is consuming the ever-growing memory. It may require creating a debug build of Rook. It may also require us to make some code/config changes to periodically dump a list of cached items to see what is being cached. Are there duplicates? Unlikely. Are there cached resources that Rook has no need to watch? That seems likely to me.
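For example, a debug build could expose Go's built-in heap profiler with something like this (the port and the idea of wiring it into Rook's startup are assumptions on my part):

    package debugprofiling

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
    )

    // StartPprofServer serves Go's profiling endpoints so heap profiles can be
    // pulled from the running operator and compared over time.
    func StartPprofServer() {
        go func() {
            // Port 6060 is an arbitrary choice for this sketch.
            log.Println(http.ListenAndServe("localhost:6060", nil))
        }()
    }

With that in place, a kubectl port-forward to the operator pod plus go tool pprof against /debug/pprof/heap would let us compare heap profiles over time and see which allocations keep growing.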

@BlaineEXE
Member

Some questions for @cupnes regarding the initial screenshot in the description:

  1. How long was the green-colored (from the graph) operator running before it got restarted? (my best estimation based on the graph is 4.5 days)
  2. How many nodes are in the k8s cluster?
  3. How many pods in the rook-ceph namespace?
  4. How many pods are in the cluster counting all namespaces?
  5. How many secrets are in the cluster counting all namespaces?
  6. How many configmaps are in the cluster among all namespaces?

Does updating to Rook v1.13.8 fix the issue?
Does updating to Rook v1.14.z fix the issue?
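If it helps, a patch upgrade of just the operator is typically an image bump along these lines (assuming the default deployment and container names):

    kubectl -n rook-ceph set image deploy/rook-ceph-operator rook-ceph-operator=rook/ceph:v1.13.8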

@cupnes
Contributor Author

cupnes commented Apr 17, 2024

@BlaineEXE
Thank you. I understood the method of investigating.

Some questions for @cupnes regarding the initial screenshot in the description:

  1. How long was the green-colored (from the graph) operator running before it got restarted? (my best estimation based on the graph is 4.5 days)

It is about 3.8 days.

  2. How many nodes are in the k8s cluster?

There are 137 nodes in this k8s cluster.

  3. How many pods in the rook-ceph namespace?

I assume you are asking about the number of pods in the namespace that contains the Rook operator discussed in this issue; that is the "ceph-poc" namespace, and it has 102 pods.

  4. How many pods are in the cluster counting all namespaces?

There are 4837 pods.

  5. How many secrets are in the cluster counting all namespaces?

There are 2812 secrets.

  6. How many configmaps are in the cluster among all namespaces?

There are 1656 configmaps.

Does updating to Rook v1.13.8 fix the issue?
Does updating to Rook v1.14.z fix the issue?

Could you please wait for a while? Some work needs to be done before we can do that.

@BlaineEXE
Member

Could you please wait for a while? Some work needs to be done before we can do that.

Yes, and thanks for the additional info @cupnes .

If this is an issue somehow related to the controller-runtime cache, it seems likely that it could be particularly impactful in a k8s cluster as large as yours. 137 nodes and almost 5000 pods is quite a lot.

If we are suspecting a cache issue, the only PR I can think of that made a noteworthy change to the number of namespaces that the Rook operator manages is this one (objectstoreusers in any namespace): #12730. While this seems an unlikely thing to have led to this situation, it's not impossible. @travisn does the timing of that PR and the version release line up? It seems like that PR would have made it into v1.13.0, and I think backported to v1.12.3.

@travisn
Member

travisn commented Apr 17, 2024

If we are suspecting a cache issue, the only PR I can think of that made a noteworthy change to the number of namespaces that the Rook operator manages is this one (objectstoreusers in any namespace): #12730. While this seems an unlikely thing to have led to this situation, it's not impossible. @travisn does the timing of that PR and the version release line up? It seems like that PR would have made it into v1.13.0, and I think backported to v1.12.3.

Yes that feature was included in v1.12.3. I'm not sure either that it would be related. It didn't change any watchers fundamentally, right? Seems like it's just filtering the CRs a bit differently.

@cupnes Did you only see this issue in v1.13.1 and not previously?

@cupnes
Contributor Author

cupnes commented Apr 18, 2024

@travisn

@cupnes Did you only see this issue in v1.13.1 and not previously?

I have seen this issue for a long time. It also occurred in v1.10.7, v1.11.6, and v1.12.5, which we were using during that period.

@travisn
Member

travisn commented Apr 18, 2024

Have you always had these errors (unknown namespace for the cache) in your log?
This still seems related to the issue, since we haven't had other reports of memory increases like this.

@cupnes
Contributor Author

cupnes commented Apr 22, 2024

@travisn
Recently, I have seen the error (unknown namespace for the cache) constantly. It first appeared on 19 January, stopped appearing for a while, and has appeared continuously since 26 January.
We started using Rook v1.13.1 on 16 January, but the memory leak has been occurring since we were using v1.10.7, so the error may be unrelated to the memory leak.

@travisn
Member

travisn commented Apr 22, 2024

@cupnes Thank you for the background on this issue. As indicated by @BlaineEXE, it would require some memory profiling to track this down. We may not have time soon to look at this.

As a workaround, could you set the memory limits on the operator pod? Then it will restart automatically whenever it gets to the desired limit.
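For example, something like this on the rook-ceph-operator Deployment (the values are only examples; pick them based on your observed baseline usage):

    spec:
      template:
        spec:
          containers:
            - name: rook-ceph-operator
              resources:
                requests:
                  memory: 128Mi   # example value
                limits:
                  memory: 512Mi   # example value; the container is OOM-killed and restarted once it exceeds this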

@cupnes
Contributor Author

cupnes commented Apr 23, 2024

@travisn

@cupnes Thank you for the background on this issue. As indicated by @BlaineEXE, it would require some memory profiling to track this down. We may not have time soon to look at this.

I think so too. I will try memory profiling if the issue can be reproduced in an experimental environment.

As a workaround, could you set the memory limits on the operator pod? Then it will restart automatically whenever it gets to the desired limit.

Thank you for the useful information. We will consider applying that workaround.

@cupnes
Contributor Author

cupnes commented May 2, 2024

After updating to Rook v1.13.8, the memory usage no longer continues to increase. Were there any changes between v1.13.1 and v1.13.8 that may have resolved this issue? In my investigation, I could not find any such commits, so I wonder why updating Rook to v1.13.8 resolved it.
screenshot

@travisn
Member

travisn commented May 2, 2024

@cupnes Do you still have the errors (unknown namespace for the cache) in the operator log, or did those errors go away after the upgrade?

@cupnes
Contributor Author

cupnes commented May 7, 2024

@travisn
I don't see that error anymore. It appeared once, about 20 minutes after the upgrade, but it has not appeared for many days since then. (It may be that currentNamespaceOnly is now taking effect due to the upgrade.)
The problem has been resolved, so I am closing this issue. Thank you for your cooperation.

@cupnes cupnes closed this as completed May 7, 2024