Suspected memory leak in the Rook operator #14051
Comments
Your operator log is full of error messages such as these:
Did you delete these clusters, or was there some other unexpected operation? If you clean up these errors, the operator's memory likely will stop growing like that.
These clusters still exist. I will confirm the details and share them as additional information. If you know of any good ways to investigate this, please let me know.
Do you have a separate operator for each namespace and then set the
@travisn
I'll try to investigate this issue, thanks. BTW, I found that these logs come from controller-runtime; specifically, they are about MultiNamespaceCache. However, I'm not sure whether a MultiNamespaceCache-related problem triggers the high memory consumption. I'd be glad to know if you have encountered the same or a similar issue.
@BlaineEXE Any ideas on this multi-namespace issue with controller-runtime?
Based on my reading, I think this is controller-runtime behaving as expected. This thread (from 2022) [1] talks about how c-r can filter which objects get cached. But because Rook watches all namespaces, and because of the number of different types of resources Rook manages, I think this tends to be a lot of stuff.

It seems like there are probably some improvements Rook could make to help keep the cache size down, but that may be a bit of a tall order. For example, the dapr project did so here (big PR): dapr/dapr#5944

Each project I've looked at has done something slightly different, so we need to make sure the fix is tailored to Rook and considers all configurations that users might use (like multi-namespace CephObjectStoreUsers, OBCs, etc.). This isn't something we can implement without detailed discussions and good proof-of-concept investigations.

[1] https://groups.google.com/g/operator-framework/c/AIiDgRPJc00
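For reference, here is a minimal sketch of the kind of per-object filtering the c-r cache supports, assuming controller-runtime v0.15+ option names. The label selector and wiring are illustrative only and are not Rook's current code:

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// newFilteredManager builds a manager whose cache only stores ConfigMaps and
// Secrets carrying an operator-owned label, instead of every such object in
// every watched namespace. The label key/value are placeholders.
func newFilteredManager() (ctrl.Manager, error) {
	sel := labels.SelectorFromSet(labels.Set{"app": "rook-ceph-operator"})
	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Cache: cache.Options{
			ByObject: map[client.Object]cache.ByObject{
				&corev1.ConfigMap{}: {Label: sel},
				&corev1.Secret{}:    {Label: sel},
			},
		},
	})
}

func main() {
	mgr, err := newFilteredManager()
	if err != nil {
		panic(err)
	}
	_ = mgr // a real operator would register controllers and call mgr.Start(ctrl.SetupSignalHandler())
}
```

This only shows the general mechanism; whether selectors like this fit all of Rook's multi-namespace configurations is exactly the open question above.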
I thought about this a little more also.

First thought: this is somewhat related to the discussions we've been having about separating Rook into a main operator and per-cluster sub-operators. My thought is that it might make the most sense to do that work first and then this work, because the sub-operator work is going to have to consider some of the same questions as here anyway.

Second thought: while Rook uses controller-runtime today, Rook was created before c-r existed. Rook primarily uses the Kubernetes API to get/update/delete resources, rather than the controller-runtime client API. This means that a lot (most?) of the API calls Rook makes go directly to Kubernetes rather than getting fast access via controller-runtime's cache. This is really non-ideal, but we haven't made moves to fix it because changing all of these API uses is a huge undertaking. While changing the API usage itself wouldn't be too hard, the different API methods require substantially different unit test setups, and the downsides are generally fairly minor (like unnecessary k8s API calls and occasional reconcile errors requiring retry).
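To make that contrast concrete, here is a rough sketch of the two call styles; the package and function names are made up for illustration and are not taken from Rook's code:

```go
// Package apistyles contrasts direct client-go calls with the
// controller-runtime cached client described above.
package apistyles

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// GetConfigMapDirect uses client-go: every call is a round trip to the API server.
func GetConfigMapDirect(ctx context.Context, cs kubernetes.Interface, ns, name string) (*corev1.ConfigMap, error) {
	return cs.CoreV1().ConfigMaps(ns).Get(ctx, name, metav1.GetOptions{})
}

// GetConfigMapCached uses the controller-runtime client, which is normally
// answered from the manager's shared informer cache for cached kinds.
func GetConfigMapCached(ctx context.Context, c client.Client, ns, name string) (*corev1.ConfigMap, error) {
	cm := &corev1.ConfigMap{}
	err := c.Get(ctx, types.NamespacedName{Namespace: ns, Name: name}, cm)
	return cm, err
}
```

The first style always costs an API-server round trip; the second is served from the shared cache for kinds the cache holds, which is the trade-off described in the comment above.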
@travisn @BlaineEXE
As you said, I'm not certain that there is a memory "leak" per se. That would imply that memory is being allocated but not de-allocated, and I'm not certain that's what is happening. Golang has good memory management and garbage collection features. I'm also not certain that the controller-runtime cache is the issue. It's a guess (I think a good guess), but I can't confirm it.

A good way to dig into this further and confirm whether the controller-runtime cache is an issue is to look at the metrics exported by controller-runtime in the Rook binary. I would hope that cache size is an available metric. If cache size is constant while the pod memory consumption increases, then we can say that it isn't the cache, as we have been predicting/guessing up to now. If the cache size is ever-increasing, then our guess may be on the right track.

Even with those metrics, it may require some tracing/profiling of the Rook Go binary to really determine and confirm what is consuming the ever-growing memory. It may require creating a debug build of Rook. It may also require us to make some code/config changes to periodically dump a list of cached items to see what items are being cached. Are there duplicates? Unlikely. Are there cached resources that Rook has no need to watch? Seems likely to me.
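For the profiling route, a debug build could use the standard net/http/pprof wiring sketched below. The port and the idea of adding this to the operator process are assumptions here, not something Rook exposes today:

```go
// A minimal way to expose Go heap profiles from a long-running process so
// memory growth can be inspected with `go tool pprof`.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	go func() {
		// With this running, heap snapshots can be pulled over time with:
		//   go tool pprof http://localhost:6060/debug/pprof/heap
		// and compared to see which allocations keep growing.
		if err := http.ListenAndServe("localhost:6060", nil); err != nil {
			log.Printf("pprof listener stopped: %v", err)
		}
	}()

	select {} // stand-in for the operator's normal run loop
}
```

Comparing heap profiles taken hours apart would show which allocations (cache entries or otherwise) account for the growth.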
Some questions @cupnes regarding the initial screenshot in the description:
Does updating to Rook v1.13.8 fix the issue?
@BlaineEXE
It is about 3.8 days.
There are 137 nodes in this k8s cluster.
Perhaps you asked about the number of pods in the namespace that contains the Rook operator discussed in this issue. That is the "ceph-poc" namespace, and it has 102 pods.
There are 4837 pods.
There are 2812 secrets.
There are 1656 configmaps.
Could you please wait a while? Some work needs to be done before we can do that.
Yes, and thanks for the additional info @cupnes. If this is an issue somehow related to the controller-runtime cache, it seems likely that it could be particularly impactful in a k8s cluster as large as yours. 137 nodes and almost 5000 pods is quite a lot.

If we suspect a cache issue, the only PR I can think of that made a noteworthy change to the number of namespaces that the Rook operator manages is this one (objectstoreusers in any namespace): #12730. While it seems unlikely that this led to the situation, it's not impossible. @travisn does the timing of that PR and the version release line up? It seems like that PR would have made it into v1.13.0, and I think it was backported to v1.12.3.
Yes, that feature was included in v1.12.3. I'm not sure either that it would be related. It didn't change any watchers fundamentally, right? It seems like it's just filtering the CRs a bit differently. @cupnes Did you only see this issue in v1.13.1 and not previously?
Have you always had these errors in your log?
@travisn
@cupnes Thank you for the background on this issue. As indicated by @BlaineEXE, it would require some memory profiling to track this down, and we may not have time to look at this soon. As a workaround, could you set memory limits on the operator pod? Then it will restart automatically whenever it reaches the limit.
I think so too. We will try memory profiling if the issue can be reproduced in an experimental environment.
Thank you for the useful information. We will consider applying that workaround.
@cupnes Do you still have the errors in the operator log?
@travisn
Is this a bug report or feature request?

Deviation from expected behavior:
Rook operator memory usage continues to increase. The graph below shows the `container_memory_working_set_bytes` metrics for the Rook operator. The metrics have been reset as the node was restarted around 10:00 on 8 April.

Expected behavior:
Memory usage does not continue to increase.

How to reproduce it (minimal and precise):

File(s) to submit:
- `cluster.yaml`, if necessary

Logs to submit:

Cluster Status to submit:
- Output of the `ceph status` command

Environment:
- Kernel (e.g. `uname -a`): 5.15.0-58-generic
- Rook version (use `rook version` inside of a Rook Pod): 1.13.1
- Storage backend version (e.g. `ceph -v`): 17.2.6
- Kubernetes version (use `kubectl version`): 1.27.10