Operator pod restarting very frequently #9132
Comments
Are the env resources constrained?
Not sure if this is the answer you're looking for, but Rook has access to a single partition which has 10Gi of free storage. None of that 10Gi is being used in the current setup. Apologies if you were talking about a different kind of resource.
No worries, I was more thinking of CPU and RAM, actually. If the operator does not have many resources available to run, then we may well hit timeouts like this.
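In case it helps, a couple of commands that show what the operator pod is requesting and actually using (namespace and label assumed to be the defaults, `rook-ceph` and `app=rook-ceph-operator`):

```sh
# Show any CPU/memory requests and limits set on the operator container
kubectl -n rook-ceph get pod -l app=rook-ceph-operator \
  -o jsonpath='{.items[0].spec.containers[0].resources}'

# Show current usage (requires metrics-server to be installed)
kubectl -n rook-ceph top pod -l app=rook-ceph-operator
```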
Aha, OK, sorry about that. The Kubernetes cluster has access to 64Gi of RAM and 8 CPU cores. Not sure if it helps, but the k3s logs at the time of the pod erroring are as follows:
Is the error always related to
Apologies for the late reply; I had a long weekend away from the computer. From what I've seen, the error is always related to
I'm not sure what would be causing this. @yuvalif have you seen anything like this during your development?
I haven't seen that, but I was testing only with a small number of OBCs.
@tg137 could you confirm how many ObjectBucketClaims (OBCs) you have created in your test cluster?
@tg137: can you also check whether the Rook cluster has the CRDs for bucket notifications or topics? Alternatively, you may be using the latest CRDs from the 1.7 branch while your Rook version is below 1.7.6, which does not have these controllers.
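A quick way to check whether those CRDs exist in the cluster (the CRD names below are assumed from the upstream `crds.yaml` rather than quoted from this thread):

```sh
# A NotFound error from either command means the CRD is missing
kubectl get crd cephbucketnotifications.ceph.rook.io
kubectl get crd cephbuckettopics.ceph.rook.io

# Or list all Rook/Ceph CRDs at once and scan for the two kinds
kubectl get crds | grep ceph.rook.io
```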
Sorry all, I have torn down this cluster now. Thanks for trying to help, but it sounds like it was potentially just a me problem. I will close for now, and if I revisit this in the future I will re-open with the info you're asking for.
I would recommend re-opening. I had this same issue when upgrading the Helm chart. Happy to provide additional details, though I don't know how to find the number of ObjectBucketClaims, for example, so I could use some guidance there.
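For anyone looking for the same thing, one way to count OBCs, assuming the OBC CRD is installed:

```sh
# Counts ObjectBucketClaims across all namespaces; prints 0 if none exist
kubectl get objectbucketclaims --all-namespaces --no-headers 2>/dev/null | wc -l
```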
Re-opening; I'm also seeing this here: #9384
I might have a repro when the scheme is missing an object. For instance, I was missing
Upon upgrading from 1.7.9 to 1.8.1, I'm encountering this issue as well. File(s) to submit:
Environment:
Yes, I believe so; it's definitely been consistent since I started looking for it late last night and into this morning. (I upgraded yesterday evening, which is when this started, as far as I know.)
I think zero. Is this what you mean by that?
Let me know if I can provide any further information or debugging steps to try. I may roll this back to 1.7.9 just to get things back into a good state, but I'll leave it as is for a few days and can always roll forward again to 1.8.1 or newer if needed.
No dice for me; same error after going from 1.7.10 to 1.8.2. File(s) to submit:
@jgilfoil would it be possible to get gists of the CRDs?
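For illustration, one way to dump CRDs into files that can be attached as gists (the specific CRDs being requested were lost from this comment; the two below are just examples):

```sh
kubectl get crd cephclusters.ceph.rook.io -o yaml > cephclusters-crd.yaml
kubectl get crd cephobjectstores.ceph.rook.io -o yaml > cephobjectstores-crd.yaml
```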
@BlaineEXE thanks for giving this attention. Based on your hint to look at the CRDs, I did some comparing and it seems the CRDs in my system aren't getting updated as expected to the v1.8.x versions. I expected Flux to pull them in after updating this line, but apparently that's not happening properly, so I need to go debug that process. I have the debug logs from the operator, and in case the CRD issue is a red herring, I'll update here, but I'm thinking this probably isn't an issue with Rook, for me at least. Thanks for your help!
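If the GitOps tooling really isn't applying them, a minimal manual workaround is to apply the release CRDs directly (release tag and path assumed; adjust to the version being installed):

```sh
# Preview the CRD changes first, then apply them; if apply complains about an
# oversized last-applied-configuration annotation, retry with --server-side
kubectl diff -f https://raw.githubusercontent.com/rook/rook/v1.8.2/deploy/examples/crds.yaml
kubectl apply -f https://raw.githubusercontent.com/rook/rook/v1.8.2/deploy/examples/crds.yaml
```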
Thanks @jgilfoil, my issue was also about CRDs not being upgraded, so my fault. I had CRDs disabled in the Helm chart and forgot about it when reading the upgrade instructions, and didn't get the hint from the error messages. I'm good with marking this resolved as well.
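For anyone who hits the same thing: the chart exposes a toggle for whether it manages the CRDs, so that either needs to be enabled or the CRDs applied separately. A sketch of the Helm invocation, with the value name assumed from the rook-ceph chart (verify against the chart version in use):

```sh
helm upgrade --install rook-ceph rook-release/rook-ceph \
  --namespace rook-ceph \
  --version v1.8.2 \
  --set crds.enabled=true
```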
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions. |
This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation. |
Is this a bug report or feature request?
Deviation from expected behavior:
Setting up Rook on one node with one OSD using `crds.yaml`, `common.yaml`, `operator.yaml` and `cluster-test.yaml` from the current master branch causes regular failures of the operator pod with the following error:

The pod then exits and is restarted. The pod runs successfully for a couple of minutes after restarting but then exits again with the above error.
Whilst the operator pod is running (i.e. after setup and before it crashes), I am able to provision storage. Whilst it is crashed, I am not able to.
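In case it helps anyone investigate, the error can be captured from the crashed container instance along these lines (namespace and deployment name assumed from the standard manifests):

```sh
# Log of the previous (crashed) operator container
kubectl -n rook-ceph logs deploy/rook-ceph-operator --previous

# Restart count confirms how often it is crash-looping
kubectl -n rook-ceph get pods -l app=rook-ceph-operator
```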
Expected behavior:
To reproduce, follow the instructions outlined here, with the exception of creating `cluster-test.yaml` instead of `cluster.yaml` at the end.

Expected to have a successful Rook deployment, including an operator that is able to run for more than 5 minutes without restarting.
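For concreteness, the quickstart steps referenced above correspond roughly to the following (repository paths assumed from the master branch layout; older release branches keep the examples under `cluster/examples/kubernetes/ceph`):

```sh
git clone --single-branch --branch master https://github.com/rook/rook.git
cd rook/deploy/examples
kubectl create -f crds.yaml -f common.yaml -f operator.yaml
# cluster-test.yaml instead of cluster.yaml, per the report above
kubectl create -f cluster-test.yaml
```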
File(s) to submit:
Environment:
- Kernel (e.g. `uname -a`):
- Rook version (use `rook version` inside of a Rook Pod):
- Storage backend version (e.g. for ceph do `ceph -v`):
- Kubernetes version (use `kubectl version`):
- Storage backend status (e.g. for Ceph use `ceph health` in the Rook Ceph toolbox): HEALTH_OK
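The details above can be gathered with commands along these lines (operator and toolbox deployment names assumed from the standard manifests):

```sh
uname -a
kubectl -n rook-ceph exec deploy/rook-ceph-operator -- rook version
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph -v
kubectl version --short
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph health
```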