
Operator pod restarting very frequently #9132

Closed

tg137 opened this issue Nov 9, 2021 · 22 comments

@tg137

tg137 commented Nov 9, 2021

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior:

Setting up Rook on one node with one OSD using crds.yaml, common.yaml, operator.yaml and cluster-test.yaml from the current master branch causes regular failures of the operator pod with the following error:

2021-11-09 14:12:58.663748 C | rookcmd: failed to run operator: gave up to run the operator manager: failed to run the controller-runtime manager: failed to wait for ceph-bucket-notification-controller caches to sync: timed out waiting for cache to be synced

The pod then exits and is restarted. The pod runs successfully for a couple of minutes after restarting but then exits again with the above error.

Whilst the operator pod is running (i.e. after setup and before it crashes), I am able to provision storage; whilst it is crashed, I am not able to.
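
For reference, a sketch of how the restarts and the crashed container's last log can be observed (the pod name below is a placeholder):

$ kubectl -n rook-ceph get pods -l app=rook-ceph-operator
$ kubectl -n rook-ceph logs <rook-ceph-operator-pod> --previous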

Expected behavior:

Expected to have a successful Rook deployment, including an operator that is able to run for more than 5 minutes without restarting.

How to reproduce it:

Follow the instructions outlined here, with the exception of creating cluster-test.yaml instead of cluster.yaml at the end.
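
For reference, a minimal sketch of those steps using the manifests named above (the directory they live in depends on the branch checked out):

$ kubectl create -f crds.yaml -f common.yaml -f operator.yaml
$ kubectl create -f cluster-test.yaml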

File(s) to submit:

Environment:

  • OS (e.g. from /etc/os-release):
NAME=NixOS
ID=nixos
VERSION="21.05.3916.3b1789322fc (Okapi)"
  • Kernel (e.g. uname -a):
Linux nixos 5.10.75 #1-NixOS SMP Wed Oct 20 09:45:06 UTC 2021 x86_64 GNU/Linux
  • Cloud provider or hardware configuration: Running locally in a single-node K3s Kubernetes
  • Rook version (use rook version inside of a Rook Pod):
rook: v1.7.0-alpha.0.589.g8cfaedc
go: go1.16.7
  • Storage backend version (e.g. for ceph do ceph -v):
ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)
  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"archive", BuildDate:"1980-01-01T00:00:00Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.0+k3s1", GitCommit:"2705431d9645d128441c578309574cd262285ae6", GitTreeState:"clean", BuildDate:"2021-09-14T00:00:00Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox):
    HEALTH_OK
@tg137 tg137 added the bug label Nov 9, 2021
@leseb
Member

leseb commented Nov 9, 2021

Are the env resources constrained?

@tg137
Author

tg137 commented Nov 9, 2021

Not sure if this is the answer you're looking for but Rook has access to a single partition which has 10Gi of free storage. None of that 10Gi is being used in the current setup.

Apologies if you were talking about a different kind of resource.

@leseb
Member

leseb commented Nov 10, 2021

No worries, I was thinking more of CPU and RAM, actually. If the operator does not have many resources available to run, we are likely to hit timeouts like this.
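
As a sketch, node CPU/RAM headroom can be checked with something like the following (the second command assumes metrics-server is installed):

$ kubectl describe nodes | grep -A 8 "Allocated resources"
$ kubectl top nodes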

@tg137
Author

tg137 commented Nov 10, 2021

Aha, OK, sorry about that.

The Kubernetes cluster has access to 64Gi of RAM and 8 CPU cores.

Not sure if it helps, but the k3s logs at the time of the pod erroring are as follows:

Nov 09 14:12:51 nixos k3s[206199]: time="2021-11-09T14:12:51.014813773Z" level=info msg="Cluster-Http-Server 2021/11/09 14:12:51 http: TLS handshake error from 10.42.0.85:45722: EOF"
Nov 09 14:12:59 nixos k3s[206199]: I1109 14:12:59.038636  206199 scope.go:111] "RemoveContainer" containerID="20db35b76efc4fba439af3be455eaa7af9bcd2566111524485667c4deeb0b1d2"
Nov 09 14:12:59 nixos k3s[206199]: I1109 14:12:59.038881  206199 scope.go:111] "RemoveContainer" containerID="1c00f752fa9728b01e10a6efde168c23814da41c3218389ec5466866f3fc7e92"
Nov 09 14:12:59 nixos k3s[206199]: E1109 14:12:59.039100  206199 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"rook-ceph-operator\" with CrashLoopBackOff: \"back-off 2m40s restarting failed container=rook-ceph-operator pod=rook-ce>
Nov 09 14:12:59 nixos k3s[206199]: I1109 14:12:59.666424  206199 trace.go:205] Trace[375646972]: "Get" url:/api/v1/namespaces/rook-ceph/pods/rook-ceph-operator-76dc868c4b-d54ls/log,user-agent:kubectl/v1.21.1 (linux/amd64) kubernetes/5e58841,client:127.0.0.1,accept:applicati>
Nov 09 14:12:59 nixos k3s[206199]: Trace[375646972]: ---"Transformed response object" 34276ms (14:12:00.666)
Nov 09 14:12:59 nixos k3s[206199]: Trace[375646972]: [34.278046136s] [34.278046136s] END

@BlaineEXE
Member

Is the error always related to ceph-bucket-notification-controller?

@tg137
Author

tg137 commented Nov 15, 2021

Apologies for the late reply, I had a long weekend away from the computer.

From what I've seen, the error is always related to ceph-bucket-notification-controller.

@BlaineEXE
Member

I'm not sure what would be causing this. @yuvalif have you seen anything like this during your development?

@yuvalif
Contributor

yuvalif commented Nov 15, 2021

I'm not sure what would be causing this. @yuvalif have you seen anything like this during your development?

I haven't seen that, but I was testing only with a small number of OBCs.
I will try to reproduce with a large number.
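
As a sketch, a batch of OBCs could be generated with a loop like the following (the storage class name rook-ceph-bucket is an assumption; substitute whichever bucket storage class exists in the cluster):

for i in $(seq 1 100); do
  cat <<EOF | kubectl apply -f -
apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
  name: test-obc-$i
spec:
  generateBucketName: test-obc-$i
  storageClassName: rook-ceph-bucket
EOF
done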

@BlaineEXE
Member

@tg137 could you confirm how many ObjectBucketClaims (OBCs) you have created in your test cluster?

@thotz
Contributor

thotz commented Nov 23, 2021

@tg137: can you also check whether the Rook cluster has the CRDs for bucket notifications or topics? Or you may be using the latest CRDs from the 1.7 branch while your Rook version is older than 1.7.6, which does not have these controllers.
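
A sketch of that check (the CRD names below are the bucket-notification ones added for this feature; confirm against the installed manifests):

$ kubectl get crd | grep -E 'cephbucketnotifications|cephbuckettopics'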

@tg137
Author

tg137 commented Nov 23, 2021

Sorry all, I have torn down this cluster now. Thanks for trying to help, but it sounds like it was potentially just a problem on my end. I will close this for now; if I revisit it in the future, I will re-open with the info you're asking for.

@allenporter

I would recommend re-opening.

I had this same issue when upgrading the Helm chart from 1.7.9 to 1.8.0. My impression was that the Helm chart handles CRD upgrades.

Happy to provide additional details, though I don't know how to find out the number of ObjectBucketClaims, for example, so I could use some guidance there.

@leseb
Member

leseb commented Dec 13, 2021

Re-opening; I'm also seeing this here: #9384

@leseb leseb reopened this Dec 13, 2021
@travisn travisn added this to To do in v1.8 via automation Dec 13, 2021
@leseb
Member

leseb commented Dec 13, 2021

I might have a repro when the scheme is missing an object. For instance, I was missing CephFilesystemSubVolumeGroupList.

jgilfoil added a commit to jgilfoil/k8s-gitops that referenced this issue Dec 21, 2021
trying to fix operator crashes, see rook/rook#9132
@jgilfoil

Upon upgrading from 1.7.9 to 1.8.1, I'm encountering this issue as well.

File(s) to submit:

Environment:

  • OS (e.g. from /etc/os-release):
NAME="Ubuntu"
VERSION="20.04.3 LTS (Focal Fossa)"
ID=ubuntu
VERSION_ID="20.04"
  • Kernel (e.g. uname -a):
Linux odroid-01 5.4.0-91-generic #102-Ubuntu SMP Fri Nov 5 16:31:28 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
  • Cloud provider or hardware configuration: Running locally on 3x x86 SBC's (32gb memory, 4 cores each)
  • Rook version (use rook version inside of a Rook Pod):
rook: v1.8.1
go: go1.16.12
  • Storage backend version (e.g. for ceph do ceph -v):
ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)
  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.5", GitCommit:"aea7bbadd2fc0cd689de94a54e5b7b758869d691", GitTreeState:"clean", BuildDate:"2021-09-15T21:10:45Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.8+k3s1", GitCommit:"cbff7350ecb68bf3070deb30f55611a37c42f470", GitTreeState:"clean", BuildDate:"2021-12-18T01:11:18Z", GoVersion:"go1.16.10", Compiler:"gc", Platform:"linux/amd64"}
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox):
  cluster:
    id:     78b91dcd-c6f6-416d-8a9d-63d9d3ed62be
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 10h)
    mgr: a(active, since 11h)
    osd: 3 osds: 3 up (since 11h), 3 in (since 4w)

  data:
    pools:   9 pools, 120 pgs
    objects: 25.83k objects, 99 GiB
    usage:   297 GiB used, 2.4 TiB / 2.7 TiB avail
    pgs:     120 active+clean

  io:
    client:   6.7 KiB/s wr, 0 op/s rd, 0 op/s wr

Is the error always related to ceph-bucket-notification-controller?

Yes, I believe so; it's definitely been consistent since I started looking for it late last night and into this morning. (I upgraded yesterday evening, which is when this started, as far as I know.)

could you confirm how many ObjectBucketClaims (OBCs) you have created in your test cluster?

I think zero; is this what you mean by that?

$ kubectl get ObjectBucketClaims --all-namespaces
No resources found

Let me know if I can provide any further information or debugging steps to try. I may roll this back to 1.7.9 just to get things back into a good state, but I'll leave it as is for a few days and can always roll forward again to 1.8.1 or newer if needed.

anthr76 added a commit to anthr76/infra that referenced this issue Jan 5, 2022
…532)"

This reverts commit eebccd5.

rook/rook#9132
allenporter/k8s-gitops#445

Signed-off-by: Anthony Rabbito <hello@anthonyrabbito.com>
@travisn
Member

travisn commented Jan 5, 2022

After discussing with @leseb, we cannot repro on the latest master; it may be fixed by #9384 in v1.8.2 (targeted for tomorrow).

@BlaineEXE
Member

@jgilfoil would it be possible to get gists of the CRDs (kubectl describe crds) and the operator logs with ROOK_LOG_LEVEL=DEBUG? I see that in your case it's the BucketTopic that is having trouble syncing, but I'm not sure why that would be. I'm wondering if there could be some issue with the new CRDs or with something in the new controllers; I hope this info can help us troubleshoot.
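
A sketch of collecting that, assuming the default rook-ceph namespace (setting the env var directly on the deployment is one way to flip the log level; operator.yaml and the Helm values expose the same setting):

$ kubectl describe crds > rook-crds.txt
$ kubectl -n rook-ceph set env deployment/rook-ceph-operator ROOK_LOG_LEVEL=DEBUG
$ kubectl -n rook-ceph logs deploy/rook-ceph-operator > rook-operator-debug.log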

@jgilfoil

jgilfoil commented Jan 14, 2022

@BlaineEXE thanks for giving this attention. Based on your hint to look at the CRDs, I did some comparing, and it seems the CRDs in my system aren't getting updated to the v1.8.x versions as expected. I expected Flux to pull them in after updating this line, but apparently that isn't happening properly, so I need to go debug that process. I have the debug logs from the operator, and in case the CRD issue is a red herring, I'll update here, but I'm thinking this probably isn't an issue with Rook, for me at least. Thanks for your help!
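
In case it helps anyone else in the same spot, a sketch of applying the updated CRDs by hand while the GitOps pipeline is being debugged (the URL assumes the v1.8.1 release layout, where the example manifests live under deploy/examples):

$ kubectl apply -f https://raw.githubusercontent.com/rook/rook/v1.8.1/deploy/examples/crds.yaml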

@allenporter

Thanks @jgilfoil, my issue was also about CRDs not being upgraded, so my fault. I had CRDs disabled in the Helm chart and forgot about it when reading the upgrade instructions, and I didn't get the hint from the error messages.

I'm good with marking resolved also.
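
For anyone else who hit this, a sketch of re-enabling CRD management in the chart (assuming the rook-release/rook-ceph chart, whose crds.enabled value controls whether the chart installs and upgrades the CRDs):

$ helm upgrade rook-ceph rook-release/rook-ceph --namespace rook-ceph --set crds.enabled=true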

@travisn travisn removed this from To do in v1.8 Jan 20, 2022
@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

@github-actions

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.
