
Operator pod restarting very frequently #9132

Closed

tg137 opened this issue Nov 9, 2021 · 22 comments

@tg137

tg137 commented Nov 9, 2021

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior:

Setting up Rook on one node with one OSD using crds.yaml, common.yaml, operator.yaml and cluster-test.yaml from the current master branch causes regular failures of the operator pod with the following error:

2021-11-09 14:12:58.663748 C | rookcmd: failed to run operator: gave up to run the operator manager: failed to run the controller-runtime manager: failed to wait for ceph-bucket-notification-controller caches to sync: timed out waiting for cache to be synced

The pod then exits and is restarted. The pod runs successfully for a couple of minutes after restarting but then exits again with the above error.

Whilst the operator pod is running (i.e. after setup and before it crashes), I am able to provision storage; whilst it is crashed, I am not able to.
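
For reference, a sketch of how the restarts and the crashed container's last log can be observed (the pod name below is a placeholder):

$ kubectl -n rook-ceph get pods -l app=rook-ceph-operator
$ kubectl -n rook-ceph logs <rook-ceph-operator-pod> --previous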

Expected behavior:

Expected to have a successful Rook deployment, including an operator that is able to run for more than 5 minutes without restarting.

How to reproduce it:

Follow the instructions outlined here, with the exception of creating cluster-test.yaml instead of cluster.yaml at the end.
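
For reference, a minimal sketch of those steps using the manifests named above (the directory they live in depends on the branch checked out):

$ kubectl create -f crds.yaml -f common.yaml -f operator.yaml
$ kubectl create -f cluster-test.yaml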

File(s) to submit:

Environment:

  • OS (e.g. from /etc/os-release):
NAME=NixOS
ID=nixos
VERSION="21.05.3916.3b1789322fc (Okapi)"
  • Kernel (e.g. uname -a):
Linux nixos 5.10.75 #1-NixOS SMP Wed Oct 20 09:45:06 UTC 2021 x86_64 GNU/Linux
  • Cloud provider or hardware configuration: Running locally in a single-node K3s Kubernetes
  • Rook version (use rook version inside of a Rook Pod):
rook: v1.7.0-alpha.0.589.g8cfaedc
go: go1.16.7
  • Storage backend version (e.g. for ceph do ceph -v):
ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)
  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"archive", BuildDate:"1980-01-01T00:00:00Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.0+k3s1", GitCommit:"2705431d9645d128441c578309574cd262285ae6", GitTreeState:"clean", BuildDate:"2021-09-14T00:00:00Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox):
    HEALTH_OK
@tg137 tg137 added the bug label Nov 9, 2021
@leseb
Member

leseb commented Nov 9, 2021

Are the env resources constrained?

@tg137
Author

tg137 commented Nov 9, 2021

Not sure if this is the answer you're looking for but Rook has access to a single partition which has 10Gi of free storage. None of that 10Gi is being used in the current setup.

Apologies if you were talking about a different kind of resource.

@leseb
Member

leseb commented Nov 10, 2021

No worries, I was thinking more of CPU and RAM, actually. If the operator does not have many resources available to run, we are likely to hit timeouts like this.
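
As a sketch, node CPU/RAM headroom can be checked with something like the following (the second command assumes metrics-server is installed):

$ kubectl describe nodes | grep -A 8 "Allocated resources"
$ kubectl top nodes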

@tg137
Author

tg137 commented Nov 10, 2021

Aha, OK, sorry about that.

The Kubernetes cluster has access to 64Gi of RAM and 8 CPU cores.

Not sure if it helps, but the k3s logs at the time of the pod erroring are as follows:

Nov 09 14:12:51 nixos k3s[206199]: time="2021-11-09T14:12:51.014813773Z" level=info msg="Cluster-Http-Server 2021/11/09 14:12:51 http: TLS handshake error from 10.42.0.85:45722: EOF"
Nov 09 14:12:59 nixos k3s[206199]: I1109 14:12:59.038636  206199 scope.go:111] "RemoveContainer" containerID="20db35b76efc4fba439af3be455eaa7af9bcd2566111524485667c4deeb0b1d2"
Nov 09 14:12:59 nixos k3s[206199]: I1109 14:12:59.038881  206199 scope.go:111] "RemoveContainer" containerID="1c00f752fa9728b01e10a6efde168c23814da41c3218389ec5466866f3fc7e92"
Nov 09 14:12:59 nixos k3s[206199]: E1109 14:12:59.039100  206199 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"rook-ceph-operator\" with CrashLoopBackOff: \"back-off 2m40s restarting failed container=rook-ceph-operator pod=rook-ce>
Nov 09 14:12:59 nixos k3s[206199]: I1109 14:12:59.666424  206199 trace.go:205] Trace[375646972]: "Get" url:/api/v1/namespaces/rook-ceph/pods/rook-ceph-operator-76dc868c4b-d54ls/log,user-agent:kubectl/v1.21.1 (linux/amd64) kubernetes/5e58841,client:127.0.0.1,accept:applicati>
Nov 09 14:12:59 nixos k3s[206199]: Trace[375646972]: ---"Transformed response object" 34276ms (14:12:00.666)
Nov 09 14:12:59 nixos k3s[206199]: Trace[375646972]: [34.278046136s] [34.278046136s] END

@BlaineEXE
Member

Is the error always related to ceph-bucket-notification-controller?

@tg137
Author

tg137 commented Nov 15, 2021

Apologies for the late reply, I had a long weekend away from the computer.

From what I've seen, the error is always related to ceph-bucket-notification-controller.

@BlaineEXE
Member

I'm not sure what would be causing this. @yuvalif have you seen anything like this during your development?

@yuvalif
Contributor

yuvalif commented Nov 15, 2021

I'm not sure what would be causing this. @yuvalif have you seen anything like this during your development?

I haven't seen that, but I was testing only with a small number of OBCs.
I will try to reproduce with a large number.
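
As a sketch, a batch of OBCs could be generated with a loop like the following (the storage class name rook-ceph-bucket is an assumption; substitute whichever bucket storage class exists in the cluster):

for i in $(seq 1 100); do
  cat <<EOF | kubectl apply -f -
apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
  name: test-obc-$i
spec:
  generateBucketName: test-obc-$i
  storageClassName: rook-ceph-bucket
EOF
done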

@BlaineEXE
Member

@tg137 could you confirm how many ObjectBucketClaims (OBCs) you have created in your test cluster?

@thotz
Contributor

thotz commented Nov 23, 2021

@tg137: can you also check whether the Rook cluster has the CRDs for bucket notifications or topics? Or you may be using the latest CRDs from the 1.7 branch while your Rook version is older than 1.7.6, which does not have these controllers.
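
A sketch of that check (the CRD names below are the bucket-notification ones added for this feature; confirm against the installed manifests):

$ kubectl get crd | grep -E 'cephbucketnotifications|cephbuckettopics'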

@tg137
Author

tg137 commented Nov 23, 2021

Sorry all, I have torn down this cluster now. Thanks for trying to help, but it sounds like it was potentially just a problem on my end. I will close this for now; if I revisit it in the future, I will re-open with the info you're asking for.

@allenporter

I would recommend re-opening.

I had this same issue when upgrading the Helm chart from 1.7.9 to 1.8.0. My impression was that the Helm chart handles CRD upgrades.

Happy to provide additional details, though I don't know how to find out the number of ObjectBucketClaims, for example, so I could use some guidance there.

@leseb
Member

leseb commented Dec 13, 2021

Re-opening; I'm also seeing this here: #9384

@leseb leseb reopened this Dec 13, 2021
@travisn travisn added this to To do in v1.8 via automation Dec 13, 2021
@leseb
Member

leseb commented Dec 13, 2021

I might have a repro when the scheme is missing an object. For instance, I was missing CephFilesystemSubVolumeGroupList.

jgilfoil added a commit to jgilfoil/k8s-gitops that referenced this issue Dec 21, 2021
trying to fix operator crashes, see rook/rook#9132
@jgilfoil

Upon upgrading from 1.7.9 to 1.8.1, I'm encountering this issue as well.

File(s) to submit:

Environment:

  • OS (e.g. from /etc/os-release):
NAME="Ubuntu"
VERSION="20.04.3 LTS (Focal Fossa)"
ID=ubuntu
VERSION_ID="20.04"
  • Kernel (e.g. uname -a):
Linux odroid-01 5.4.0-91-generic #102-Ubuntu SMP Fri Nov 5 16:31:28 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
  • Cloud provider or hardware configuration: Running locally on 3x x86 SBC's (32gb memory, 4 cores each)
  • Rook version (use rook version inside of a Rook Pod):
rook: v1.8.1
go: go1.16.12
  • Storage backend version (e.g. for ceph do ceph -v):
ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)
  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.5", GitCommit:"aea7bbadd2fc0cd689de94a54e5b7b758869d691", GitTreeState:"clean", BuildDate:"2021-09-15T21:10:45Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.8+k3s1", GitCommit:"cbff7350ecb68bf3070deb30f55611a37c42f470", GitTreeState:"clean", BuildDate:"2021-12-18T01:11:18Z", GoVersion:"go1.16.10", Compiler:"gc", Platform:"linux/amd64"}
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox):
  cluster:
    id:     78b91dcd-c6f6-416d-8a9d-63d9d3ed62be
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 10h)
    mgr: a(active, since 11h)
    osd: 3 osds: 3 up (since 11h), 3 in (since 4w)

  data:
    pools:   9 pools, 120 pgs
    objects: 25.83k objects, 99 GiB
    usage:   297 GiB used, 2.4 TiB / 2.7 TiB avail
    pgs:     120 active+clean

  io:
    client:   6.7 KiB/s wr, 0 op/s rd, 0 op/s wr

Is the error always related to ceph-bucket-notification-controller?

Yes, I believe so; it's definitely been consistent since I started looking for it late last night and into this morning. (I upgraded yesterday evening, which is when this started, as far as I know.)

could you confirm how many ObjectBucketClaims (OBCs) you have created in your test cluster?

I think zero; is this what you mean by that?

$ kubectl get ObjectBucketClaims --all-namespaces
No resources found

Let me know if I can provide any further information or debugging steps to try. I may roll this back to 1.7.9 just to get things back into a good state, but I'll leave it as is for a few days and can always roll forward again to 1.8.1 or newer if needed.

anthr76 added a commit to anthr76/infra that referenced this issue Jan 5, 2022
…532)"

This reverts commit eebccd5.

rook/rook#9132
allenporter/k8s-gitops#445

Signed-off-by: Anthony Rabbito <hello@anthonyrabbito.com>
@travisn
Member

travisn commented Jan 5, 2022

After discussing with @leseb, we cannot repro on the latest master; it may be fixed by #9384 in v1.8.2 (targeted for tomorrow).

@BlaineEXE
Member

@jgilfoil would it be possible to get gists of the CRDs (kubectl describe crds) and the operator logs with ROOK_LOG_LEVEL=DEBUG? I see that in your case it's the BucketTopic that is having trouble syncing, but I'm not sure why that would be. I'm wondering if there could be some issue with the new CRDs or with something in the new controllers; I hope this info can help us troubleshoot.
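
A sketch of collecting that, assuming the default rook-ceph namespace (setting the env var directly on the deployment is one way to flip the log level; operator.yaml and the Helm values expose the same setting):

$ kubectl describe crds > rook-crds.txt
$ kubectl -n rook-ceph set env deployment/rook-ceph-operator ROOK_LOG_LEVEL=DEBUG
$ kubectl -n rook-ceph logs deploy/rook-ceph-operator > rook-operator-debug.log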

@jgilfoil

jgilfoil commented Jan 14, 2022

@BlaineEXE thanks for giving this attention. Based on your hint to look at the CRDs, I did some comparing, and it seems the CRDs in my system aren't getting updated to the v1.8.x versions as expected. I expected Flux to pull them in after updating this line, but apparently that isn't happening properly, so I need to go debug that process. I have the debug logs from the operator, and in case the CRD issue is a red herring, I'll update here, but I'm thinking this probably isn't an issue with Rook, for me at least. Thanks for your help!
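
In case it helps anyone else in the same spot, a sketch of applying the updated CRDs by hand while the GitOps pipeline is being debugged (the URL assumes the v1.8.1 release layout, where the example manifests live under deploy/examples):

$ kubectl apply -f https://raw.githubusercontent.com/rook/rook/v1.8.1/deploy/examples/crds.yaml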

@allenporter

Thanks @jgilfoil, my issue was also about CRDs not being upgraded, so my fault. I had CRDs disabled in the Helm chart and forgot about it when reading the upgrade instructions, and I didn't get the hint from the error messages.

I'm good with marking resolved also.
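
For anyone else who hit this, a sketch of re-enabling CRD management in the chart (assuming the rook-release/rook-ceph chart, whose crds.enabled value controls whether the chart installs and upgrades the CRDs):

$ helm upgrade rook-ceph rook-release/rook-ceph --namespace rook-ceph --set crds.enabled=true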

@travisn travisn removed this from To do in v1.8 Jan 20, 2022
@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

@github-actions

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.
