ceph_assert error on rgw start in rook-ceph-rgw-ceph-objectstore pod #13614
Comments
Lines 839:840 of rgw_amqp.cc:
If I'm not completely wrong, setting the thread's name there results in a non-zero return code (AFAICS only possible when the name exceeds the allowed length).
For the rgw crash, a Ceph tracker would be the right place to open the issue so the core rgw team can take a look.

@subhamkrai Could you take a look at these messages? It seems odd that the failure domain is being updated from "host" to "host".
@yuvalif have you seen any similar issues in ceph v18?
This was exactly my finding, too. As the name to be set is a fixed one, definitely not exceeding the length limit, the implementation of `ceph_pthread_setname` (AFAICS provided by `compat`) must return a non-zero code for reasons beyond those described in the (POSIX?) interface/spec. Unfortunately I failed to find the related source code to check for this.
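For reference, a minimal sketch of the pattern under discussion, under the assumption that `ceph_pthread_setname` on Linux essentially forwards to glibc's `pthread_setname_np`; the wrapper name `compat_pthread_setname` and the thread name `amqp_manager` are illustrative here, not taken from the actual Ceph source:

```cpp
// Illustrative sketch only -- not the actual Ceph code.
#include <pthread.h>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical stand-in for the compat wrapper (assumption: on Linux it
// reduces to pthread_setname_np).
static int compat_pthread_setname(pthread_t t, const char* name) {
  // glibc documents ERANGE for names longer than 15 characters, but a
  // non-zero code can also come back for other reasons, e.g. when naming
  // a foreign thread fails at the /proc/self/task/<tid>/comm level.
  return pthread_setname_np(t, name);
}

int main() {
  std::thread runner([] { std::this_thread::sleep_for(std::chrono::seconds(1)); });

  // Same shape as the failing check around rgw_amqp.cc:840: a short,
  // fixed name, yet the process aborts if rc is non-zero for any reason.
  const int rc = compat_pthread_setname(runner.native_handle(), "amqp_manager");
  assert(rc == 0);  // ceph_assert(rc == 0) likewise aborts on failure

  runner.join();
  return 0;
}
```

(Build with `g++ -pthread`.) The point is that the assert treats any non-zero return as fatal, including error codes the spec does not enumerate.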
@travisn: yes, of course. I tried to submit to ceph, too, but it seems they are fighting spam: my users-list post and a tracker user account are awaiting (manual) admin approval. I will link it then.
I finally got access and reported there, too: https://tracker.ceph.com/issues/64305
Thanks! Will look into that.
I'm also seeing this with one of our clusters. I'd upgraded from rook 1.12.8 to 1.13.3 and ceph 17.2.6 to 18.2.1. Oddly, I'd upgraded 2 other clusters with the same configuration and they both went smoothly. The third cluster hit this case and I can't seem to get it out. We don't use the objectstore on this cluster, so I've completely deleted and re-created it, but the error still occurs. The rest of the rook-ceph cluster deployment appears healthy; only the objectstore pod is crash looping. This k8s cluster has only 1 control plane, where the other two had 3 control planes. Perhaps that's related? All are RKE2 k8s 1.27.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.
This still happens with rook 1.14.3 and ceph 18.2.2. I have only seen it happen on clusters with 1 control plane.
Which OS version is being used there?
In my case, CentOS 7 with kernel version 3.10.0 (I know...). What's odd is that we don't see this in Kubernetes clusters with multiple (3+) control planes. It has happened on 2 small clusters with a single control plane.
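If it helps narrow this down, here is a small diagnostic one could run on an affected host to see which error `pthread_setname_np` actually reports there; a sketch only, with `probe_thread` as an arbitrary name:

```cpp
// Diagnostic sketch: print what pthread_setname_np returns on this host.
#include <pthread.h>
#include <cstdio>
#include <cstring>
#include <chrono>
#include <thread>

int main() {
  std::thread t([] { std::this_thread::sleep_for(std::chrono::seconds(2)); });

  // Naming a *different* thread goes through /proc/self/task/<tid>/comm on
  // Linux, so kernel/procfs behavior can influence the result.
  int rc = pthread_setname_np(t.native_handle(), "probe_thread");
  std::printf("pthread_setname_np -> %d (%s)\n", rc,
              rc == 0 ? "OK" : std::strerror(rc));

  char name[16] = {0};
  rc = pthread_getname_np(t.native_handle(), name, sizeof(name));
  std::printf("pthread_getname_np -> %d, name=\"%s\"\n", rc, name);

  t.join();
  return 0;
}
```

Compile with `g++ -pthread probe.cc -o probe`; a non-zero code on the CentOS 7 / kernel 3.10 nodes (and zero on the healthy ones) would support the kernel theory.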
Is this a bug report or feature request?
Deviation from expected behavior:
After a cluster restart, the rook-ceph-rgw-ceph-objectstore-xxx pod was found in CrashLoopBackOff. All crash logs since then look the same: an assertion failure is reported:
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.1/rpm/el8/BUILD/ceph-18.2.1/src/rgw/rgw_amqp.cc: 840: FAILED ceph_assert(rc==0)
I first assumed some corruption and tried to fix it with pod re-creations, both without and with removing all(?) rgw/objectstore resources (pools, zones, zonegroups, etc., at least once including the .rgw.root pool). The symptoms never changed. (The objectstore was unused.)
Expected behavior:
rook-ceph-rgw-ceph-objectstore-xxx starts nominally.
How to reproduce it (minimal and precise):
The cluster runs in a microk8s (v1.25) cluster of 4 bare-metal nodes (48 cores, 384 GB RAM each), each containing 4x 4 TB SSDs; on every node, 2 of the SSDs hold one partition configured as an OSD for Ceph.
Ceph was installed using helm:
(no idea if this triggers the problem again)
File(s) to submit:
rook-ceph-operator-values.yaml:
rook-ceph-cluster-values.yaml:
Logs to submit:
Ceph crash info:
The operator log repeatedly shows blocks like this, ultimately ending in a failure to contact the rgw service, which is expected as it never came up:
rook-ceph-rgw-ceph-objectstore-a-xxx (complete in attached log file):
Cluster Status to submit:
Environment:
OS: 7.9
Kernel: 3.10.0-1160.105.1.el7.x86_64 #1 SMP Thu Dec 7 15:39:45 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Cloud provider or hardware configuration: bare metal, see above
Rook version: v1.13.2 / go: go1.21.5
Storage backend version: ceph version 18.2.1 (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)
Kubernetes version: Client Version: v1.25.16 / Kustomize Version: v4.5.7 / Server Version: v1.25.16
Kubernetes cluster type: microk8s v1.25.16 revision 6254