ceph_assert error on rgw start in rook-ceph-rgw-ceph-objectstore pod #13614

Closed
sta4152 opened this issue Jan 23, 2024 · 13 comments

sta4152 commented Jan 23, 2024

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior:

After a cluster restart, the rook-ceph-rgw-ceph-objectstore-xxx pod was found in CrashLoopBackOff. All crash logs since then look the same: an assertion failure is reported at
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.1/rpm/el8/BUILD/ceph-18.2.1/src/rgw/rgw_amqp.cc: 840: FAILED ceph_assert(rc==0)

I first assumed some corruption and tried to fix it by re-creating the pod, both without and with removing all(?) rgw/objectstore resources (pools, zones, zonegroups, etc., at least once including the .rgw.root pool). The symptoms never changed.

(The object store was unused.)

Expected behavior:

rook-ceph-rgw-ceph-objectstore-xxx starts nominally.

How to reproduce it (minimal and precise):

The cluster runs on microk8s (v1.25) across 4 bare-metal nodes (48 cores, 384 GB RAM each). Each node has 4x 4 TB SSDs, two of which contain one partition each configured as OSDs for Ceph.

Ceph was installed using helm:

helm repo add rook-release https://charts.rook.io/release
helm repo update

helm upgrade --install rook-ceph \
    --namespace rook-ceph \
    --create-namespace \
    --values rook-ceph-operator-values.yaml \
    rook-release/rook-ceph

helm upgrade --install rook-ceph-cluster \
    --namespace rook-ceph \
    --create-namespace \
    --values rook-ceph-cluster-values.yaml \
    rook-release/rook-ceph-cluster

kubectl -n rook-ceph exec -it $(kubectl get pod -n rook-ceph -o name | grep ceph-tools) -- /bin/bash
ceph mgr module enable rook
ceph orch set backend rook
exit

(I have no idea whether these steps reproduce the problem.)

File(s) to submit:

  • Cluster CR

rook-ceph-operator-values.yaml:

csi:
  kubeletDirPath: "/var/snap/microk8s/common/var/lib/kubelet"

rook-ceph-cluster-values.yaml:

storage:
  useAllNodes: false
  useAllDevices: false
  config:
    osdsPerDevice: 4
    metadataDevice: "/dev/sdc1"
  nodes:
    - name: "hpc-srv1.xxx.yyy"
      devices:
        - name: "/dev/sdb1"
    - name: "hpc-srv2.xxx.yyy"
      devices:
        - name: "/dev/sdb1"
    - name: "hpc-srv3.xxx.yyy"
      devices:
        - name: "/dev/sdb1"
    - name: "hpc-srv4.xxx.yyy"
      devices:
        - name: "/dev/sdb1"

toolbox:
  enabled: true

Logs to submit:

Ceph crash info:

bash-4.4$ ceph crash info 2024-01-23T13:09:25.023127Z_f9c5ce06-f7c6-4a2c-9504-a3f30fa27794
{
    "assert_condition": "rc==0",
    "assert_file": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.1/rpm/el8/BUILD/ceph-18.2.1/src/rgw/rgw_amqp.cc",
    "assert_func": "rgw::amqp::Manager::Manager(size_t, size_t, size_t, long int, unsigned int, unsigned int, ceph::common::CephContext*)",
    "assert_line": 840,
    "assert_msg": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.1/rpm/el8/BUILD/ceph-18.2.1/src/rgw/rgw_amqp.cc: In function 'rgw::amqp::Manager::Manager(size_t, size_t, size_t, long int, unsigned int, unsigned int, ceph::common::CephContext*)' thread 7f81c5804a80 time 2024-01-23T13:09:25.016871+0000\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.1/rpm/el8/BUILD/ceph-18.2.1/src/rgw/rgw_amqp.cc: 840: FAILED ceph_assert(rc==0)\n",
    "assert_thread_name": "radosgw",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12d20) [0x7f81cacc1d20]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x18f) [0x7f81cdc09e6f]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x2a9fdb) [0x7f81cdc09fdb]",
        "(rgw::amqp::init(ceph::common::CephContext*)+0x261) [0x55e4aa4a3961]",
        "(rgw::AppMain::init_notification_endpoints()+0x38) [0x55e4a9e00d98]",
        "main()",
        "__libc_start_main()",
        "_start()"
    ],
    "ceph_version": "18.2.1",
    "crash_id": "2024-01-23T13:09:25.023127Z_f9c5ce06-f7c6-4a2c-9504-a3f30fa27794",
    "entity_name": "client.rgw.ceph.objectstore.a",
    "os_id": "centos",
    "os_name": "CentOS Stream",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "radosgw",
    "stack_sig": "e082d3eaabfd1f88602d9630b283a2e38f24470b2a2ab68506ca57702cc1bcc2",
    "timestamp": "2024-01-23T13:09:25.023127Z",
    "utsname_hostname": "rook-ceph-rgw-ceph-objectstore-a-6c95458fb-gcb4d",
    "utsname_machine": "x86_64",
    "utsname_release": "3.10.0-1160.105.1.el7.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Thu Dec 7 15:39:45 UTC 2023"
}
  • Operator's logs

The operator log repeatedly shows blocks like this, ending with a failure to contact the rgw service, which is expected since it never came up:

2024-01-23 12:55:14.652586 I | ceph-spec: parsing mon endpoints: b=10.152.183.153:6789,a=10.152.183.228:6789,d=10.152.183.75:6789
2024-01-23 12:55:14.652655 I | ceph-spec: detecting the ceph image version for image quay.io/ceph/ceph:v18.2.1...
2024-01-23 12:55:18.411951 I | ceph-spec: detected ceph image version: "18.2.1-0 reef"
2024-01-23 12:55:29.360660 I | ceph-object-controller: reconciling object store deployments
2024-01-23 12:55:29.407902 I | ceph-object-controller: ceph object store gateway service running at 10.152.183.71
2024-01-23 12:55:29.407931 I | ceph-object-controller: reconciling object store pools
2024-01-23 12:55:34.581398 I | cephclient: application "rook-ceph-rgw" is already set on pool "ceph-objectstore.rgw.control"
2024-01-23 12:55:34.581429 I | cephclient: reconciling replicated pool ceph-objectstore.rgw.control succeeded
2024-01-23 12:55:38.251090 I | cephclient: creating a new crush rule for changed deviceClass on crush rule "ceph-objectstore.rgw.control_host"
2024-01-23 12:55:38.251129 I | cephclient: updating pool "ceph-objectstore.rgw.control" failure domain from "host" to "host" with new crush rule "ceph-objectstore.rgw.control_host"
2024-01-23 12:55:38.251143 I | cephclient: crush rule "ceph-objectstore.rgw.control_host" will no longer be used by pool "ceph-objectstore.rgw.control"
2024-01-23 12:55:46.835021 I | cephclient: Successfully updated pool "ceph-objectstore.rgw.control" failure domain to "host"
2024-01-23 12:55:46.835058 I | cephclient: setting pool property "pg_num_min" to "8" on pool "ceph-objectstore.rgw.control"
2024-01-23 12:55:54.964801 I | cephclient: application "rook-ceph-rgw" is already set on pool "ceph-objectstore.rgw.meta"
2024-01-23 12:55:54.964830 I | cephclient: reconciling replicated pool ceph-objectstore.rgw.meta succeeded
2024-01-23 12:55:58.541081 I | cephclient: creating a new crush rule for changed deviceClass on crush rule "ceph-objectstore.rgw.meta_host"
2024-01-23 12:55:58.541117 I | cephclient: updating pool "ceph-objectstore.rgw.meta" failure domain from "host" to "host" with new crush rule "ceph-objectstore.rgw.meta_host"
2024-01-23 12:55:58.541131 I | cephclient: crush rule "ceph-objectstore.rgw.meta_host" will no longer be used by pool "ceph-objectstore.rgw.meta"
2024-01-23 12:56:02.665487 I | cephclient: Successfully updated pool "ceph-objectstore.rgw.meta" failure domain to "host"
2024-01-23 12:56:02.665524 I | cephclient: setting pool property "pg_num_min" to "8" on pool "ceph-objectstore.rgw.meta"
2024-01-23 12:56:09.834902 I | cephclient: application "rook-ceph-rgw" is already set on pool "ceph-objectstore.rgw.log"
2024-01-23 12:56:09.834935 I | cephclient: reconciling replicated pool ceph-objectstore.rgw.log succeeded
2024-01-23 12:56:16.150305 I | cephclient: creating a new crush rule for changed deviceClass on crush rule "ceph-objectstore.rgw.log_host"
2024-01-23 12:56:16.150339 I | cephclient: updating pool "ceph-objectstore.rgw.log" failure domain from "host" to "host" with new crush rule "ceph-objectstore.rgw.log_host"
2024-01-23 12:56:16.150381 I | cephclient: crush rule "ceph-objectstore.rgw.log_host" will no longer be used by pool "ceph-objectstore.rgw.log"
2024-01-23 12:56:20.086484 I | cephclient: Successfully updated pool "ceph-objectstore.rgw.log" failure domain to "host"
2024-01-23 12:56:20.086517 I | cephclient: setting pool property "pg_num_min" to "8" on pool "ceph-objectstore.rgw.log"
2024-01-23 12:56:27.485478 I | cephclient: application "rook-ceph-rgw" is already set on pool "ceph-objectstore.rgw.buckets.index"
2024-01-23 12:56:27.485508 I | cephclient: reconciling replicated pool ceph-objectstore.rgw.buckets.index succeeded
2024-01-23 12:56:33.461033 I | cephclient: creating a new crush rule for changed deviceClass on crush rule "ceph-objectstore.rgw.buckets.index_host"
2024-01-23 12:56:33.461070 I | cephclient: updating pool "ceph-objectstore.rgw.buckets.index" failure domain from "host" to "host" with new crush rule "ceph-objectstore.rgw.buckets.index_host"
2024-01-23 12:56:33.461086 I | cephclient: crush rule "ceph-objectstore.rgw.buckets.index_host" will no longer be used by pool "ceph-objectstore.rgw.buckets.index"
2024-01-23 12:56:37.714552 I | cephclient: Successfully updated pool "ceph-objectstore.rgw.buckets.index" failure domain to "host"
2024-01-23 12:56:37.714587 I | cephclient: setting pool property "pg_num_min" to "8" on pool "ceph-objectstore.rgw.buckets.index"
2024-01-23 12:56:45.148772 I | cephclient: application "rook-ceph-rgw" is already set on pool "ceph-objectstore.rgw.buckets.non-ec"
2024-01-23 12:56:45.148800 I | cephclient: reconciling replicated pool ceph-objectstore.rgw.buckets.non-ec succeeded
2024-01-23 12:56:49.453871 I | cephclient: creating a new crush rule for changed deviceClass on crush rule "ceph-objectstore.rgw.buckets.non-ec_host"
2024-01-23 12:56:49.453904 I | cephclient: updating pool "ceph-objectstore.rgw.buckets.non-ec" failure domain from "host" to "host" with new crush rule "ceph-objectstore.rgw.buckets.non-ec_host"
2024-01-23 12:56:49.453919 I | cephclient: crush rule "ceph-objectstore.rgw.buckets.non-ec_host" will no longer be used by pool "ceph-objectstore.rgw.buckets.non-ec"
2024-01-23 12:56:57.847024 I | cephclient: Successfully updated pool "ceph-objectstore.rgw.buckets.non-ec" failure domain to "host"
2024-01-23 12:56:57.847063 I | cephclient: setting pool property "pg_num_min" to "8" on pool "ceph-objectstore.rgw.buckets.non-ec"
2024-01-23 12:57:08.037921 I | cephclient: application "rook-ceph-rgw" is already set on pool "ceph-objectstore.rgw.otp"
2024-01-23 12:57:08.037952 I | cephclient: reconciling replicated pool ceph-objectstore.rgw.otp succeeded
2024-01-23 12:57:11.664909 I | cephclient: creating a new crush rule for changed deviceClass on crush rule "ceph-objectstore.rgw.otp_host"
2024-01-23 12:57:11.664942 I | cephclient: updating pool "ceph-objectstore.rgw.otp" failure domain from "host" to "host" with new crush rule "ceph-objectstore.rgw.otp_host"
2024-01-23 12:57:11.664956 I | cephclient: crush rule "ceph-objectstore.rgw.otp_host" will no longer be used by pool "ceph-objectstore.rgw.otp"
2024-01-23 12:57:15.710670 I | cephclient: Successfully updated pool "ceph-objectstore.rgw.otp" failure domain to "host"
2024-01-23 12:57:15.710710 I | cephclient: setting pool property "pg_num_min" to "8" on pool "ceph-objectstore.rgw.otp"
2024-01-23 12:57:24.811457 I | cephclient: reconciling replicated pool .rgw.root succeeded
2024-01-23 12:57:28.038639 I | cephclient: creating a new crush rule for changed deviceClass on crush rule ".rgw.root_host"
2024-01-23 12:57:28.038673 I | cephclient: updating pool ".rgw.root" failure domain from "host" to "host" with new crush rule ".rgw.root_host"
2024-01-23 12:57:28.038685 I | cephclient: crush rule ".rgw.root_host" will no longer be used by pool ".rgw.root"
2024-01-23 12:57:31.694532 I | cephclient: Successfully updated pool ".rgw.root" failure domain to "host"
2024-01-23 12:57:31.694568 I | cephclient: setting pool property "pg_num_min" to "8" on pool ".rgw.root"
2024-01-23 12:57:41.464564 I | cephclient: setting pool property "allow_ec_overwrites" to "true" on pool "ceph-objectstore.rgw.buckets.data"
2024-01-23 12:57:45.445573 I | cephclient: application "rook-ceph-rgw" is already set on pool "ceph-objectstore.rgw.buckets.data"
2024-01-23 12:57:45.445608 I | cephclient: creating EC pool ceph-objectstore.rgw.buckets.data succeeded
2024-01-23 12:57:45.445623 I | ceph-object-controller: setting multisite settings for object store "ceph-objectstore"
2024-01-23 12:57:47.046709 I | ceph-object-controller: there are no changes to commit for RGW configuration period for CephObjectStore "rook-ceph/ceph-objectstore"
2024-01-23 12:57:47.046742 I | ceph-object-controller: Multisite for object-store: realm=ceph-objectstore, zonegroup=ceph-objectstore, zone=ceph-objectstore
2024-01-23 12:57:47.046754 I | ceph-object-controller: multisite configuration for object-store ceph-objectstore is complete
2024-01-23 12:57:47.046767 I | ceph-object-controller: creating object store "ceph-objectstore" in namespace "rook-ceph"
2024-01-23 12:57:47.085345 I | cephclient: getting or creating ceph auth key "client.rgw.ceph.objectstore.a"
2024-01-23 12:57:49.288104 I | ceph-object-controller: setting rgw config flags
2024-01-23 12:57:49.288142 I | op-config: setting "client.rgw.ceph.objectstore.a"="rgw_log_nonexistent_bucket"="true" option to the mon configuration database
2024-01-23 12:57:51.155725 I | op-config: successfully set "client.rgw.ceph.objectstore.a"="rgw_log_nonexistent_bucket"="true" option to the mon configuration database
2024-01-23 12:57:51.155763 I | op-config: setting "client.rgw.ceph.objectstore.a"="rgw_log_object_name_utc"="true" option to the mon configuration database
2024-01-23 12:57:54.747863 I | op-config: successfully set "client.rgw.ceph.objectstore.a"="rgw_log_object_name_utc"="true" option to the mon configuration database
2024-01-23 12:57:54.747889 I | op-config: setting "client.rgw.ceph.objectstore.a"="rgw_enable_usage_log"="true" option to the mon configuration database
2024-01-23 12:57:57.366744 I | op-config: successfully set "client.rgw.ceph.objectstore.a"="rgw_enable_usage_log"="true" option to the mon configuration database
2024-01-23 12:57:57.366811 I | op-config: setting "client.rgw.ceph.objectstore.a"="rgw_zone"="ceph-objectstore" option to the mon configuration database
2024-01-23 12:58:00.746514 I | op-config: successfully set "client.rgw.ceph.objectstore.a"="rgw_zone"="ceph-objectstore" option to the mon configuration database
2024-01-23 12:58:00.746557 I | op-config: setting "client.rgw.ceph.objectstore.a"="rgw_zonegroup"="ceph-objectstore" option to the mon configuration database
2024-01-23 12:58:04.751537 I | op-config: successfully set "client.rgw.ceph.objectstore.a"="rgw_zonegroup"="ceph-objectstore" option to the mon configuration database
2024-01-23 12:58:04.751568 I | op-config: setting "client.rgw.ceph.objectstore.a"="rgw_run_sync_thread"="true" option to the mon configuration database
2024-01-23 12:58:08.257558 I | op-config: successfully set "client.rgw.ceph.objectstore.a"="rgw_run_sync_thread"="true" option to the mon configuration database
2024-01-23 12:58:08.257892 I | ceph-object-controller: object store "ceph-objectstore" deployment "rook-ceph-rgw-ceph-objectstore-a" created
2024-01-23 12:58:08.288756 I | ceph-object-controller: object store "ceph-objectstore" deployment "rook-ceph-rgw-ceph-objectstore-a" already exists. updating if needed
2024-01-23 12:58:08.305345 I | op-k8sutil: deployment "rook-ceph-rgw-ceph-objectstore-a" did not change, nothing to update
2024-01-23 12:58:08.311499 I | ceph-object-controller: config map "rook-ceph-rgw-ceph-objectstore-mime-types" for object store "ceph-objectstore" already exists, not overwriting
2024-01-23 12:58:08.346702 I | ceph-object-controller: enabling rgw dashboard
2024-01-23 12:58:12.495601 I | ceph-object-controller: created object store "ceph-objectstore" in namespace "rook-ceph"
2024-01-23 12:58:14.287460 E | ceph-object-controller: failed to reconcile CephObjectStore "rook-ceph/ceph-objectstore". failed to create object store deployments: failed to get COSI user "cosi": Get "http://rook-ceph-rgw-ceph-objectstore.rook-ceph.svc:80/admin/user?format=json&uid=cosi": dial tcp 10.152.183.71:80: connect: connection refused
  • Crashing pod(s) logs

rook-ceph-rgw-ceph-objectstore-a-xxx (complete in attached log file):

debug   -767> 2024-01-23T13:08:55.645+0000 7f81c5804a80  5 asok(0x55e4abfda000) register_command assert hook 0x55e4abcf6c50
debug   -766> 2024-01-23T13:08:55.645+0000 7f81c5804a80  5 asok(0x55e4abfda000) register_command abort hook 0x55e4abcf6c50
debug   -765> 2024-01-23T13:08:55.645+0000 7f81c5804a80  5 asok(0x55e4abfda000) register_command leak_some_memory hook 0x55e4abcf6c50
debug   -764> 2024-01-23T13:08:55.645+0000 7f81c5804a80  5 asok(0x55e4abfda000) register_command perfcounters_dump hook 0x55e4abcf6c50
debug   -763> 2024-01-23T13:08:55.645+0000 7f81c5804a80  5 asok(0x55e4abfda000) register_command 1 hook 0x55e4abcf6c50
debug   -762> 2024-01-23T13:08:55.645+0000 7f81c5804a80  5 asok(0x55e4abfda000) register_command perf dump hook 0x55e4abcf6c50
debug   -761> 2024-01-23T13:08:55.645+0000 7f81c5804a80  5 asok(0x55e4abfda000) register_command perfcounters_schema hook 0x55e4abcf6c50
debug   -760> 2024-01-23T13:08:55.645+0000 7f81c5804a80  5 asok(0x55e4abfda000) register_command perf histogram dump hook 0x55e4abcf6c50
debug   -759> 2024-01-23T13:08:55.645+0000 7f81c5804a80  5 asok(0x55e4abfda000) register_command 2 hook 0x55e4abcf6c50
debug   -758> 2024-01-23T13:08:55.645+0000 7f81c5804a80  5 asok(0x55e4abfda000) register_command perf schema hook 0x55e4abcf6c50
debug   -757> 2024-01-23T13:08:55.645+0000 7f81c5804a80  5 asok(0x55e4abfda000) register_command counter dump hook 0x55e4abcf6c50
debug   -756> 2024-01-23T13:08:55.645+0000 7f81c5804a80  5 asok(0x55e4abfda000) register_command counter schema hook 0x55e4abcf6c50
debug   -755> 2024-01-23T13:08:55.645+0000 7f81c5804a80  5 asok(0x55e4abfda000) register_command perf histogram schema hook 0x55e4abcf6c50
debug   -754> 2024-01-23T13:08:55.645+0000 7f81c5804a80  5 asok(0x55e4abfda000) register_command perf reset hook 0x55e4abcf6c50
debug   -753> 2024-01-23T13:08:55.645+0000 7f81c5804a80  5 asok(0x55e4abfda000) register_command config show hook 0x55e4abcf6c50
debug   -752> 2024-01-23T13:08:55.645+0000 7f81c5804a80  5 asok(0x55e4abfda000) register_command config help hook 0x55e4abcf6c50
debug   -751> 2024-01-23T13:08:55.645+0000 7f81c5804a80  5 asok(0x55e4abfda000) register_command config set hook 0x55e4abcf6c50
debug   -750> 2024-01-23T13:08:55.645+0000 7f81c5804a80  5 asok(0x55e4abfda000) register_command config unset hook 0x55e4abcf6c50
debug   -749> 2024-01-23T13:08:55.645+0000 7f81c5804a80  5 asok(0x55e4abfda000) register_command config get hook 0x55e4abcf6c50
debug   -748> 2024-01-23T13:08:55.645+0000 7f81c5804a80  5 asok(0x55e4abfda000) register_command config diff hook 0x55e4abcf6c50
debug   -747> 2024-01-23T13:08:55.645+0000 7f81c5804a80  5 asok(0x55e4abfda000) register_command config diff get hook 0x55e4abcf6c50
debug   -746> 2024-01-23T13:08:55.645+0000 7f81c5804a80  5 asok(0x55e4abfda000) register_command injectargs hook 0x55e4abcf6c50
debug   -745> 2024-01-23T13:08:55.645+0000 7f81c5804a80  5 asok(0x55e4abfda000) register_command log flush hook 0x55e4abcf6c50
debug   -744> 2024-01-23T13:08:55.645+0000 7f81c5804a80  5 asok(0x55e4abfda000) register_command log dump hook 0x55e4abcf6c50
debug   -743> 2024-01-23T13:08:55.645+0000 7f81c5804a80  5 asok(0x55e4abfda000) register_command log reopen hook 0x55e4abcf6c50
debug   -742> 2024-01-23T13:08:55.645+0000 7f81c5804a80  5 asok(0x55e4abfda000) register_command dump_mempools hook 0x55e4aca9e068
debug   -741> 2024-01-23T13:08:55.659+0000 7f81c5804a80 10 monclient: get_monmap_and_config
debug   -740> 2024-01-23T13:08:55.659+0000 7f81c5804a80 10 monclient: build_initial_monmap
debug   -739> 2024-01-23T13:08:55.659+0000 7f81c5804a80  1 build_initial for_mkfs: 0
debug   -738> 2024-01-23T13:08:55.659+0000 7f81c5804a80 10 monclient: monmap:
epoch 0
fsid f31bf636-769f-423b-bc43-5ccf3d1197b1
last_changed 2024-01-23T13:08:55.660821+0000
created 2024-01-23T13:08:55.660821+0000
min_mon_release 0 (unknown)
election_strategy: 1
0: [v2:10.152.183.75:3300/0,v1:10.152.183.75:6789/0] mon.noname-b
1: [v2:10.152.183.153:3300/0,v1:10.152.183.153:6789/0] mon.noname-c
2: [v2:10.152.183.228:3300/0,v1:10.152.183.228:6789/0] mon.noname-a

debug   -737> 2024-01-23T13:08:55.660+0000 7f81c5804a80  5 AuthRegistry(0x55e4aca9a140) adding auth protocol: cephx
debug   -736> 2024-01-23T13:08:55.660+0000 7f81c5804a80  5 AuthRegistry(0x55e4aca9a140) adding auth protocol: cephx
debug   -735> 2024-01-23T13:08:55.660+0000 7f81c5804a80  5 AuthRegistry(0x55e4aca9a140) adding auth protocol: cephx
debug   -734> 2024-01-23T13:08:55.660+0000 7f81c5804a80  5 AuthRegistry(0x55e4aca9a140) adding con mode: secure
debug   -733> 2024-01-23T13:08:55.660+0000 7f81c5804a80  5 AuthRegistry(0x55e4aca9a140) adding con mode: crc
debug   -732> 2024-01-23T13:08:55.660+0000 7f81c5804a80  5 AuthRegistry(0x55e4aca9a140) adding con mode: secure
debug   -731> 2024-01-23T13:08:55.660+0000 7f81c5804a80  5 AuthRegistry(0x55e4aca9a140) adding con mode: crc
debug   -730> 2024-01-23T13:08:55.660+0000 7f81c5804a80  5 AuthRegistry(0x55e4aca9a140) adding con mode: secure
debug   -729> 2024-01-23T13:08:55.660+0000 7f81c5804a80  5 AuthRegistry(0x55e4aca9a140) adding con mode: crc
debug   -728> 2024-01-23T13:08:55.660+0000 7f81c5804a80  5 AuthRegistry(0x55e4aca9a140) adding con mode: secure
debug   -727> 2024-01-23T13:08:55.660+0000 7f81c5804a80  5 AuthRegistry(0x55e4aca9a140) adding con mode: crc
debug   -726> 2024-01-23T13:08:55.660+0000 7f81c5804a80  5 AuthRegistry(0x55e4aca9a140) adding con mode: secure
debug   -725> 2024-01-23T13:08:55.660+0000 7f81c5804a80  5 AuthRegistry(0x55e4aca9a140) adding con mode: crc
debug   -724> 2024-01-23T13:08:55.660+0000 7f81c5804a80  5 AuthRegistry(0x55e4aca9a140) adding con mode: secure
debug   -723> 2024-01-23T13:08:55.660+0000 7f81c5804a80  2 auth: KeyRing::load: loaded key file /etc/ceph/keyring-store/keyring
debug   -722> 2024-01-23T13:08:55.661+0000 7f81c5804a80 10 monclient: init
debug   -721> 2024-01-23T13:08:55.661+0000 7f81c5804a80  5 AuthRegistry(0x7ffcbf5e0400) adding auth protocol: cephx

...

debug    -19> 2024-01-23T13:09:24.987+0000 7f8093c91700 10 monclient: _finish_auth 0
debug    -18> 2024-01-23T13:09:24.988+0000 7f8093c91700 10 monclient: _check_auth_tickets
debug    -17> 2024-01-23T13:09:24.988+0000 7f8093c91700 10 monclient: handle_config config(11 keys) v1
debug    -16> 2024-01-23T13:09:24.988+0000 7f8093c91700 10 monclient: handle_monmap mon_map magic: 0 v1
debug    -15> 2024-01-23T13:09:24.988+0000 7f8093c91700 10 monclient:  got monmap 6 from mon.b (according to old e6)
debug    -14> 2024-01-23T13:09:24.988+0000 7f8095494700  4 set_mon_vals no callback set
debug    -13> 2024-01-23T13:09:24.988+0000 7f8093c91700 10 monclient: dump:
epoch 6
fsid f31bf636-769f-423b-bc43-5ccf3d1197b1
last_changed 2024-01-21T22:56:22.019951+0000
created 2024-01-11T20:17:19.172397+0000
min_mon_release 18 (reef)
election_strategy: 1
0: [v2:10.152.183.228:3300/0,v1:10.152.183.228:6789/0] mon.a
1: [v2:10.152.183.153:3300/0,v1:10.152.183.153:6789/0] mon.b
2: [v2:10.152.183.75:3300/0,v1:10.152.183.75:6789/0] mon.d

debug    -12> 2024-01-23T13:09:24.988+0000 7f81c5804a80  5 monclient: authenticate success, global_id 1778332
debug    -11> 2024-01-23T13:09:24.988+0000 7f81c5804a80 10 monclient: _renew_subs
debug    -10> 2024-01-23T13:09:24.988+0000 7f81c5804a80 10 monclient: _send_mon_message to mon.b at v2:10.152.183.153:3300/0
debug     -9> 2024-01-23T13:09:24.988+0000 7f81c5804a80 10 monclient: _renew_subs
debug     -8> 2024-01-23T13:09:24.988+0000 7f81c5804a80 10 monclient: _send_mon_message to mon.b at v2:10.152.183.153:3300/0
debug     -7> 2024-01-23T13:09:24.988+0000 7f81c5804a80  1 librados: init done
debug     -6> 2024-01-23T13:09:24.992+0000 7f8093c91700  4 mgrc handle_mgr_map Got map version 842
debug     -5> 2024-01-23T13:09:24.992+0000 7f8093c91700  4 mgrc handle_mgr_map Active mgr is now [v2:10.1.23.96:6800/3090971858,v1:10.1.23.96:6801/3090971858]
debug     -4> 2024-01-23T13:09:24.992+0000 7f8093c91700  4 mgrc reconnect Starting new session with [v2:10.1.23.96:6800/3090971858,v1:10.1.23.96:6801/3090971858]
debug     -3> 2024-01-23T13:09:24.992+0000 7f81c2b91700 10 monclient: get_auth_request con 0x55e4af1ca000 auth_method 0
debug     -2> 2024-01-23T13:09:24.993+0000 7f81c3b93700 10 monclient: get_auth_request con 0x55e4ace2a000 auth_method 0
debug     -1> 2024-01-23T13:09:25.019+0000 7f81c5804a80 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.1/rpm/el8/BUILD/ceph-18.2.1/src/rgw/rgw_amqp.cc: In function 'rgw::amqp::Manager::Manager(size_t, size_t, size_t, long int, unsigned int, unsigned int, ceph::common::CephContext*)' thread 7f81c5804a80 time 2024-01-23T13:09:25.016871+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.1/rpm/el8/BUILD/ceph-18.2.1/src/rgw/rgw_amqp.cc: 840: FAILED ceph_assert(rc==0)

 ceph version 18.2.1 (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x7f81cdc09e15]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x2a9fdb) [0x7f81cdc09fdb]
 3: (rgw::amqp::init(ceph::common::CephContext*)+0x261) [0x55e4aa4a3961]
 4: (rgw::AppMain::init_notification_endpoints()+0x38) [0x55e4a9e00d98]
 5: main()
 6: __libc_start_main()
 7: _start()

debug      0> 2024-01-23T13:09:25.022+0000 7f81c5804a80 -1 *** Caught signal (Aborted) **
 in thread 7f81c5804a80 thread_name:radosgw

 ceph version 18.2.1 (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)
 1: /lib64/libpthread.so.0(+0x12d20) [0x7f81cacc1d20]
 2: gsignal()
 3: abort()
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x18f) [0x7f81cdc09e6f]
 5: /usr/lib64/ceph/libceph-common.so.2(+0x2a9fdb) [0x7f81cdc09fdb]
 6: (rgw::amqp::init(ceph::common::CephContext*)+0x261) [0x55e4aa4a3961]
 7: (rgw::AppMain::init_notification_endpoints()+0x38) [0x55e4a9e00d98]
 8: main()
 9: __libc_start_main()
 10: _start()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 rbd_pwl
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 immutable_obj_cache
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 0 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 rgw_sync
   1/ 5 rgw_datacache
   1/ 5 rgw_access
   1/ 5 rgw_dbstore
   1/ 5 rgw_flight
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   1/ 5 fuse
   2/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
   1/ 5 prioritycache
   0/ 5 test
   0/ 5 cephfs_mirror
   0/ 5 cephsqlite
   0/ 5 seastore
   0/ 5 seastore_onode
   0/ 5 seastore_odata
   0/ 5 seastore_omap
   0/ 5 seastore_tm
   0/ 5 seastore_t
   0/ 5 seastore_cleaner
   0/ 5 seastore_epm
   0/ 5 seastore_lba
   0/ 5 seastore_fixedkv_tree
   0/ 5 seastore_cache
   0/ 5 seastore_journal
   0/ 5 seastore_device
   0/ 5 seastore_backref
   0/ 5 alienstore
   1/ 5 mclock
   0/ 5 cyanstore
   1/ 5 ceph_exporter
   1/ 5 memstore
  -2/-2 (syslog threshold)
  99/99 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
  7f8093c91700 / ms_dispatch
  7f8095494700 / io_context_pool
  7f819bb43700 / lifecycle_thr_2
  7f819db47700 / lifecycle_thr_1
  7f819fb4b700 / lifecycle_thr_0
  7f81a6358700 / rgw_obj_expirer
  7f81a6b59700 / rgw_gc
  7f81a8b5d700 / ms_dispatch
  7f81aa360700 / io_context_pool
  7f81aab61700 / rgw_dt_lg_renew
  7f81bbb83700 / safe_timer
  7f81bcb85700 / ms_dispatch
  7f81bd386700 / ceph_timer
  7f81be388700 / io_context_pool
  7f81c1b8f700 / admin_socket
  7f81c2390700 / service
  7f81c2b91700 / msgr-worker-2
  7f81c3392700 / msgr-worker-1
  7f81c3b93700 / msgr-worker-0
  7f81c5804a80 / radosgw
  max_recent     10000
  max_new         1000
  log_file /var/lib/ceph/crash/2024-01-23T13:09:25.023127Z_f9c5ce06-f7c6-4a2c-9504-a3f30fa27794/log
--- end dump of recent events ---

Cluster Status to submit:

  cluster:
    id:     f31bf636-769f-423b-bc43-5ccf3d1197b1
    health: HEALTH_WARN
            739 daemons have recently crashed
            11 mgr modules have recently crashed

  services:
    mon: 3 daemons, quorum a,b,d (age 8h)
    mgr: a(active, since 2h), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 8 osds: 8 up (since 8h), 8 in (since 8d)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   15 pools, 281 pgs
    objects: 31.02k objects, 8.3 GiB
    usage:   26 GiB used, 3.5 TiB / 3.5 TiB avail
    pgs:     281 active+clean

  io:
    client:   1.2 KiB/s rd, 2 op/s rd, 0 op/s wr
  • Output of kubectl commands, if necessary

Environment:

  • OS: CentOS 7.9
  • Kernel: 3.10.0-1160.105.1.el7.x86_64 #1 SMP Thu Dec 7 15:39:45 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
  • Cloud provider or hardware configuration: bare metal, see above
  • Rook version: rook: v1.13.2 / go: go1.21.5
  • Storage backend version: ceph version 18.2.1 (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)
  • Kubernetes version: Client Version: v1.25.16 / Kustomize Version: v4.5.7 / Server Version: v1.25.16
  • Kubernetes cluster type: microk8s v1.25.16 revision 6254
sta4152 added the bug label Jan 23, 2024

sta4152 commented Jan 23, 2024

Lines 839-840 of rgw_amqp.cc:

      const auto rc = ceph_pthread_setname(runner.native_handle(), "amqp_manager");
      ceph_assert(rc==0);

If I'm not completely wrong, then setting the thread's name returns a non-zero code here (as far as I can see, only possible with the compat.h-provided implementation)?
Why... how...?
And if so: why did it ever work? (Initially the problematic pod started fine.)
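
To narrow this down, one could compile a small standalone probe inside the failing RGW container image and check what the underlying call returns there. This is only a hypothetical sketch (not part of Ceph) that mimics the std::thread/native_handle() call pattern quoted above, calling glibc's pthread_setname_np() directly rather than the ceph_pthread_setname() wrapper:

    // probe.cpp -- build with: g++ -pthread -o probe probe.cpp
    // Hypothetical probe: mimics the call pattern from rgw_amqp.cc to see
    // what pthread_setname_np() returns inside this container image.
    #include <chrono>
    #include <cstring>
    #include <iostream>
    #include <pthread.h>
    #include <thread>

    int main() {
      std::thread runner([] {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
      });
      const int rc = pthread_setname_np(runner.native_handle(), "amqp_manager");
      std::cout << "pthread_setname_np rc=" << rc
                << " (" << (rc == 0 ? "ok" : std::strerror(rc)) << ")\n";
      runner.join();
      return rc;
    }

A zero rc here would suggest the non-zero return seen by the assert comes from the wrapper or its environment rather than from the name itself.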

travisn commented Jan 24, 2024

For the rgw crash, a Ceph tracker would be the right place to open the issue so the core rgw team can take a look.

@subhamkrai Could you take a look at these messages? It seems odd that the failure domain is being updated from "host" to "host".

2024-01-23 12:56:16.150339 I | cephclient: updating pool "ceph-objectstore.rgw.log" failure domain from "host" to "host" with new crush rule "ceph-objectstore.rgw.log_host"

thotz commented Jan 25, 2024

Lines 839-840 of rgw_amqp.cc:

      const auto rc = ceph_pthread_setname(runner.native_handle(), "amqp_manager");
      ceph_assert(rc==0);

If I'm not completely wrong, then setting the thread's name returns a non-zero code here (as far as I can see, only possible with the compat.h-provided implementation)? Why... how...? And if so: why did it ever work? (Initially the problematic pod started fine.)

@yuvalif have you seen any similar issues in Ceph v18?

yuvalif commented Jan 25, 2024

ceph_pthread_setname

ceph_pthread_setname() is just a wrapper around pthread_setname_np(), which returns non-zero only if the provided thread name is longer than 15 bytes (plus the terminating \0).
"amqp_manager" is 12 bytes, so I don't see why it would fail.

sta4152 commented Jan 25, 2024 via email

sta4152 commented Feb 2, 2024

I finally got access and reported there, too: https://tracker.ceph.com/issues/64305

yuvalif commented Feb 5, 2024

I finally got access and reported there, too: https://tracker.ceph.com/issues/64305

thanks! will look into that

mh013370 commented Feb 12, 2024

I'm also seeing this with one of our clusters. I'd upgraded from Rook 1.12.8 to 1.13.3 and Ceph 17.2.6 to 18.2.1. Oddly, I'd upgraded 2 other clusters with the same configuration and they both went smoothly. The third cluster hit this case and I can't seem to get it out of it. We don't use the object store on this cluster, so I've completely deleted and re-created it, but the error still occurs. The rest of the rook-ceph cluster deployment appears healthy; only the objectstore pod is crash looping.

This k8s cluster has only 1 control plane, whereas the other two had 3 control planes. Perhaps that's related? All are RKE2 k8s 1.27.


This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.


github-actions bot commented May 1, 2024

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on May 1, 2024
@mh013370

This still happens with rook 1.14.3 and ceph 18.2.2. I have only seen it happen on clusters with 1 control plane.

yuvalif commented May 10, 2024

This still happens with rook 1.14.3 and ceph 18.2.2. I have only seen it happen on clusters with 1 control plane.

which OS version is being used there?

@mh013370

This still happens with rook 1.14.3 and ceph 18.2.2. I have only seen it happen on clusters with 1 control plane.

which OS version is being used there?

In my case, CentOS 7 with kernel version 3.10.0 (I know...). What's odd is that we don't see this in Kubernetes clusters with multiple (3+) control planes. It has happened on 2 small clusters with a single control plane.
