Bug #64305
ceph_assert error on rgw start in rook-ceph-rgw-ceph-objectstore pod
Description
ceph is deployed within our cluster using rook.
After a cluster restart, the rook-ceph-rgw-ceph-objectstore-xxx pod was observed running into CrashLoopBackOff. All crash logs since then look the same: an assertion failure is reported for
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.1/rpm/el8/BUILD/ceph-18.2.1/src/rgw/rgw_amqp.cc: 840: FAILED ceph_assert(rc==0)
I first assumed some corruption and tried to fix it by re-creating the pod, both with and without removing all(?) rgw/objectstore resources (pools, zones, zonegroups, etc., at least once including the .rgw.root pool). The symptoms never changed.
(The object store was unused.)
The strange thing is that, as far as I can see, the referenced location deals with setting the current thread's name (a hard-coded one, and not too long), which according to the method's description should not return non-zero. But I failed to find the sources implementing the compat (?) library that provides the method used.
This issue has been reported to rook as well, so far without any hint as to where to look more closely:
https://github.com/rook/rook/issues/13614
The cluster still behaves the same way and regularly produces these crashes.
ceph crash info:
bash-4.4$ ceph crash info 2024-01-23T13:09:25.023127Z_f9c5ce06-f7c6-4a2c-9504-a3f30fa27794
{
    "assert_condition": "rc==0",
    "assert_file": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.1/rpm/el8/BUILD/ceph-18.2.1/src/rgw/rgw_amqp.cc",
    "assert_func": "rgw::amqp::Manager::Manager(size_t, size_t, size_t, long int, unsigned int, unsigned int, ceph::common::CephContext*)",
    "assert_line": 840,
    "assert_msg": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.1/rpm/el8/BUILD/ceph-18.2.1/src/rgw/rgw_amqp.cc: In function 'rgw::amqp::Manager::Manager(size_t, size_t, size_t, long int, unsigned int, unsigned int, ceph::common::CephContext*)' thread 7f81c5804a80 time 2024-01-23T13:09:25.016871+0000\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.1/rpm/el8/BUILD/ceph-18.2.1/src/rgw/rgw_amqp.cc: 840: FAILED ceph_assert(rc==0)\n",
    "assert_thread_name": "radosgw",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12d20) [0x7f81cacc1d20]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x18f) [0x7f81cdc09e6f]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x2a9fdb) [0x7f81cdc09fdb]",
        "(rgw::amqp::init(ceph::common::CephContext*)+0x261) [0x55e4aa4a3961]",
        "(rgw::AppMain::init_notification_endpoints()+0x38) [0x55e4a9e00d98]",
        "main()",
        "__libc_start_main()",
        "_start()"
    ],
    "ceph_version": "18.2.1",
    "crash_id": "2024-01-23T13:09:25.023127Z_f9c5ce06-f7c6-4a2c-9504-a3f30fa27794",
    "entity_name": "client.rgw.ceph.objectstore.a",
    "os_id": "centos",
    "os_name": "CentOS Stream",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "radosgw",
    "stack_sig": "e082d3eaabfd1f88602d9630b283a2e38f24470b2a2ab68506ca57702cc1bcc2",
    "timestamp": "2024-01-23T13:09:25.023127Z",
    "utsname_hostname": "rook-ceph-rgw-ceph-objectstore-a-6c95458fb-gcb4d",
    "utsname_machine": "x86_64",
    "utsname_release": "3.10.0-1160.105.1.el7.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Thu Dec 7 15:39:45 UTC 2023"
}
rook-ceph-rgw-ceph-objectstore-a-xxx log:
debug -767> 2024-01-23T13:08:55.645+0000 7f81c5804a80 5 asok(0x55e4abfda000) register_command assert hook 0x55e4abcf6c50 debug -766> 2024-01-23T13:08:55.645+0000 7f81c5804a80 5 asok(0x55e4abfda000) register_command abort hook 0x55e4abcf6c50 debug -765> 2024-01-23T13:08:55.645+0000 7f81c5804a80 5 asok(0x55e4abfda000) register_command leak_some_memory hook 0x55e4abcf6c50 debug -764> 2024-01-23T13:08:55.645+0000 7f81c5804a80 5 asok(0x55e4abfda000) register_command perfcounters_dump hook 0x55e4abcf6c50 debug -763> 2024-01-23T13:08:55.645+0000 7f81c5804a80 5 asok(0x55e4abfda000) register_command 1 hook 0x55e4abcf6c50 debug -762> 2024-01-23T13:08:55.645+0000 7f81c5804a80 5 asok(0x55e4abfda000) register_command perf dump hook 0x55e4abcf6c50 debug -761> 2024-01-23T13:08:55.645+0000 7f81c5804a80 5 asok(0x55e4abfda000) register_command perfcounters_schema hook 0x55e4abcf6c50 debug -760> 2024-01-23T13:08:55.645+0000 7f81c5804a80 5 asok(0x55e4abfda000) register_command perf histogram dump hook 0x55e4abcf6c50 debug -759> 2024-01-23T13:08:55.645+0000 7f81c5804a80 5 asok(0x55e4abfda000) register_command 2 hook 0x55e4abcf6c50 debug -758> 2024-01-23T13:08:55.645+0000 7f81c5804a80 5 asok(0x55e4abfda000) register_command perf schema hook 0x55e4abcf6c50 debug -757> 2024-01-23T13:08:55.645+0000 7f81c5804a80 5 asok(0x55e4abfda000) register_command counter dump hook 0x55e4abcf6c50 debug -756> 2024-01-23T13:08:55.645+0000 7f81c5804a80 5 asok(0x55e4abfda000) register_command counter schema hook 0x55e4abcf6c50 debug -755> 2024-01-23T13:08:55.645+0000 7f81c5804a80 5 asok(0x55e4abfda000) register_command perf histogram schema hook 0x55e4abcf6c50 debug -754> 2024-01-23T13:08:55.645+0000 7f81c5804a80 5 asok(0x55e4abfda000) register_command perf reset hook 0x55e4abcf6c50 debug -753> 2024-01-23T13:08:55.645+0000 7f81c5804a80 5 asok(0x55e4abfda000) register_command config show hook 0x55e4abcf6c50 debug -752> 2024-01-23T13:08:55.645+0000 7f81c5804a80 5 asok(0x55e4abfda000) register_command config 
help hook 0x55e4abcf6c50 debug -751> 2024-01-23T13:08:55.645+0000 7f81c5804a80 5 asok(0x55e4abfda000) register_command config set hook 0x55e4abcf6c50 debug -750> 2024-01-23T13:08:55.645+0000 7f81c5804a80 5 asok(0x55e4abfda000) register_command config unset hook 0x55e4abcf6c50 debug -749> 2024-01-23T13:08:55.645+0000 7f81c5804a80 5 asok(0x55e4abfda000) register_command config get hook 0x55e4abcf6c50 debug -748> 2024-01-23T13:08:55.645+0000 7f81c5804a80 5 asok(0x55e4abfda000) register_command config diff hook 0x55e4abcf6c50 debug -747> 2024-01-23T13:08:55.645+0000 7f81c5804a80 5 asok(0x55e4abfda000) register_command config diff get hook 0x55e4abcf6c50 debug -746> 2024-01-23T13:08:55.645+0000 7f81c5804a80 5 asok(0x55e4abfda000) register_command injectargs hook 0x55e4abcf6c50 debug -745> 2024-01-23T13:08:55.645+0000 7f81c5804a80 5 asok(0x55e4abfda000) register_command log flush hook 0x55e4abcf6c50 debug -744> 2024-01-23T13:08:55.645+0000 7f81c5804a80 5 asok(0x55e4abfda000) register_command log dump hook 0x55e4abcf6c50 debug -743> 2024-01-23T13:08:55.645+0000 7f81c5804a80 5 asok(0x55e4abfda000) register_command log reopen hook 0x55e4abcf6c50 debug -742> 2024-01-23T13:08:55.645+0000 7f81c5804a80 5 asok(0x55e4abfda000) register_command dump_mempools hook 0x55e4aca9e068 debug -741> 2024-01-23T13:08:55.659+0000 7f81c5804a80 10 monclient: get_monmap_and_config debug -740> 2024-01-23T13:08:55.659+0000 7f81c5804a80 10 monclient: build_initial_monmap debug -739> 2024-01-23T13:08:55.659+0000 7f81c5804a80 1 build_initial for_mkfs: 0 debug -738> 2024-01-23T13:08:55.659+0000 7f81c5804a80 10 monclient: monmap: epoch 0 fsid f31bf636-769f-423b-bc43-5ccf3d1197b1 last_changed 2024-01-23T13:08:55.660821+0000 created 2024-01-23T13:08:55.660821+0000 min_mon_release 0 (unknown) election_strategy: 1 0: [v2:10.152.183.75:3300/0,v1:10.152.183.75:6789/0] mon.noname-b 1: [v2:10.152.183.153:3300/0,v1:10.152.183.153:6789/0] mon.noname-c 2: [v2:10.152.183.228:3300/0,v1:10.152.183.228:6789/0] 
mon.noname-a debug -737> 2024-01-23T13:08:55.660+0000 7f81c5804a80 5 AuthRegistry(0x55e4aca9a140) adding auth protocol: cephx debug -736> 2024-01-23T13:08:55.660+0000 7f81c5804a80 5 AuthRegistry(0x55e4aca9a140) adding auth protocol: cephx debug -735> 2024-01-23T13:08:55.660+0000 7f81c5804a80 5 AuthRegistry(0x55e4aca9a140) adding auth protocol: cephx debug -734> 2024-01-23T13:08:55.660+0000 7f81c5804a80 5 AuthRegistry(0x55e4aca9a140) adding con mode: secure debug -733> 2024-01-23T13:08:55.660+0000 7f81c5804a80 5 AuthRegistry(0x55e4aca9a140) adding con mode: crc debug -732> 2024-01-23T13:08:55.660+0000 7f81c5804a80 5 AuthRegistry(0x55e4aca9a140) adding con mode: secure debug -731> 2024-01-23T13:08:55.660+0000 7f81c5804a80 5 AuthRegistry(0x55e4aca9a140) adding con mode: crc debug -730> 2024-01-23T13:08:55.660+0000 7f81c5804a80 5 AuthRegistry(0x55e4aca9a140) adding con mode: secure debug -729> 2024-01-23T13:08:55.660+0000 7f81c5804a80 5 AuthRegistry(0x55e4aca9a140) adding con mode: crc debug -728> 2024-01-23T13:08:55.660+0000 7f81c5804a80 5 AuthRegistry(0x55e4aca9a140) adding con mode: secure debug -727> 2024-01-23T13:08:55.660+0000 7f81c5804a80 5 AuthRegistry(0x55e4aca9a140) adding con mode: crc debug -726> 2024-01-23T13:08:55.660+0000 7f81c5804a80 5 AuthRegistry(0x55e4aca9a140) adding con mode: secure debug -725> 2024-01-23T13:08:55.660+0000 7f81c5804a80 5 AuthRegistry(0x55e4aca9a140) adding con mode: crc debug -724> 2024-01-23T13:08:55.660+0000 7f81c5804a80 5 AuthRegistry(0x55e4aca9a140) adding con mode: secure debug -723> 2024-01-23T13:08:55.660+0000 7f81c5804a80 2 auth: KeyRing::load: loaded key file /etc/ceph/keyring-store/keyring debug -722> 2024-01-23T13:08:55.661+0000 7f81c5804a80 10 monclient: init debug -721> 2024-01-23T13:08:55.661+0000 7f81c5804a80 5 AuthRegistry(0x7ffcbf5e0400) adding auth protocol: cephx ... 
debug -19> 2024-01-23T13:09:24.987+0000 7f8093c91700 10 monclient: _finish_auth 0 debug -18> 2024-01-23T13:09:24.988+0000 7f8093c91700 10 monclient: _check_auth_tickets debug -17> 2024-01-23T13:09:24.988+0000 7f8093c91700 10 monclient: handle_config config(11 keys) v1 debug -16> 2024-01-23T13:09:24.988+0000 7f8093c91700 10 monclient: handle_monmap mon_map magic: 0 v1 debug -15> 2024-01-23T13:09:24.988+0000 7f8093c91700 10 monclient: got monmap 6 from mon.b (according to old e6) debug -14> 2024-01-23T13:09:24.988+0000 7f8095494700 4 set_mon_vals no callback set debug -13> 2024-01-23T13:09:24.988+0000 7f8093c91700 10 monclient: dump: epoch 6 fsid f31bf636-769f-423b-bc43-5ccf3d1197b1 last_changed 2024-01-21T22:56:22.019951+0000 created 2024-01-11T20:17:19.172397+0000 min_mon_release 18 (reef) election_strategy: 1 0: [v2:10.152.183.228:3300/0,v1:10.152.183.228:6789/0] mon.a 1: [v2:10.152.183.153:3300/0,v1:10.152.183.153:6789/0] mon.b 2: [v2:10.152.183.75:3300/0,v1:10.152.183.75:6789/0] mon.d debug -12> 2024-01-23T13:09:24.988+0000 7f81c5804a80 5 monclient: authenticate success, global_id 1778332 debug -11> 2024-01-23T13:09:24.988+0000 7f81c5804a80 10 monclient: _renew_subs debug -10> 2024-01-23T13:09:24.988+0000 7f81c5804a80 10 monclient: _send_mon_message to mon.b at v2:10.152.183.153:3300/0 debug -9> 2024-01-23T13:09:24.988+0000 7f81c5804a80 10 monclient: _renew_subs debug -8> 2024-01-23T13:09:24.988+0000 7f81c5804a80 10 monclient: _send_mon_message to mon.b at v2:10.152.183.153:3300/0 debug -7> 2024-01-23T13:09:24.988+0000 7f81c5804a80 1 librados: init done debug -6> 2024-01-23T13:09:24.992+0000 7f8093c91700 4 mgrc handle_mgr_map Got map version 842 debug -5> 2024-01-23T13:09:24.992+0000 7f8093c91700 4 mgrc handle_mgr_map Active mgr is now [v2:10.1.23.96:6800/3090971858,v1:10.1.23.96:6801/3090971858] debug -4> 2024-01-23T13:09:24.992+0000 7f8093c91700 4 mgrc reconnect Starting new session with [v2:10.1.23.96:6800/3090971858,v1:10.1.23.96:6801/3090971858 ] debug -3> 
2024-01-23T13:09:24.992+0000 7f81c2b91700 10 monclient: get_auth_request con 0x55e4af1ca000 auth_method 0 debug -2> 2024-01-23T13:09:24.993+0000 7f81c3b93700 10 monclient: get_auth_request con 0x55e4ace2a000 auth_method 0 debug -1> 2024-01-23T13:09:25.019+0000 7f81c5804a80 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/cento s8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.1/rpm/el8/BUILD/ceph-18.2.1/src/rgw/rgw_amqp.cc: In function 'rgw::amqp::Manager::Manager(size_t, size_t, siz e_t, long int, unsigned int, unsigned int, ceph::common::CephContext*)' thread 7f81c5804a80 time 2024-01-23T13:09:25.016871+0000 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.1/rpm/el 8/BUILD/ceph-18.2.1/src/rgw/rgw_amqp.cc: 840: FAILED ceph_assert(rc==0) ceph version 18.2.1 (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x7f81cdc09e15] 2: /usr/lib64/ceph/libceph-common.so.2(+0x2a9fdb) [0x7f81cdc09fdb] 3: (rgw::amqp::init(ceph::common::CephContext*)+0x261) [0x55e4aa4a3961] 4: (rgw::AppMain::init_notification_endpoints()+0x38) [0x55e4a9e00d98] 5: main() 6: __libc_start_main() 7: _start() debug 0> 2024-01-23T13:09:25.022+0000 7f81c5804a80 -1 *** Caught signal (Aborted) ** in thread 7f81c5804a80 thread_name:radosgw ceph version 18.2.1 (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable) 1: /lib64/libpthread.so.0(+0x12d20) [0x7f81cacc1d20] 2: gsignal() 3: abort() 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x18f) [0x7f81cdc09e6f] 5: /usr/lib64/ceph/libceph-common.so.2(+0x2a9fdb) [0x7f81cdc09fdb] 6: (rgw::amqp::init(ceph::common::CephContext*)+0x261) [0x55e4aa4a3961] 7: (rgw::AppMain::init_notification_endpoints()+0x38) [0x55e4a9e00d98] 8: main() 9: __libc_start_main() 10: _start() NOTE: a copy of the executable, 
or `objdump -rdS <executable>` is needed to interpret this. --- logging levels --- 0/ 5 none 0/ 1 lockdep 0/ 1 context 1/ 1 crush 1/ 5 mds 1/ 5 mds_balancer 1/ 5 mds_locker 1/ 5 mds_log 1/ 5 mds_log_expire 1/ 5 mds_migrator 0/ 1 buffer 0/ 1 timer 0/ 1 filer 0/ 1 striper 0/ 1 objecter 0/ 5 rados 0/ 5 rbd 0/ 5 rbd_mirror 0/ 5 rbd_replay 0/ 5 rbd_pwl 0/ 5 journaler 0/ 5 objectcacher 0/ 5 immutable_obj_cache 0/ 5 client 1/ 5 osd 0/ 5 optracker 0/ 5 objclass 1/ 3 filestore 1/ 3 journal 0/ 0 ms 1/ 5 mon 0/10 monc 1/ 5 paxos 0/ 5 tp 1/ 5 auth 1/ 5 crypto 1/ 1 finisher 1/ 1 reserver 1/ 5 heartbeatmap 1/ 5 perfcounter 1/ 5 rgw 1/ 5 rgw_sync 1/ 5 rgw_datacache 1/ 5 rgw_access 1/ 5 rgw_dbstore 1/ 5 rgw_flight 1/ 5 javaclient 1/ 5 asok 1/ 1 throttle 0/ 0 refs 1/ 5 compressor 1/ 5 bluestore 1/ 5 bluefs 1/ 3 bdev 1/ 5 kstore 4/ 5 rocksdb 4/ 5 leveldb 1/ 5 fuse 2/ 5 mgr 1/ 5 mgrc 1/ 5 dpdk 1/ 5 eventtrace 1/ 5 prioritycache 0/ 5 test 0/ 5 cephfs_mirror 0/ 5 cephsqlite 0/ 5 seastore 0/ 5 seastore_onode 0/ 5 seastore_odata 0/ 5 seastore_omap 0/ 5 seastore_tm 0/ 5 seastore_t 0/ 5 seastore_cleaner 0/ 5 seastore_epm 0/ 5 seastore_lba 0/ 5 seastore_fixedkv_tree 0/ 5 seastore_cache 0/ 5 seastore_journal 0/ 5 seastore_device 0/ 5 seastore_backref 0/ 5 alienstore 1/ 5 mclock 0/ 5 cyanstore 1/ 5 ceph_exporter 1/ 5 memstore -2/-2 (syslog threshold) 99/99 (stderr threshold) --- pthread ID / name mapping for recent threads --- 7f8093c91700 / ms_dispatch 7f8095494700 / io_context_pool 7f819bb43700 / lifecycle_thr_2 7f819db47700 / lifecycle_thr_1 7f819fb4b700 / lifecycle_thr_0 7f81a6358700 / rgw_obj_expirer 7f81a6b59700 / rgw_gc 7f81a8b5d700 / ms_dispatch 7f81aa360700 / io_context_pool 7f81aab61700 / rgw_dt_lg_renew 7f81bbb83700 / safe_timer 7f81bcb85700 / ms_dispatch 7f81bd386700 / ceph_timer 7f81be388700 / io_context_pool 7f81c1b8f700 / admin_socket 7f81c2390700 / service 7f81c2b91700 / msgr-worker-2 7f81c3392700 / msgr-worker-1 7f81c3b93700 / msgr-worker-0 7f81c5804a80 / radosgw max_recent 
10000 max_new 1000 log_file /var/lib/ceph/crash/2024-01-23T13:09:25.023127Z_f9c5ce06-f7c6-4a2c-9504-a3f30fa27794/log --- end dump of recent events ---
Cluster status:
  cluster:
    id:     f31bf636-769f-423b-bc43-5ccf3d1197b1
    health: HEALTH_WARN
            739 daemons have recently crashed
            11 mgr modules have recently crashed

  services:
    mon: 3 daemons, quorum a,b,d (age 8h)
    mgr: a(active, since 2h), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 8 osds: 8 up (since 8h), 8 in (since 8d)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   15 pools, 281 pgs
    objects: 31.02k objects, 8.3 GiB
    usage:   26 GiB used, 3.5 TiB / 3.5 TiB avail
    pgs:     281 active+clean

  io:
    client: 1.2 KiB/s rd, 2 op/s rd, 0 op/s wr
Environment:
OS: CentOS 7.9
Kernel: 3.10.0-1160.105.1.el7.x86_64 #1 SMP Thu Dec 7 15:39:45 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Cloud provider or hardware configuration: microk8s (v1.25) cluster of 4 bare-metal nodes (48 cores, 384 GB RAM each), each with 4x 4 TB SSDs; on each node, 2 of the SSDs contain one partition configured as an OSD for Ceph.
Rook version: rook: v1.13.2 / go: go1.21.5
Storage backend version: ceph version 18.2.1 (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)
Kubernetes version: Client Version: v1.25.16 / Kustomize Version: v4.5.7 / Server Version: v1.25.16
Kubernetes cluster type: microk8s v1.25.16 revision 6254
Updated by Jo Sta 3 months ago
I am not alone with this problem; a second user reported it as a comment on my rook issue:
https://github.com/rook/rook/issues/13614#issuecomment-1938451605
Updated by Yuval Lifshitz about 2 months ago
- Assignee set to Yuval Lifshitz
- Tags set to notifications
rook 1.13 supports ceph v17 or v18.
but according to: https://docs.ceph.com/en/latest/start/os-recommendations/
we do not support running ceph v17/v18 on centos7.
since the crash is due to an assert on a failed system call, the problem is most likely due to OS compatibility