Bug #60986
crash: void MDCache::rejoin_send_rejoins(): assert(auth >= 0)
Status: Open
Stack signature: 5069956bbe0e3828fa162226b569d9ebfaea1b69544f0f4d5b3cf057977b8640
Description
Assert condition: auth >= 0
Assert function: void MDCache::rejoin_send_rejoins()
Sanitized backtrace:
MDCache::rejoin_send_rejoins()
MDSRankDispatcher::handle_mds_map(boost::intrusive_ptr<MMDSMap const> const&, MDSMap const&)
MDSDaemon::handle_mds_map(boost::intrusive_ptr<MMDSMap const> const&)
MDSDaemon::handle_core_message(boost::intrusive_ptr<Message const> const&)
MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)
DispatchQueue::entry()
DispatchQueue::DispatchThread::entry()
Crash dump sample:
{ "assert_condition": "auth >= 0", "assert_file": "mds/MDCache.cc", "assert_func": "void MDCache::rejoin_send_rejoins()", "assert_line": 4084, "assert_msg": "mds/MDCache.cc: In function 'void MDCache::rejoin_send_rejoins()' thread 7fad76ed2700 time 2023-04-25T04:30:36.005858-0400\nmds/MDCache.cc: 4084: FAILED ceph_assert(auth >= 0)", "assert_thread_name": "ms_dispatch", "backtrace": [ "/lib64/libpthread.so.0(+0x12d80) [0x7fad7f7e5d80]", "gsignal()", "abort()", "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x18f) [0x7fad807e947b]", "/usr/lib64/ceph/libceph-common.so.2(+0x2695e7) [0x7fad807e95e7]", "(MDCache::rejoin_send_rejoins()+0x2013) [0x55fd558e2eb3]", "(MDSRankDispatcher::handle_mds_map(boost::intrusive_ptr<MMDSMap const> const&, MDSMap const&)+0x1b72) [0x55fd5577d2a2]", "(MDSDaemon::handle_mds_map(boost::intrusive_ptr<MMDSMap const> const&)+0xd66) [0x55fd5574e9b6]", "(MDSDaemon::handle_core_message(boost::intrusive_ptr<Message const> const&)+0x367) [0x55fd55751f77]", "(MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x177) [0x55fd557526f7]", "(Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x478) [0x7fad80a64668]", "(DispatchQueue::entry()+0x50f) [0x7fad80a61aaf]", "(DispatchQueue::DispatchThread::entry()+0x11) [0x7fad80b28ef1]", "/lib64/libpthread.so.0(+0x82de) [0x7fad7f7db2de]", "clone()" ], "ceph_version": "17.2.5", "crash_id": "2023-04-25T08:30:36.008366Z_31305a85-41e6-4558-9503-9dd700dfb575", "entity_name": "mds.ea712f0f7730e25884ca687e6f7d3d73e1e91998", "os_id": "centos", "os_name": "CentOS Linux", "os_version": "8 (Core)", "os_version_id": "8", "process_name": "ceph-mds", "stack_sig": "5069956bbe0e3828fa162226b569d9ebfaea1b69544f0f4d5b3cf057977b8640", "timestamp": "2023-04-25T08:30:36.008366Z", "utsname_machine": "x86_64", "utsname_release": "4.18.0-80.el8.x86_64", "utsname_sysname": "Linux", "utsname_version": "#1 SMP Tue Jun 4 09:19:46 UTC 2019" }
Updated by Milind Changire 11 months ago
- Related to Bug #54765: crash: void MDCache::rejoin_send_rejoins(): assert(auth >= 0) added
Updated by Xiubo Li 4 days ago · Edited
A new report from the ceph-users mailing list: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/GOAZLA6NQHYFM3OHQ6ALWU6VQENCB5OS/
Hi, a 17.2.7 cluster with two filesystems suddenly has non-working MDSs:

# ceph -s
  cluster:
    id:     f54eea86-265a-11eb-a5d0-457857ba5742
    health: HEALTH_ERR
            22 failed cephadm daemon(s)
            2 filesystems are degraded
            1 mds daemon damaged
            insufficient standby MDS daemons available

  services:
    mon: 5 daemons, quorum ceph00,ceph03,ceph04,ceph01,ceph02 (age 4h)
    mgr: ceph03.odfupq(active, since 4h), standbys: ppc721.vsincn, ceph00.lvbddp, ceph02.zhyxjg, ceph06.eifppc
    mds: 4/5 daemons up
    osd: 145 osds: 145 up (since 20h), 145 in (since 2d)
    rgw: 12 daemons active (4 hosts, 1 zones)

  data:
    volumes: 0/2 healthy, 2 recovering; 1 damaged
    pools:   15 pools, 4897 pgs
    objects: 195.64M objects, 195 TiB
    usage:   617 TiB used, 527 TiB / 1.1 PiB avail
    pgs:     4892 active+clean
             5    active+clean+scrubbing+deep

  io:
    client: 2.5 MiB/s rd, 20 MiB/s wr, 665 op/s rd, 938 op/s wr

# ceph fs status
ABC - 4 clients
===========
RANK  STATE    MDS                ACTIVITY  DNS    INOS   DIRS   CAPS
 0    failed
 1    resolve  ABC.ceph04.lzlkdu              0      3      1      0
 2    resolve  ABC.ppc721.rzfmyi              0      3      1      0
 3    resolve  ABC.ceph04.jiepaw            249    252     13      0
      POOL         TYPE     USED  AVAIL
cephfs.ABC.meta  metadata  33.0G   104T
cephfs.ABC.data    data     390T   104T
DEF - 154 clients
===========
RANK  STATE          MDS                ACTIVITY  DNS    INOS   DIRS   CAPS
 0    rejoin(laggy)  DEF.ceph06.etthum          30.9k  30.8k   5084      0
      POOL         TYPE     USED  AVAIL
cephfs.DEF.meta  metadata   190G   104T
cephfs.DEF.data    data     118T   104T
MDS version: ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)

The first filesystem will not get an MDS in rank 0; we already tried to set max_mds to 1, but to no avail. The second filesystem's MDS shows "replay" for a while and then it crashes in the rejoin phase with:

 -92> 2024-05-06T16:07:15.514+0000 7f1927e9d700  1 mds.0.501522 handle_mds_map i am now mds.0.501522
 -91> 2024-05-06T16:07:15.514+0000 7f1927e9d700  1 mds.0.501522 handle_mds_map state change up:reconnect --> up:rejoin
 -90> 2024-05-06T16:07:15.514+0000 7f1927e9d700  1 mds.0.501522 rejoin_start
 -89> 2024-05-06T16:07:15.514+0000 7f1927e9d700  1 mds.0.501522 rejoin_joint_start
 -88> 2024-05-06T16:07:15.514+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x300005bfece err -22/0
 -87> 2024-05-06T16:07:15.514+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x30000671eb5 err -22/0
 -86> 2024-05-06T16:07:15.514+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x300005bfed3 err -22/0
 -85> 2024-05-06T16:07:15.514+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x200012bc94c err -22/0
 -84> 2024-05-06T16:07:15.514+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x300005b0274 err -22/0
 -83> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x30000671eb5 err -22/0
 -82> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x200012bc94c err -22/0
 -81> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x30000671ebd err -22/-22
 -80> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x30000671ecd err -22/-22
 -79> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x200012bc9ea err -22/-22
 -78> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x300005bfed3 err -22/0
 -77> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x200012bc9c3 err -22/-22
 -76> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x200012bc978 err -22/-22
 -75> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x200012bc99d err -22/-22
 -74> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x200012bc95b err -22/-22
 -73> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x200012bc980 err -22/-22
 -72> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x300005b0274 err -22/0
 -71> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x20001dc7a7e err -22/-22
 -70> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x200012be364 err -22/-22
 -69> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x300005b2e32 err -22/-22
 -68> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x30000671eb5 err -22/0
 -67> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x200012bc94c err -22/0
 -66> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x30000671ebd err -22/-22
 -65> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x30000671ecd err -22/-22
 -64> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x200012bc9ea err -22/-22
 -63> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x200012bc978 err -22/-22
 -62> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x3000069373a err -22/-22
 -61> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x200012dc5d8 err -22/-22
 -60> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x400000a32e8 err -22/-22
 -59> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x30000696952 err -22/-22
 -58> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x200012bc94c err -22/0
 -57> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x300005bfed3 err -22/0
 -56> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x200012bc99d err -22/-22
 -55> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x200012bc980 err -22/-22
 -54> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x20001dc7a7e err -22/-22
 -53> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x400000a58cf err -22/-22
 -52> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x200012c3a0e err -22/-22
 -51> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x200012bc94c err -22/0
 -50> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x300005b0274 err -22/0
 -49> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x20001dc7a7f err -22/-22
 -48> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x200012bc95b err -22/-22
 -47> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x300005b2e32 err -22/-22
 -46> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x200012be364 err -22/-22
 -45> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x300005b388d err -22/-22
 -44> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x10007185ac2 err -22/-22
 -43> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x200012bc94c err -22/0
 -42> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x200012bc980 err -22/-22
 -41> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x200012bc99d err -22/-22
 -40> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x400000a58cf err -22/-22
 -39> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x200012c3a0e err -22/-22
 -38> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x400000a58db err -22/-22
 -37> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x40000d63bff err -22/-22
 -36> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x200012bc94c err -22/0
 -35> 2024-05-06T16:07:15.526+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x200012bc95b err -22/-22
 -34> 2024-05-06T16:07:15.526+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x200012be364 err -22/-22
 -33> 2024-05-06T16:07:15.526+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x10007185ac2 err -22/-22
 -32> 2024-05-06T16:07:15.526+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x200012bc94c err -22/0
 -31> 2024-05-06T16:07:15.526+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x10007185ac4 err -22/-22
 -30> 2024-05-06T16:07:15.526+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x200012bc980 err -22/-22
 -29> 2024-05-06T16:07:15.526+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x400000a58cf err -22/-22
 -28> 2024-05-06T16:07:15.526+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x400000a58db err -22/-22
 -27> 2024-05-06T16:07:15.526+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x400000a58f4 err -22/-22
 -26> 2024-05-06T16:07:15.530+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x200012bc94c err -22/0
 -25> 2024-05-06T16:07:15.534+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x200012bc980 err -22/-22
 -24> 2024-05-06T16:07:15.534+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x400000a58cf err -22/-22
 -23> 2024-05-06T16:07:15.534+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x400000a58db err -22/-22
 -22> 2024-05-06T16:07:15.534+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x400000a58f4 err -22/-22
 -21> 2024-05-06T16:07:15.534+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x400000a5fc4 err -22/-22
 -20> 2024-05-06T16:07:15.534+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x400000a634d err -22/-22
 -19> 2024-05-06T16:07:15.534+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x400000a63bf err -22/-22
 -18> 2024-05-06T16:07:15.542+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x200012bc94c err -22/0
 -17> 2024-05-06T16:07:15.542+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x200012bc980 err -22/-22
 -16> 2024-05-06T16:07:15.546+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x400000a58cf err -22/-22
 -15> 2024-05-06T16:07:15.550+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x400000a58db err -22/-22
 -14> 2024-05-06T16:07:15.550+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x400000a58f4 err -22/-22
 -13> 2024-05-06T16:07:15.550+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x400000a5fc4 err -22/-22
 -12> 2024-05-06T16:07:15.550+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x400000a634d err -22/-22
 -11> 2024-05-06T16:07:15.550+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x400000a63bf err -22/-22
 -10> 2024-05-06T16:07:15.550+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x200012bfd9c err -22/-22
  -9> 2024-05-06T16:07:15.550+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x200012bfb78 err -22/-22
  -8> 2024-05-06T16:07:15.554+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x300005b0274 err -22/0
  -7> 2024-05-06T16:07:15.554+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x300005b2e32 err -22/-22
  -6> 2024-05-06T16:07:15.554+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x300005b388d err -22/-22
  -5> 2024-05-06T16:07:15.562+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x300005b0274 err -22/0
  -4> 2024-05-06T16:07:15.562+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x300005b2e32 err -22/-22
  -3> 2024-05-06T16:07:15.562+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x300005b388d err -22/-22
  -2> 2024-05-06T16:07:15.562+0000 7f1921e91700  0 mds.0.cache failed to open ino 0x40000d5a226 err -22/-22
  -1> 2024-05-06T16:07:15.634+0000 7f1921e91700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/mds/MDCache.cc: In function 'void MDCache::rejoin_send_rejoins()' thread 7f1921e91700 time 2024-05-06T16:07:15.635683+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/mds/MDCache.cc: 4086: FAILED ceph_assert(auth >= 0)

ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x7f1930ad94a3]
2: /usr/lib64/ceph/libceph-common.so.2(+0x269669) [0x7f1930ad9669]
3: (MDCache::rejoin_send_rejoins()+0x216b) [0x5614ac8747eb]
4: (MDCache::process_imported_caps()+0x1993) [0x5614ac872353]
5: (Context::complete(int)+0xd) [0x5614ac6e182d]
6: (MDSContext::complete(int)+0x5f) [0x5614aca41f4f]
7: (void finish_contexts<std::vector<MDSContext*, std::allocator<MDSContext*> > >(ceph::common::CephContext*, std::vector<MDSContext*, std::allocator<MDSContext*> >&, int)+0x8d) [0x5614ac6e6f5d]
8: (OpenFileTable::_open_ino_finish(inodeno_t, int)+0x156) [0x5614aca765a6]
9: (MDSContext::complete(int)+0x5f) [0x5614aca41f4f]
10: (void finish_contexts<std::vector<MDSContext*, std::allocator<MDSContext*> > >(ceph::common::CephContext*, std::vector<MDSContext*, std::allocator<MDSContext*> >&, int)+0x8d) [0x5614ac6e6f5d]
11: (MDCache::open_ino_finish(inodeno_t, MDCache::open_ino_info_t&, int)+0x138) [0x5614ac867168]
12: (MDCache::_open_ino_backtrace_fetched(inodeno_t, ceph::buffer::v15_2_0::list&, int)+0x290) [0x5614ac87ff90]
13: (MDSContext::complete(int)+0x5f) [0x5614aca41f4f]
14: (MDSIOContextBase::complete(int)+0x534) [0x5614aca426e4]
15: (Finisher::finisher_thread_entry()+0x18d) [0x7f1930b7884d]
16: /lib64/libpthread.so.0(+0x81ca) [0x7f192fac81ca]
17: clone()

How do we solve this issue?
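An annotation on reading the log above (not part of the quoted report): the values in "err -22/..." are negated errno codes, so -22 is -EINVAL; together with the backtrace frames through MDCache::_open_ino_backtrace_fetched(), this typically means the backtrace objects for those inodes were fetched but failed validation, rather than being absent (which would show -2, -ENOENT). A quick check of the two codes:

#include <cstdio>
#include <cstring>

int main() {
    // The MDS logs negated errno values; "err -22" is -EINVAL.
    std::printf("errno 22 = %s\n", std::strerror(22)); // Invalid argument
    std::printf("errno 2  = %s\n", std::strerror(2));  // No such file or directory
    return 0;
}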
Updated by Xiubo Li 4 days ago
Another report from the same list: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/NDWFYV5XFDCUW5EBRWXEDQFGVFL5HAIV/:
Hi all,

we have a serious problem with CephFS. A few days ago, the CephFS file systems became inaccessible, with the message

MDS_DAMAGE: 1 mds daemon damaged

The cephfs-journal-tool tells us: "Overall journal integrity: OK"

The usual attempts with redeploy were unfortunately not successful. After many attempts to achieve something with the orchestrator, we set the MDS to "failed" and provoked the creation of new MDS daemons with "ceph fs reset". But this MDS crashes:

ceph-17.2.7/src/mds/MDCache.cc: In function 'void MDCache::rejoin_send_rejoins()'
ceph-17.2.7/src/mds/MDCache.cc: 4086: FAILED ceph_assert(auth >= 0)

(The full trace is attached.) What can we do now? We are grateful for any help!
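(Editorial note: the "Overall journal integrity: OK" line quoted above is the summary printed by "cephfs-journal-tool --rank=<fs>:<rank> journal inspect", so the journal scanned clean even while the MDS rank itself was marked damaged.)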
Updated by Robert Sander 4 days ago
- File ceph-mds.vol-bierinf.ceph04.jiepaw.log added
- File ceph-mds.vol-bierinf.ceph04.lzlkdu.log added
- File ceph-mds.vol-bierinf.ppc721.dfkfzy.log added
- File ceph-mds.vol-bierinf.ppc721.rzfmyi.log added
Xiubo Li wrote in #note-4:
A new report from the ceph-user mail list: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/GOAZLA6NQHYFM3OHQ6ALWU6VQENCB5OS/
We have now collected debug logs. The following actions were taken:
ceph fs set ksz-cephfs2 max_mds 1
ceph fs set vol-bierinf max_mds 1
ceph fs set ksz-cephfs2 standby_count_wanted 1
ceph fs set vol-bierinf standby_count_wanted 1
# for _md in $(ceph orch ps | grep mds | cut -d ' ' -f 1)
> do
> ceph orch daemon stop $_md
> done
...
# ceph fs status
vol-bierinf - 0 clients
===========
RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS
0 failed
1 failed
2 failed
3 failed
POOL TYPE USED AVAIL
cephfs.vol-bierinf.meta metadata 33.0G 104T
cephfs.vol-bierinf.data data 390T 104T
ksz-cephfs2 - 0 clients
===========
RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS
0 failed
POOL TYPE USED AVAIL
cephfs.ksz-cephfs2.meta metadata 190G 104T
cephfs.ksz-cephfs2.data data 118T 104T
# ceph fs dump
e501532
enable_multiple, ever_enabled_multiple: 1,1
default compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable
ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned
encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: -1
Filesystem 'vol-bierinf' (10)
fs_name vol-bierinf
epoch 501532
flags 12 joinable allow_snaps allow_multimds_snaps
created 2021-05-04T12:21:48.898727+0000
modified 2024-05-07T06:37:23.924755+0000
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 8796093022208
required_client_features {}
last_failure 0
last_failure_osd_epoch 279880
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default
file layouts on dirs,4=dir inode in separate object,5=mds uses versioned
encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file
layout v2,10=snaprealm v2}
max_mds 1
in 0,1,2,3
up {}
failed 1,2,3
damaged 0
stopped 4,5,6,7
data_pools [26]
metadata_pool 25
inline_data disabled
balancer
standby_count_wanted 1
Filesystem 'ksz-cephfs2' (11)
fs_name ksz-cephfs2
epoch 501530
flags 12 joinable allow_snaps allow_multimds_snaps
created 2024-05-05T19:00:35.635663+0000
modified 2024-05-07T06:36:09.307179+0000
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 1099511627776
required_client_features {}
last_failure 0
last_failure_osd_epoch 279575
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default
file layouts on dirs,4=dir inode in separate object,5=mds uses versioned
encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file
layout v2,10=snaprealm v2}
max_mds 1
in 0
up {0=8532994}
failed
damaged
stopped 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
data_pools [37]
metadata_pool 36
inline_data disabled
balancer
standby_count_wanted 1
[mds.vol-bierinf.ppc721.meydsp{0:8532994} state up:rejoin seq 4 laggy since
2024-05-06T16:07:32.761409+0000 join_fscid=10 addr
[v2:10.149.12.250:6802/2957952954,v1:10.149.12.250:6803/2957952954] compat
{c=[1],r=[1],i=[7ff]}]
dumped fsmap epoch 501532
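(Editorial note on reading this dump: vol-bierinf has rank 0 marked damaged and ranks 1,2,3 failed with no MDS up, matching the "ceph fs status" output above, while ksz-cephfs2 has its single rank 0 nominally up but stuck in up:rejoin and flagged laggy.)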
ceph config set global debug_mds 25
ceph config set global debug_ms 1
ceph orch daemon start mds.vol-bierinf.ceph04.jiepaw
ceph orch daemon start mds.vol-bierinf.ceph04.lzlkdu # first in ksz-cephfs2 (crash), then in vol-bierinf
ceph orch daemon start mds.vol-bierinf.ppc721.dfkfzy
ceph orch daemon start mds.vol-bierinf.ppc721.rzfmyi
ceph fs status
vol-bierinf - 4 clients
===========
RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS
0 failed
1 resolve vol-bierinf.ceph04.jiepaw 0 3 1 0
2 resolve vol-bierinf.ceph04.lzlkdu 0 3 1 0
3 resolve vol-bierinf.ppc721.dfkfzy 249 252 13 0
POOL TYPE USED AVAIL
cephfs.vol-bierinf.meta metadata 33.0G 104T
cephfs.vol-bierinf.data data 390T 104T
ksz-cephfs2 - 154 clients
===========
RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS
0 rejoin(laggy) vol-bierinf.ppc721.rzfmyi 143 102 14 0
POOL TYPE USED AVAIL
cephfs.ksz-cephfs2.meta metadata 190G 104T
cephfs.ksz-cephfs2.data data 118T 104T
MDS version: ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy
(stable)
ceph config rm global debug_mds
ceph config rm global debug_ms
ceph fs dump
e501610
enable_multiple, ever_enabled_multiple: 1,1
default compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable
ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned
encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: -1
Filesystem 'vol-bierinf' (10)
fs_name vol-bierinf
epoch 501589
flags 12 joinable allow_snaps allow_multimds_snaps
created 2021-05-04T12:21:48.898727+0000
modified 2024-05-07T07:17:51.056080+0000
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 8796093022208
required_client_features {}
last_failure 0
last_failure_osd_epoch 279909
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default
file layouts on dirs,4=dir inode in separate object,5=mds uses versioned
encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file
layout v2,10=snaprealm v2}
max_mds 1
in 0,1,2,3
up {1=8559832,2=8559872,3=8559902}
failed
damaged 0
stopped 4,5,6,7
data_pools [26]
metadata_pool 25
inline_data disabled
balancer
standby_count_wanted 1
[mds.vol-bierinf.ceph04.jiepaw{1:8559832} state up:resolve seq 2 join_fscid=10 addr
[v2:10.149.12.14:6800/3253575127,v1:10.149.12.14:6801/3253575127] compat
{c=[1],r=[1],i=[7ff]}]
[mds.vol-bierinf.ceph04.lzlkdu{2:8559872} state up:resolve seq 2 join_fscid=10 addr
[v2:10.149.12.14:6802/4188570865,v1:10.149.12.14:6803/4188570865] compat
{c=[1],r=[1],i=[7ff]}]
[mds.vol-bierinf.ppc721.dfkfzy{3:8559902} state up:resolve seq 2 join_fscid=10 addr
[v2:10.149.12.250:6800/5841986,v1:10.149.12.250:6801/5841986] compat
{c=[1],r=[1],i=[7ff]}]
Filesystem 'ksz-cephfs2' (11)
fs_name ksz-cephfs2
epoch 501610
flags 12 joinable allow_snaps allow_multimds_snaps
created 2024-05-05T19:00:35.635663+0000
modified 2024-05-07T07:21:25.482889+0000
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 1099511627776
required_client_features {}
last_failure 0
last_failure_osd_epoch 279922
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default
file layouts on dirs,4=dir inode in separate object,5=mds uses versioned
encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file
layout v2,10=snaprealm v2}
max_mds 1
in 0
up {0=8536329}
failed
damaged
stopped 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
data_pools [37]
metadata_pool 36
inline_data disabled
balancer
standby_count_wanted 1
[mds.vol-bierinf.ppc721.rzfmyi{0:8536329} state up:rejoin seq 4 laggy since
2024-05-07T07:21:25.482828+0000 join_fscid=10 addr
[v2:10.149.12.250:6802/3452773200,v1:10.149.12.250:6803/3452773200] compat
{c=[1],r=[1],i=[7ff]}]
dumped fsmap epoch 501610
Updated by Robert Sander 3 days ago
While trying to export the journal, the following error shows up:
# cephfs-journal-tool --rank=ksz-cephfs2:0 journal export /tmp/ksz-cephfs2.rank0.journal.export.bin
journal is 98761711816021~506984147
2024-05-07T16:59:38.160+0200 7f412db87040 -1 Error 22 ((22) Invalid argument) seeking to 0x59d2c0c00d55
Error ((22) Invalid argument)
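An observation, not part of the original comment: the offset the export fails to seek to is exactly the journal's advertised start, so the tool aborts on its very first read of the journal region. A quick conversion shows the two numbers are the same value:

#include <cstdio>

int main() {
    // "journal is 98761711816021~506984147" reads as start~length.
    // The failed seek target 0x59d2c0c00d55 is that same start offset:
    unsigned long long seek_target = 0x59d2c0c00d55ULL;
    std::printf("%llu\n", seek_target); // prints 98761711816021
    return 0;
}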
Updated by Xiubo Li 3 days ago
Robert Sander wrote in #note-6:
Xiubo Li wrote in #note-4:
A new report from the ceph-user mail list: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/GOAZLA6NQHYFM3OHQ6ALWU6VQENCB5OS/
We now collected debug logs. The following actions have been made:
[...]
Since this tracker is assigned to Milind, I will leave it to him. Thanks!
Updated by Robert Sander 3 days ago
Can we please bump the severity a few levels?
There is loss of production as the MDS daemons are currently not running. Even after the upgrade to 18.2.2, the error happens just after starting.
We will collect debug logs and add them here.
Updated by Robert Sander 3 days ago
These are the actions tried so far:
# cephfs-journal-tool --rank=ksz-cephfs2:0 event recover_dentries summary
Events by type:
COMMITTED: 45
EXPORT: 1930
IMPORTFINISH: 2151
IMPORTSTART: 2153
OPEN: 3788
PEERUPDATE: 118
SESSION: 198
SESSIONS: 1
SUBTREEMAP: 128
UPDATE: 43121
Errors: 0
# took approx. 10-15 min.
### JOURNAL TRUNCATION
# cephfs-journal-tool --rank=ksz-cephfs2:0 journal reset
old journal was 98761711816021~506984147
new journal start will be 98762219323392 (523224 bytes past old end)
writing journal head
writing EResetJournal entry
done
### MDS TABLE WIPES
# cephfs-table-tool --rank=ksz-cephfs2:0 reset session
Error (2024-05-08T13:25:25.066+0200 7f7184801000 -1 main: Bad rank selection: --rank=ksz-cephfs2:0'
(2) No such file or directory)
# cephfs-table-tool --rank=ksz-cephfs2:0 reset snap
Error (2024-05-08T13:52:27.464+0200 7f622ab2f000 -1 main: Bad rank selection: --rank=ksz-cephfs2:0'
(2) No such file or directory)
# this is different
# cephfs-table-tool --rank=ksz-cephfs2:0 reset inode
Error (2024-05-08T13:53:44.254+0200 7f2c30c85000 -1 main: Bad rank selection: --rank=ksz-cephfs2:0'
(2) No such file or directory)
# ceph fs reset ksz-cephfs2 --yes-i-really-mean-it
Error EINVAL: all MDS daemons must be inactive before resetting filesystem: set the
cluster_down flag and use `ceph mds fail` to make this so
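(Editorial note: the sequence this error asks for is, roughly, marking the filesystem not joinable, e.g. "ceph fs set ksz-cephfs2 joinable false", then failing each active rank with "ceph mds fail", after which the reset is accepted. These exact commands are my reading of the error text, not something executed in this report.)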
Updated by Robert Sander about 20 hours ago
The cluster now serves the two CephFS filesystems again after running these commands for both:
cephfs-journal-tool event recover_dentries summary
cephfs-journal-tool journal reset
ceph fs reset
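(Editorial note: as in the earlier comments, the two tools are invoked per rank with the --rank=<fs_name>:<rank> argument, e.g. "cephfs-journal-tool --rank=ksz-cephfs2:0 event recover_dentries summary", and "ceph fs reset" takes the filesystem name plus the --yes-i-really-mean-it flag shown above.)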