Bug #62036
src/mds/MDCache.cc: 5131: FAILED ceph_assert(isolated_inodes.empty())
Status:
Pending Backport
Priority:
Normal
Assignee:
Category:
Correctness/Safety
Target version:
% Done:
0%
Source:
Tags:
backport_processed
Backport:
reef,quincy,squid
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
multimds
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
/a/vshankar-2023-07-04_11:59:45-fs-wip-vshankar-testing-20230704.040136-testing-default-smithi/7326619
Description: fs/thrash/workloads/{begin/{0-install 1-ceph 2-logrotate} clusters/1a5s-mds-1c-client conf/{client mds mon osd} distro/{centos_8} mount/kclient/{mount-syntax/{v1} mount overrides/{distro/stock/{k-stock rhel_8} ms-die-on-skipped}} msgr-failures/none objectstore-ec/bluestore-comp overrides/{frag ignorelist_health ignorelist_wrongly_marked_down prefetch_dirfrags/yes prefetch_entire_dirfrags/yes races session_timeout thrashosds-health} ranks/5 tasks/{1-thrash/mds 2-workunit/fs/snaps}}
MDS crash backtrace:
2023-07-04T21:11:10.750 INFO:tasks.ceph.mds.d.smithi060.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.0.0-4725-gddde6a3b/rpm/el8/BUILD/ceph-18.0.0-4725-gddde6a3b/src/mds/MDCache.cc: In function 'void MDCache::handle_cache_rejoin_ack(ceph::cref_t<MMDSCacheRejoin>&)' thread 7f2bba30f700 time 2023-07-04T21:11:10.750107+0000
2023-07-04T21:11:10.751 INFO:tasks.ceph.mds.d.smithi060.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.0.0-4725-gddde6a3b/rpm/el8/BUILD/ceph-18.0.0-4725-gddde6a3b/src/mds/MDCache.cc: 5131: FAILED ceph_assert(isolated_inodes.empty())
2023-07-04T21:11:10.752 INFO:tasks.ceph.mds.d.smithi060.stderr: ceph version 18.0.0-4725-gddde6a3b (ddde6a3b6f51074e32a7fa89805bbbb065d047c2) reef (dev)
2023-07-04T21:11:10.752 INFO:tasks.ceph.mds.d.smithi060.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x7f2bc5bc7dbf]
2023-07-04T21:11:10.752 INFO:tasks.ceph.mds.d.smithi060.stderr: 2: /usr/lib64/ceph/libceph-common.so.2(+0x2a9f85) [0x7f2bc5bc7f85]
2023-07-04T21:11:10.753 INFO:tasks.ceph.mds.d.smithi060.stderr: 3: (MDCache::handle_cache_rejoin_ack(boost::intrusive_ptr<MMDSCacheRejoin const> const&)+0x2494) [0x559579cdddc4]
2023-07-04T21:11:10.753 INFO:tasks.ceph.mds.d.smithi060.stderr: 4: (MDCache::dispatch(boost::intrusive_ptr<Message const> const&)+0x434) [0x559579d03ca4]
2023-07-04T21:11:10.753 INFO:tasks.ceph.mds.d.smithi060.stderr: 5: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x5a3) [0x559579b6cfb3]
2023-07-04T21:11:10.753 INFO:tasks.ceph.mds.d.smithi060.stderr: 6: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x5c) [0x559579b6d5cc]
2023-07-04T21:11:10.753 INFO:tasks.ceph.mds.d.smithi060.stderr: 7: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x1bf) [0x559579b56e4f]
2023-07-04T21:11:10.753 INFO:tasks.ceph.mds.d.smithi060.stderr: 8: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x478) [0x7f2bc5e40828]
2023-07-04T21:11:10.754 INFO:tasks.ceph.mds.d.smithi060.stderr: 9: (DispatchQueue::entry()+0x50f) [0x7f2bc5e3d9cf]
2023-07-04T21:11:10.754 INFO:tasks.ceph.mds.d.smithi060.stderr: 10: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f2bc5f03cc1]
2023-07-04T21:11:10.754 INFO:tasks.ceph.mds.d.smithi060.stderr: 11: /lib64/libpthread.so.0(+0x81cf) [0x7f2bc49701cf]
2023-07-04T21:11:10.754 INFO:tasks.ceph.mds.d.smithi060.stderr: 12: clone()
Updated by Venky Shankar 10 months ago
BTW, I did not debug into this as it is unrelated to the PRs in the test branch.
This needs triage and an RCA.
Updated by Xiubo Li 9 months ago
The mds became up:active before receiving the last cache_rejoin ack:
-1680> 2023-07-04T21:11:07.991+0000 7f2bb830b700  5 mds.beacon.d Sending beacon up:active seq 97
-1679> 2023-07-04T21:11:07.991+0000 7f2bb830b700 10 monclient: _send_mon_message to mon.c at v2:172.21.15.60:3301/0
-1678> 2023-07-04T21:11:07.991+0000 7f2bb830b700  1 -- [v2:172.21.15.60:6834/1434001575,v1:172.21.15.60:6835/1434001575] --> [v2:172.21.15.60:3301/0,v1:172.21.15.60:6790/0] -- mdsbeacon(5375/d up:active seq=97 v160) v8 -- 0x55957f09f600 con 0x55957bc11400
-1677> 2023-07-04T21:11:07.991+0000 7f2bb830b700 20 mds.beacon.d sender thread waiting interval 4s
-1676> 2023-07-04T21:11:07.992+0000 7f2bbcb14700  1 -- [v2:172.21.15.60:6834/1434001575,v1:172.21.15.60:6835/1434001575] <== mon.2 v2:172.21.15.60:3301/0 191 ==== mdsbeacon(5375/d up:active seq=97 v160) v8 ==== 130+0+0 (secure 0 0 0) 0x55957f09f600 con 0x55957bc11400
-1675> 2023-07-04T21:11:07.992+0000 7f2bbcb14700  5 mds.beacon.d received beacon reply up:active seq 97 rtt 0.00099999
-1674> 2023-07-04T21:11:07.992+0000 7f2bb830b700 20 mds.beacon.d sender thread waiting interval 3.999s
-1673> 2023-07-04T21:11:08.049+0000 7f2bb8b0c700 20 mds.0.118 get_task_status
-1672> 2023-07-04T21:11:08.049+0000 7f2bb8b0c700 20 mds.0.118 schedule_update_timer_task
 -167> 2023-07-04T21:11:10.747+0000 7f2bba30f700  7 mds.0.cache handle_cache_rejoin cache_rejoin ack v2 from mds.2 (110185 bytes)
 -166> 2023-07-04T21:11:10.747+0000 7f2bba30f700  7 mds.0.cache handle_cache_rejoin_ack from mds.2
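The timeline above boils down to a gather-set problem: the rank should not treat the rejoin phase as finished (and start trimming) while cache_rejoin acks are still outstanding. A minimal, self-contained sketch of that bookkeeping (the `rejoin_ack_gather` name mirrors the MDCache member, but the surrounding struct is hypothetical, not the actual Ceph code):

```cpp
#include <cassert>
#include <cstdint>
#include <set>

using mds_rank_t = int32_t;

// Hypothetical, stripped-down model of the rejoin-ack gather logic:
// track the set of ranks whose cache_rejoin ack is still outstanding,
// and only consider the rejoin phase complete when the set drains.
struct RejoinTracker {
  std::set<mds_rank_t> rejoin_ack_gather;  // ranks we still expect an ack from

  void send_rejoin(mds_rank_t to) { rejoin_ack_gather.insert(to); }

  // Process an ack; returns true when it was the last one outstanding.
  bool handle_ack(mds_rank_t from) {
    rejoin_ack_gather.erase(from);
    return rejoin_ack_gather.empty();
  }

  // Trimming subtrees is only safe once every ack has been processed.
  bool safe_to_trim() const { return rejoin_ack_gather.empty(); }
};
```

In the crash above, the rank went up:active (and became eligible to trim) while `handle_ack` for mds.2 was still pending, which is exactly the window this sketch is meant to close.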
The comment in the source code also notes that this can happen when the subtree is trimmed before the cache_rejoin ack is received:
4940 void MDCache::handle_cache_rejoin_ack(const cref_t<MMDSCacheRejoin> &ack)
4941 {
...
5128
5129   // FIXME: This can happen if entire subtree, together with the inode subtree root
5130   // belongs to, were trimmed between sending cache rejoin and receiving rejoin ack.
5131   ceph_assert(isolated_inodes.empty());
5132 ...
So we need to defer trimming the subtrees until the last cache_rejoin ack finishes.
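The deferral idea can be sketched as a small gate: a trim requested while acks are outstanding is remembered and replayed once the final ack is processed. This is an illustrative model of the approach, not the actual patch; all names (`TrimGate`, `request_trim`, etc.) are hypothetical:

```cpp
#include <cassert>

// Illustrative model of "defer trimming until the last cache_rejoin ack":
// if a trim is requested while acks are outstanding, set a flag instead of
// trimming immediately, and run the postponed trim when the final ack lands.
struct TrimGate {
  int pending_acks = 0;      // cache_rejoin acks not yet received
  bool trim_deferred = false;
  int trims_done = 0;

  void request_trim() {
    if (pending_acks > 0)
      trim_deferred = true;  // not safe yet: subtree roots may still be needed
    else
      do_trim();
  }

  void handle_rejoin_ack() {
    if (--pending_acks == 0 && trim_deferred) {
      trim_deferred = false;
      do_trim();             // run the trim we postponed during rejoin
    }
  }

  void do_trim() { ++trims_done; }
};
```

With this ordering, `isolated_inodes` can no longer be populated by a trim that raced ahead of the last ack, so the assert at MDCache.cc:5131 has nothing to fire on.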
Updated by Rishabh Dave about 1 month ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot about 1 month ago
- Copied to Backport #65321: reef: src/mds/MDCache.cc: 5131: FAILED ceph_assert(isolated_inodes.empty()) added
Updated by Backport Bot about 1 month ago
- Copied to Backport #65322: quincy: src/mds/MDCache.cc: 5131: FAILED ceph_assert(isolated_inodes.empty()) added
Updated by Backport Bot about 1 month ago
- Copied to Backport #65323: squid: src/mds/MDCache.cc: 5131: FAILED ceph_assert(isolated_inodes.empty()) added