Bug #62036
src/mds/MDCache.cc: 5131: FAILED ceph_assert(isolated_inodes.empty())
Status:
Pending Backport
Priority:
Normal
Assignee:
Category:
Correctness/Safety
Target version:
% Done:
0%
Source:
Tags:
backport_processed
Backport:
reef,quincy,squid
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
multimds
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
/a/vshankar-2023-07-04_11:59:45-fs-wip-vshankar-testing-20230704.040136-testing-default-smithi/7326619
Description: fs/thrash/workloads/{begin/{0-install 1-ceph 2-logrotate} clusters/1a5s-mds-1c-client conf/{client mds mon osd} distro/{centos_8} mount/kclient/{mount-syntax/{v1} mount overrides/{distro/stock/{k-stock rhel_8} ms-die-on-skipped}} msgr-failures/none objectstore-ec/bluestore-comp overrides/{frag ignorelist_health ignorelist_wrongly_marked_down prefetch_dirfrags/yes prefetch_entire_dirfrags/yes races session_timeout thrashosds-health} ranks/5 tasks/{1-thrash/mds 2-workunit/fs/snaps}}
MDS crash backtrace:
2023-07-04T21:11:10.750 INFO:tasks.ceph.mds.d.smithi060.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.0.0-4725-gddde6a3b/rpm/el8/BUILD/ceph-18.0.0-4725-gddde6a3b/src/mds/MDCache.cc: In function 'void MDCache::handle_cache_rejoin_ack(ceph::cref_t<MMDSCacheRejoin>&)' thread 7f2bba30f700 time 2023-07-04T21:11:10.750107+0000
2023-07-04T21:11:10.751 INFO:tasks.ceph.mds.d.smithi060.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.0.0-4725-gddde6a3b/rpm/el8/BUILD/ceph-18.0.0-4725-gddde6a3b/src/mds/MDCache.cc: 5131: FAILED ceph_assert(isolated_inodes.empty())
2023-07-04T21:11:10.752 INFO:tasks.ceph.mds.d.smithi060.stderr: ceph version 18.0.0-4725-gddde6a3b (ddde6a3b6f51074e32a7fa89805bbbb065d047c2) reef (dev)
2023-07-04T21:11:10.752 INFO:tasks.ceph.mds.d.smithi060.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x7f2bc5bc7dbf]
2023-07-04T21:11:10.752 INFO:tasks.ceph.mds.d.smithi060.stderr: 2: /usr/lib64/ceph/libceph-common.so.2(+0x2a9f85) [0x7f2bc5bc7f85]
2023-07-04T21:11:10.753 INFO:tasks.ceph.mds.d.smithi060.stderr: 3: (MDCache::handle_cache_rejoin_ack(boost::intrusive_ptr<MMDSCacheRejoin const> const&)+0x2494) [0x559579cdddc4]
2023-07-04T21:11:10.753 INFO:tasks.ceph.mds.d.smithi060.stderr: 4: (MDCache::dispatch(boost::intrusive_ptr<Message const> const&)+0x434) [0x559579d03ca4]
2023-07-04T21:11:10.753 INFO:tasks.ceph.mds.d.smithi060.stderr: 5: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x5a3) [0x559579b6cfb3]
2023-07-04T21:11:10.753 INFO:tasks.ceph.mds.d.smithi060.stderr: 6: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x5c) [0x559579b6d5cc]
2023-07-04T21:11:10.753 INFO:tasks.ceph.mds.d.smithi060.stderr: 7: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x1bf) [0x559579b56e4f]
2023-07-04T21:11:10.753 INFO:tasks.ceph.mds.d.smithi060.stderr: 8: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x478) [0x7f2bc5e40828]
2023-07-04T21:11:10.754 INFO:tasks.ceph.mds.d.smithi060.stderr: 9: (DispatchQueue::entry()+0x50f) [0x7f2bc5e3d9cf]
2023-07-04T21:11:10.754 INFO:tasks.ceph.mds.d.smithi060.stderr: 10: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f2bc5f03cc1]
2023-07-04T21:11:10.754 INFO:tasks.ceph.mds.d.smithi060.stderr: 11: /lib64/libpthread.so.0(+0x81cf) [0x7f2bc49701cf]
2023-07-04T21:11:10.754 INFO:tasks.ceph.mds.d.smithi060.stderr: 12: clone()
Updated by Venky Shankar 10 months ago
BTW, I did not debug this, as it was unrelated to the PRs in the test branch.
This needs triage and RCA.
Updated by Xiubo Li 10 months ago
The mds became up:active before receiving the last cache_rejoin ack:
-1680> 2023-07-04T21:11:07.991+0000 7f2bb830b700  5 mds.beacon.d Sending beacon up:active seq 97
-1679> 2023-07-04T21:11:07.991+0000 7f2bb830b700 10 monclient: _send_mon_message to mon.c at v2:172.21.15.60:3301/0
-1678> 2023-07-04T21:11:07.991+0000 7f2bb830b700  1 -- [v2:172.21.15.60:6834/1434001575,v1:172.21.15.60:6835/1434001575] --> [v2:172.21.15.60:3301/0,v1:172.21.15.60:6790/0] -- mdsbeacon(5375/d up:active seq=97 v160) v8 -- 0x55957f09f600 con 0x55957bc11400
-1677> 2023-07-04T21:11:07.991+0000 7f2bb830b700 20 mds.beacon.d sender thread waiting interval 4s
-1676> 2023-07-04T21:11:07.992+0000 7f2bbcb14700  1 -- [v2:172.21.15.60:6834/1434001575,v1:172.21.15.60:6835/1434001575] <== mon.2 v2:172.21.15.60:3301/0 191 ==== mdsbeacon(5375/d up:active seq=97 v160) v8 ==== 130+0+0 (secure 0 0 0) 0x55957f09f600 con 0x55957bc11400
-1675> 2023-07-04T21:11:07.992+0000 7f2bbcb14700  5 mds.beacon.d received beacon reply up:active seq 97 rtt 0.00099999
-1674> 2023-07-04T21:11:07.992+0000 7f2bb830b700 20 mds.beacon.d sender thread waiting interval 3.999s
-1673> 2023-07-04T21:11:08.049+0000 7f2bb8b0c700 20 mds.0.118 get_task_status
-1672> 2023-07-04T21:11:08.049+0000 7f2bb8b0c700 20 mds.0.118 schedule_update_timer_task
-167> 2023-07-04T21:11:10.747+0000 7f2bba30f700  7 mds.0.cache handle_cache_rejoin cache_rejoin ack v2 from mds.2 (110185 bytes)
-166> 2023-07-04T21:11:10.747+0000 7f2bba30f700  7 mds.0.cache handle_cache_rejoin_ack from mds.2
A comment in the source code also notes that this can happen when the subtree is trimmed between sending the cache rejoin and receiving the cache_rejoin ack:
4940 void MDCache::handle_cache_rejoin_ack(const cref_t<MMDSCacheRejoin> &ack)
4941 {
...
5128
5129   // FIXME: This can happen if entire subtree, together with the inode subtree root
5130   // belongs to, were trimmed between sending cache rejoin and receiving rejoin ack.
5131   ceph_assert(isolated_inodes.empty());
5132 ...
So we need to defer trimming the subtrees until the last cache_rejoin ack has been handled.
Updated by Rishabh Dave about 1 month ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot about 1 month ago
- Copied to Backport #65321: reef: src/mds/MDCache.cc: 5131: FAILED ceph_assert(isolated_inodes.empty()) added
Updated by Backport Bot about 1 month ago
- Copied to Backport #65322: quincy: src/mds/MDCache.cc: 5131: FAILED ceph_assert(isolated_inodes.empty()) added
Updated by Backport Bot about 1 month ago
- Copied to Backport #65323: squid: src/mds/MDCache.cc: 5131: FAILED ceph_assert(isolated_inodes.empty()) added
Updated by Xiubo Li about 1 month ago
- Backport changed from reef,quincy,pacific to reef,quincy,squid
Updated by Xiubo Li about 1 month ago
- Status changed from Pending Backport to Fix Under Review
Updated by Xiubo Li about 1 month ago
- Status changed from Fix Under Review to Pending Backport