Bug #62036

src/mds/MDCache.cc: 5131: FAILED ceph_assert(isolated_inodes.empty())

Added by Venky Shankar 8 months ago. Updated 7 months ago.

Status:
Fix Under Review
Priority:
Normal
Assignee:
Category:
Correctness/Safety
Target version:
% Done:

0%

Source:
Tags:
Backport:
reef,quincy,pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
multimds
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/vshankar-2023-07-04_11:59:45-fs-wip-vshankar-testing-20230704.040136-testing-default-smithi/7326619

Description: fs/thrash/workloads/{begin/{0-install 1-ceph 2-logrotate} clusters/1a5s-mds-1c-client conf/{client mds mon osd} distro/{centos_8} mount/kclient/{mount-syntax/{v1} mount overrides/{distro/stock/{k-stock rhel_8} ms-die-on-skipped}} msgr-failures/none objectstore-ec/bluestore-comp overrides/{frag ignorelist_health ignorelist_wrongly_marked_down prefetch_dirfrags/yes prefetch_entire_dirfrags/yes races session_timeout thrashosds-health} ranks/5 tasks/{1-thrash/mds 2-workunit/fs/snaps}}

MDS crash backtrace:

2023-07-04T21:11:10.750 INFO:tasks.ceph.mds.d.smithi060.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.0.0-4725-gddde6a3b/rpm/el8/BUILD/ceph-18.0.0-4725-
gddde6a3b/src/mds/MDCache.cc: In function 'void MDCache::handle_cache_rejoin_ack(ceph::cref_t<MMDSCacheRejoin>&)' thread 7f2bba30f700 time 2023-07-04T21:11:10.750107+0000
2023-07-04T21:11:10.751 INFO:tasks.ceph.mds.d.smithi060.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.0.0-4725-gddde6a3b/rpm/el8/BUILD/ceph-18.0.0-4725-gddde6a3b/src/mds/MDCache.cc: 5131: FAILED ceph_assert(isolated_inodes.empty())
2023-07-04T21:11:10.752 INFO:tasks.ceph.mds.d.smithi060.stderr: ceph version 18.0.0-4725-gddde6a3b (ddde6a3b6f51074e32a7fa89805bbbb065d047c2) reef (dev)
2023-07-04T21:11:10.752 INFO:tasks.ceph.mds.d.smithi060.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x7f2bc5bc7dbf]
2023-07-04T21:11:10.752 INFO:tasks.ceph.mds.d.smithi060.stderr: 2: /usr/lib64/ceph/libceph-common.so.2(+0x2a9f85) [0x7f2bc5bc7f85]
2023-07-04T21:11:10.753 INFO:tasks.ceph.mds.d.smithi060.stderr: 3: (MDCache::handle_cache_rejoin_ack(boost::intrusive_ptr<MMDSCacheRejoin const> const&)+0x2494) [0x559579cdddc4]
2023-07-04T21:11:10.753 INFO:tasks.ceph.mds.d.smithi060.stderr: 4: (MDCache::dispatch(boost::intrusive_ptr<Message const> const&)+0x434) [0x559579d03ca4]
2023-07-04T21:11:10.753 INFO:tasks.ceph.mds.d.smithi060.stderr: 5: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x5a3) [0x559579b6cfb3]
2023-07-04T21:11:10.753 INFO:tasks.ceph.mds.d.smithi060.stderr: 6: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x5c) [0x559579b6d5cc]
2023-07-04T21:11:10.753 INFO:tasks.ceph.mds.d.smithi060.stderr: 7: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x1bf) [0x559579b56e4f]
2023-07-04T21:11:10.753 INFO:tasks.ceph.mds.d.smithi060.stderr: 8: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x478) [0x7f2bc5e40828]
2023-07-04T21:11:10.754 INFO:tasks.ceph.mds.d.smithi060.stderr: 9: (DispatchQueue::entry()+0x50f) [0x7f2bc5e3d9cf]
2023-07-04T21:11:10.754 INFO:tasks.ceph.mds.d.smithi060.stderr: 10: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f2bc5f03cc1]
2023-07-04T21:11:10.754 INFO:tasks.ceph.mds.d.smithi060.stderr: 11: /lib64/libpthread.so.0(+0x81cf) [0x7f2bc49701cf]
2023-07-04T21:11:10.754 INFO:tasks.ceph.mds.d.smithi060.stderr: 12: clone()

History

#1 Updated by Venky Shankar 8 months ago

BTW, I did not debug this further since it was unrelated to the PRs in the test branch.

This needs triage and a root-cause analysis (RCA).

#2 Updated by Xiubo Li 8 months ago

  • Assignee set to Xiubo Li

#3 Updated by Xiubo Li 7 months ago

  • Status changed from New to In Progress

#4 Updated by Xiubo Li 7 months ago

The mds became up:active before receiving the last cache_rejoin ack:

 -1680> 2023-07-04T21:11:07.991+0000 7f2bb830b700  5 mds.beacon.d Sending beacon up:active seq 97
 -1679> 2023-07-04T21:11:07.991+0000 7f2bb830b700 10 monclient: _send_mon_message to mon.c at v2:172.21.15.60:3301/0
 -1678> 2023-07-04T21:11:07.991+0000 7f2bb830b700  1 -- [v2:172.21.15.60:6834/1434001575,v1:172.21.15.60:6835/1434001575] --> [v2:172.21.15.60:3301/0,v1:172.21.15.60:6790/0] -- mdsbeacon(5375/d up:active seq=97 v160) v8 -- 0x55957f09f600 con 0x55957bc11400
 -1677> 2023-07-04T21:11:07.991+0000 7f2bb830b700 20 mds.beacon.d sender thread waiting interval 4s
 -1676> 2023-07-04T21:11:07.992+0000 7f2bbcb14700  1 -- [v2:172.21.15.60:6834/1434001575,v1:172.21.15.60:6835/1434001575] <== mon.2 v2:172.21.15.60:3301/0 191 ==== mdsbeacon(5375/d up:active seq=97 v160) v8 ==== 130+0+0 (secure 0 0 0) 0x55957f09f600 con 0x55957bc11400
 -1675> 2023-07-04T21:11:07.992+0000 7f2bbcb14700  5 mds.beacon.d received beacon reply up:active seq 97 rtt 0.00099999
 -1674> 2023-07-04T21:11:07.992+0000 7f2bb830b700 20 mds.beacon.d sender thread waiting interval 3.999s
 -1673> 2023-07-04T21:11:08.049+0000 7f2bb8b0c700 20 mds.0.118 get_task_status
 -1672> 2023-07-04T21:11:08.049+0000 7f2bb8b0c700 20 mds.0.118 schedule_update_timer_task
  -167> 2023-07-04T21:11:10.747+0000 7f2bba30f700  7 mds.0.cache handle_cache_rejoin cache_rejoin ack v2 from mds.2 (110185 bytes)
  -166> 2023-07-04T21:11:10.747+0000 7f2bba30f700  7 mds.0.cache handle_cache_rejoin_ack from mds.2

The comment in the source code also notes that this can happen when the subtree was trimmed between sending the cache rejoin and receiving the cache_rejoin ack:

 4940 void MDCache::handle_cache_rejoin_ack(const cref_t<MMDSCacheRejoin> &ack)
 4941 {
...
 5128  
 5129   // FIXME: This can happen if entire subtree, together with the inode subtree root
 5130   // belongs to, were trimmed between sending cache rejoin and receiving rejoin ack.
 5131   ceph_assert(isolated_inodes.empty());
 5132  
...

So we need to defer trimming the subtrees until the last cache_rejoin ack has been received and processed.

#5 Updated by Xiubo Li 7 months ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 52648
