Bug #62036

src/mds/MDCache.cc: 5131: FAILED ceph_assert(isolated_inodes.empty())

Added by Venky Shankar 8 months ago. Updated 7 months ago.

Status:
Fix Under Review
Priority:
Normal
Assignee:
Category:
Correctness/Safety
Target version:
% Done:

0%

Source:
Tags:
Backport:
reef,quincy,pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
multimds
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/vshankar-2023-07-04_11:59:45-fs-wip-vshankar-testing-20230704.040136-testing-default-smithi/7326619

Description: fs/thrash/workloads/{begin/{0-install 1-ceph 2-logrotate} clusters/1a5s-mds-1c-client conf/{client mds mon osd} distro/{centos_8} mount/kclient/{mount-syntax/{v1} mount overrides/{distro/stock/{k-stock rhel_8} ms-die-on-skipped}} msgr-failures/none objectstore-ec/bluestore-comp overrides/{frag ignorelist_health ignorelist_wrongly_marked_down prefetch_dirfrags/yes prefetch_entire_dirfrags/yes races session_timeout thrashosds-health} ranks/5 tasks/{1-thrash/mds 2-workunit/fs/snaps}}

MDS crash backtrace:

2023-07-04T21:11:10.750 INFO:tasks.ceph.mds.d.smithi060.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.0.0-4725-gddde6a3b/rpm/el8/BUILD/ceph-18.0.0-4725-
gddde6a3b/src/mds/MDCache.cc: In function 'void MDCache::handle_cache_rejoin_ack(ceph::cref_t<MMDSCacheRejoin>&)' thread 7f2bba30f700 time 2023-07-04T21:11:10.750107+0000
2023-07-04T21:11:10.751 INFO:tasks.ceph.mds.d.smithi060.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.0.0-4725-gddde6a3b/rpm/el8/BUILD/ceph-18.0.0-4725-gddde6a3b/src/mds/MDCache.cc: 5131: FAILED ceph_assert(isolated_inodes.empty())
2023-07-04T21:11:10.752 INFO:tasks.ceph.mds.d.smithi060.stderr: ceph version 18.0.0-4725-gddde6a3b (ddde6a3b6f51074e32a7fa89805bbbb065d047c2) reef (dev)
2023-07-04T21:11:10.752 INFO:tasks.ceph.mds.d.smithi060.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x7f2bc5bc7dbf]
2023-07-04T21:11:10.752 INFO:tasks.ceph.mds.d.smithi060.stderr: 2: /usr/lib64/ceph/libceph-common.so.2(+0x2a9f85) [0x7f2bc5bc7f85]
2023-07-04T21:11:10.753 INFO:tasks.ceph.mds.d.smithi060.stderr: 3: (MDCache::handle_cache_rejoin_ack(boost::intrusive_ptr<MMDSCacheRejoin const> const&)+0x2494) [0x559579cdddc4]
2023-07-04T21:11:10.753 INFO:tasks.ceph.mds.d.smithi060.stderr: 4: (MDCache::dispatch(boost::intrusive_ptr<Message const> const&)+0x434) [0x559579d03ca4]
2023-07-04T21:11:10.753 INFO:tasks.ceph.mds.d.smithi060.stderr: 5: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x5a3) [0x559579b6cfb3]
2023-07-04T21:11:10.753 INFO:tasks.ceph.mds.d.smithi060.stderr: 6: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x5c) [0x559579b6d5cc]
2023-07-04T21:11:10.753 INFO:tasks.ceph.mds.d.smithi060.stderr: 7: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x1bf) [0x559579b56e4f]
2023-07-04T21:11:10.753 INFO:tasks.ceph.mds.d.smithi060.stderr: 8: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x478) [0x7f2bc5e40828]
2023-07-04T21:11:10.754 INFO:tasks.ceph.mds.d.smithi060.stderr: 9: (DispatchQueue::entry()+0x50f) [0x7f2bc5e3d9cf]
2023-07-04T21:11:10.754 INFO:tasks.ceph.mds.d.smithi060.stderr: 10: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f2bc5f03cc1]
2023-07-04T21:11:10.754 INFO:tasks.ceph.mds.d.smithi060.stderr: 11: /lib64/libpthread.so.0(+0x81cf) [0x7f2bc49701cf]
2023-07-04T21:11:10.754 INFO:tasks.ceph.mds.d.smithi060.stderr: 12: clone()

History

#1 Updated by Venky Shankar 8 months ago

BTW, I did not debug this further since it was unrelated to the PRs in the test branch.

This needs triage and a root-cause analysis (RCA).

#2 Updated by Xiubo Li 8 months ago

  • Assignee set to Xiubo Li

#3 Updated by Xiubo Li 7 months ago

  • Status changed from New to In Progress

#4 Updated by Xiubo Li 7 months ago

The mds became up:active before receiving the last cache_rejoin ack:

 -1680> 2023-07-04T21:11:07.991+0000 7f2bb830b700  5 mds.beacon.d Sending beacon up:active seq 97
 -1679> 2023-07-04T21:11:07.991+0000 7f2bb830b700 10 monclient: _send_mon_message to mon.c at v2:172.21.15.60:3301/0
 -1678> 2023-07-04T21:11:07.991+0000 7f2bb830b700  1 -- [v2:172.21.15.60:6834/1434001575,v1:172.21.15.60:6835/1434001575] --> [v2:172.21.15.60:3301/0,v1:172.21.15.60:6790/0] -- mdsbeacon(5375/d up:active seq=97 v160) v8 -- 0x55957f09f600 con 0x55957bc11400
 -1677> 2023-07-04T21:11:07.991+0000 7f2bb830b700 20 mds.beacon.d sender thread waiting interval 4s
 -1676> 2023-07-04T21:11:07.992+0000 7f2bbcb14700  1 -- [v2:172.21.15.60:6834/1434001575,v1:172.21.15.60:6835/1434001575] <== mon.2 v2:172.21.15.60:3301/0 191 ==== mdsbeacon(5375/d up:active seq=97 v160) v8 ==== 130+0+0 (secure 0 0 0) 0x55957f09f600 con 0x55957bc11400
 -1675> 2023-07-04T21:11:07.992+0000 7f2bbcb14700  5 mds.beacon.d received beacon reply up:active seq 97 rtt 0.00099999
 -1674> 2023-07-04T21:11:07.992+0000 7f2bb830b700 20 mds.beacon.d sender thread waiting interval 3.999s
 -1673> 2023-07-04T21:11:08.049+0000 7f2bb8b0c700 20 mds.0.118 get_task_status
 -1672> 2023-07-04T21:11:08.049+0000 7f2bb8b0c700 20 mds.0.118 schedule_update_timer_task
  -167> 2023-07-04T21:11:10.747+0000 7f2bba30f700  7 mds.0.cache handle_cache_rejoin cache_rejoin ack v2 from mds.2 (110185 bytes)
  -166> 2023-07-04T21:11:10.747+0000 7f2bba30f700  7 mds.0.cache handle_cache_rejoin_ack from mds.2

The comment in the source code also notes that this can happen when the subtree was trimmed between sending the cache rejoin and receiving the cache_rejoin ack:

 4940 void MDCache::handle_cache_rejoin_ack(const cref_t<MMDSCacheRejoin> &ack)
 4941 {
...
 5128  
 5129   // FIXME: This can happen if entire subtree, together with the inode subtree root
 5130   // belongs to, were trimmed between sending cache rejoin and receiving rejoin ack.
 5131   ceph_assert(isolated_inodes.empty());
 5132  
...

So we need to defer trimming the subtrees until the last cache_rejoin ack has been received and processed.

#5 Updated by Xiubo Li 7 months ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 52648
