Bug #55170

mds: crash during rejoin (CDir::fetch_keys)

Added by Venky Shankar 8 months ago. Updated 7 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Correctness/Safety
Target version:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS): multimds
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Seen here: https://pulpito.ceph.com/vshankar-2022-04-03_15:22:48-fs-wip-vshankar-testing-20220403-170344-testing-default-smithi/6775379/

    -5> 2022-04-03T18:21:56.273+0000 7f339054b700 10 mds.1.cache rejoin_gather_finish
    -4> 2022-04-03T18:21:56.273+0000 7f339054b700 10 mds.1.cache open_undef_inodes_dirfrags 21 inodes 0 dirfrags
    -3> 2022-04-03T18:21:56.273+0000 7f339054b700 10 mds.1.cache.dir(0x613.111*) fetch_keys 0 keys on [dir 0x613.111* ~mds1/stray9/ [2,head] auth{0=6} v=21060 cv=0/0 state=1610612736 f(v0 m2022-04-03T18:08:26.278322+0000 22=0+22)/f(v0 m2022-04-03T18:08:26.278322+0000 64=42+22) n(v3 rc2022-04-03T18:08:26.278322+0000 22=0+22) hs=22+406,ss=0+0 dirty=425 | child=1 replicated=1 dirty=1 0x558167b94480]
    -2> 2022-04-03T18:21:56.273+0000 7f339054b700  7 mds.1.cache.dir(0x613.111*) fetch keys, all are already being fetched
    -1> 2022-04-03T18:21:56.275+0000 7f339054b700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-11442-gcfb8f943/rpm/el8/BUILD/ceph-17.0.0-11442-gcfb8f943/src/mds/CDir.cc: In function 'void CDir::fetch_keys(const std::vector<dentry_key_t>&, MDSContext*)' thread 7f339054b700 time 2022-04-03T18:21:56.275271+0000
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-11442-gcfb8f943/rpm/el8/BUILD/ceph-17.0.0-11442-gcfb8f943/src/mds/CDir.cc: 1640: FAILED ceph_assert(!c)

 ceph version 17.0.0-11442-gcfb8f943 (cfb8f943163b374162da0d7b0240f267dd46e4e1) quincy (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x7f3398d6a144]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x284365) [0x7f3398d6a365]
 3: (CDir::fetch_keys(std::vector<dentry_key_t, std::allocator<dentry_key_t> > const&, MDSContext*)+0x43c) [0x55816165542c]
 4: (MDCache::open_undef_inodes_dirfrags()+0x6d6) [0x5581615339c6]
 5: (MDCache::rejoin_gather_finish()+0xa8) [0x558161540f78]
 6: (MDCache::handle_cache_rejoin_strong(boost::intrusive_ptr<MMDSCacheRejoin const> const&)+0x30f1) [0x55816154cbc1]
 7: (MDCache::handle_cache_rejoin(boost::intrusive_ptr<MMDSCacheRejoin const> const&)+0xdb) [0x558161550d0b]
 8: (MDCache::dispatch(boost::intrusive_ptr<Message const> const&)+0x354) [0x558161551214]
 9: (MDSRank::handle_message(boost::intrusive_ptr<Message const> const&)+0x942) [0x5581613d8502]
 10: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x7cb) [0x5581613db54b]
 11: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x5c) [0x5581613dbb6c]
 12: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x108) [0x5581613caa68]
 13: (DispatchQueue::entry()+0x14fa) [0x7f3398ff1f7a]
 14: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f33990a9651]
 15: /lib64/libpthread.so.0(+0x814a) [0x7f3397d4314a]
 16: clone()

     0> 2022-04-03T18:21:56.278+0000 7f339054b700 -1 *** Caught signal (Aborted) **
 in thread 7f339054b700 thread_name:ms_dispatch
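
For triage, the failed assertion can be pulled out of a log like the one above mechanically. The following is a minimal sketch, assuming the assertion line has the shape `<path>: <line>: FAILED ceph_assert(<expr>)` as it does in this log; the helper name `find_failed_assert` and the shortened sample path are hypothetical and not part of any Ceph tooling.

```python
import re

# Matches a line shaped like:
#   <path>: <line>: FAILED ceph_assert(<expr>)
# This shape is taken from the log in this report; other builds may differ.
ASSERT_RE = re.compile(
    r"^(?P<path>\S+): (?P<line>\d+): FAILED ceph_assert\((?P<expr>.*)\)\s*$"
)


def find_failed_assert(log_lines):
    """Return (path, line_number, expression) for the first failed
    ceph_assert found in log_lines, or None if there is none.

    Hypothetical helper for illustration only.
    """
    for raw in log_lines:
        match = ASSERT_RE.match(raw.strip())
        if match:
            return (
                match.group("path"),
                int(match.group("line")),
                match.group("expr"),
            )
    return None


# Sample input: one ordinary debug line and one assertion line.
# The build path is shortened here for readability (an assumption of
# this example, not the path from the real log).
sample = [
    "    -2> 2022-04-03T18:21:56.273+0000 7f339054b700  7 "
    "mds.1.cache.dir(0x613.111*) fetch keys, all are already being fetched",
    "src/mds/CDir.cc: 1640: FAILED ceph_assert(!c)",
]

result = find_failed_assert(sample)
print(result)
```

Run against the full log, this would report the source file, line 1640, and the expression `!c`, which is enough to locate the assertion in `CDir::fetch_keys`.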

Test matrix:

Description: fs/thrash/workloads/{begin/{0-install 1-ceph 2-logrotate} clusters/1a5s-mds-1c-client conf/{client mds mon osd} distro/{rhel_8} mount/fuse msgr-failures/osd-mds-delay objectstore-ec/bluestore-comp-ec-root overrides/{frag prefetch_dirfrags/no prefetch_entire_dirfrags/no races session_timeout thrashosds-health whitelist_health whitelist_wrongly_marked_down} ranks/3 tasks/{1-thrash/osd 2-workunit/fs/snaps}}

The crash seems unrelated to the PRs being tested in the branch.

History

#1 Updated by Venky Shankar 8 months ago

  • Status changed from New to Triaged
  • Assignee set to Venky Shankar

#2 Updated by Venky Shankar 7 months ago

  • Status changed from Triaged to Fix Under Review
  • Pull request ID set to 46063

#3 Updated by Venky Shankar 7 months ago

  • Backport deleted (quincy, pacific)

#4 Updated by Venky Shankar 7 months ago

  • Status changed from Fix Under Review to Resolved
