Bug #55537
mds: crash during fs:upgrade test
Description
- https://pulpito.ceph.com/vshankar-2022-05-02_16:58:59-fs-wip-vshankar-testing1-20220502-201957-testing-default-smithi/6818727/
- https://pulpito.ceph.com/vshankar-2022-05-02_16:58:59-fs-wip-vshankar-testing1-20220502-201957-testing-default-smithi/6818695/
Backtrace:
2022-05-02T17:21:18.407 INFO:tasks.ceph.mds.b.smithi007.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-11928-gaf5caccc/rpm/el8/BUILD/ceph-17.0.0-11928-gaf5caccc/src/mds/CDir.cc: In function 'bool CDir::check_rstats(bool)' thread 7f251efed700 time 2022-05-02T17:21:18.405408+0000
2022-05-02T17:21:18.407 INFO:tasks.ceph.mds.b.smithi007.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-11928-gaf5caccc/rpm/el8/BUILD/ceph-17.0.0-11928-gaf5caccc/src/mds/CDir.cc: 294: FAILED ceph_assert(frag_info.nsubdirs == fnode->fragstat.nsubdirs)
2022-05-02T17:21:18.408 INFO:tasks.ceph.mds.b.smithi007.stderr: ceph version 17.0.0-11928-gaf5caccc (af5caccc3a270ee8c7fe4530fa943e48e8028552) quincy (dev)
2022-05-02T17:21:18.408 INFO:tasks.ceph.mds.b.smithi007.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x7f252e8851c4]
2022-05-02T17:21:18.408 INFO:tasks.ceph.mds.b.smithi007.stderr: 2: /usr/lib64/ceph/libceph-common.so.2(+0x2853e5) [0x7f252e8853e5]
2022-05-02T17:21:18.409 INFO:tasks.ceph.mds.b.smithi007.stderr: 3: (CDir::check_rstats(bool)+0x17dd) [0x55618923a5dd]
2022-05-02T17:21:18.409 INFO:tasks.ceph.mds.b.smithi007.stderr: 4: (MDCache::predirty_journal_parents(boost::intrusive_ptr<MutationImpl>, EMetaBlob*, CInode*, CDir*, int, int, snapid_t)+0x12a9) [0x5561890de789]
2022-05-02T17:21:18.409 INFO:tasks.ceph.mds.b.smithi007.stderr: 5: (MDCache::_create_system_file(CDir*, std::basic_string_view<char, std::char_traits<char> >, CInode*, MDSContext*)+0x64e) [0x5561890e011e]
2022-05-02T17:21:18.409 INFO:tasks.ceph.mds.b.smithi007.stderr: 6: (MDCache::populate_mydir()+0x846) [0x5561891232c6]
2022-05-02T17:21:18.409 INFO:tasks.ceph.mds.b.smithi007.stderr: 7: (MDCache::open_root()+0xd8) [0x556189123608]
2022-05-02T17:21:18.409 INFO:tasks.ceph.mds.b.smithi007.stderr: 8: (C_MDS_RetryOpenRoot::finish(int)+0x27) [0x556189192df7]
2022-05-02T17:21:18.410 INFO:tasks.ceph.mds.b.smithi007.stderr: 9: (MDSContext::complete(int)+0x203) [0x5561892fbc03]
2022-05-02T17:21:18.410 INFO:tasks.ceph.mds.b.smithi007.stderr: 10: (MDSRank::_advance_queues()+0x84) [0x556188fda764]
2022-05-02T17:21:18.410 INFO:tasks.ceph.mds.b.smithi007.stderr: 11: (MDSRank::ProgressThread::entry()+0xc5) [0x556188fdae95]
2022-05-02T17:21:18.410 INFO:tasks.ceph.mds.b.smithi007.stderr: 12: /lib64/libpthread.so.0(+0x81cf) [0x7f252d0081cf]
2022-05-02T17:21:18.410 INFO:tasks.ceph.mds.b.smithi007.stderr: 13: clone()
Related issues
History
#1 Updated by Venky Shankar almost 2 years ago
- Status changed from New to Triaged
- Assignee set to Ramana Raja
#2 Updated by Venky Shankar over 1 year ago
- Assignee changed from Ramana Raja to Venky Shankar
- Component(FS) tools added
- Labels (FS) crash added
This is pretty easily reproducible with the following steps:
mkdir -p /mnt/cephfs/dirx/diry
cp /etc/hosts /mnt/cephfs/dirx/diry
Flush the journal so that the omap entries get written:
ceph tell mds.<> flush journal
Get the inode number of /mnt/cephfs/dirx/diry - say it's 0x10000000001.
Shut down all MDSs:
ceph fs fail <fs>
Delete the directory object:
rados -p <meta-pool> rm 10000000001.00000000
(File `hosts` is unavailable at this point.) Start the metadata recovery procedure:
cephfs-data-scan scan_extents <data-pool>
cephfs-data-scan scan_inodes <data-pool>
cephfs-data-scan scan_links
The directory object gets recovered:
rados -p <meta-pool> listomapvals 10000000001.00000000
Bring file system back online:
ceph fs set <fs> joinable true
(File `hosts` is available under /mnt/cephfs/dirx/diry). Create another file under /mnt/cephfs/dirx/diry:
cp /etc/resolv.conf /mnt/cephfs/dirx/diry
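The steps above can be collected into one reproduction sketch. The file system, pool, and mount-point names below are assumptions (substitute your own), and the inode number must be read from the actual cluster; this obviously needs a live test cluster, and it deliberately breaks the file system:

```shell
#!/bin/sh
# Reproduction sketch for the rstat/fragstat mismatch crash.
# Assumed names: fs "cephfs", metadata pool "cephfs_metadata",
# data pool "cephfs_data", client mount at /mnt/cephfs.
set -e

mkdir -p /mnt/cephfs/dirx/diry
cp /etc/hosts /mnt/cephfs/dirx/diry

# Flush the journal so the dirfrag omap entries get written.
ceph tell mds.cephfs:0 flush journal

# Inode number of /mnt/cephfs/dirx/diry, in hex - e.g. 0x10000000001.
INO=10000000001

# Take the file system offline and delete the directory object.
ceph fs fail cephfs
rados -p cephfs_metadata rm ${INO}.00000000

# Recover metadata from the data pool.
cephfs-data-scan scan_extents cephfs_data
cephfs-data-scan scan_inodes cephfs_data
cephfs-data-scan scan_links

# Bring the file system back online; the next create under the
# recovered directory hits the rstat mismatch.
ceph fs set cephfs joinable true
cp /etc/resolv.conf /mnt/cephfs/dirx/diry
```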
MDS crashes with backtrace:
    -8> 2022-12-14T23:21:38.559-0500 7ff09e6d0640 10 mds.0.cache project_rstat_frag_to_inode [2,head]
    -7> 2022-12-14T23:21:38.559-0500 7ff09e6d0640 20 mds.0.cache frag rstat n(v0 rc2022-12-14T23:21:38.563794-0500 1=1+0)
    -6> 2022-12-14T23:21:38.559-0500 7ff09e6d0640 20 mds.0.cache frag accounted_rstat n()
    -5> 2022-12-14T23:21:38.559-0500 7ff09e6d0640 20 mds.0.cache delta n(v0 rc2022-12-14T23:21:38.563794-0500 1=1+0)
    -4> 2022-12-14T23:21:38.559-0500 7ff09e6d0640 20 mds.0.cache projecting to [2,head] n(v0 rc2022-12-14T23:15:09.271528-0500 b260 2=1+1)
    -3> 2022-12-14T23:21:38.559-0500 7ff09e6d0640 20 mds.0.cache result [2,head] n(v0 rc2022-12-14T23:21:38.563794-0500 b260 3=2+1)
    -2> 2022-12-14T23:21:38.559-0500 7ff09e6d0640 -1 log_channel(cluster) log [ERR] : unmatched rstat rbytes on single dirfrag 0x10000000001, inode has n(v0 rc2022-12-14T23:21:38.563794-0500 b260 3=2+1), dirfrag has n(v0 rc2022-12-14T23:21:38.563794-0500 1=1+0)
    -1> 2022-12-14T23:21:38.575-0500 7ff09e6d0640 -1 /work/ceph/src/mds/MDCache.cc: In function 'void MDCache::predirty_journal_parents(MutationRef, EMetaBlob*, CInode*, CDir*, int, int, snapid_t)' thread 7ff09e6d0640 time 2022-12-14T23:21:38.566190-0500
/work/ceph/src/mds/MDCache.cc: 2361: FAILED ceph_assert(!"unmatched rstat rbytes" == g_conf()->mds_verify_scatter)
 ceph version 18.0.0-758-g7502beb21a9 (7502beb21a916a23dbf8a0812132964fb441ea3a) reef (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x125) [0x7ff0a4eaab08]
 2: (ceph::register_assert_context(ceph::common::CephContext*)+0) [0x7ff0a4eaad0f]
 3: (MDCache::predirty_journal_parents(boost::intrusive_ptr<MutationImpl>, EMetaBlob*, CInode*, CDir*, int, int, snapid_t)+0x1b06) [0x5591444c4608]
 4: (Server::handle_client_openc(boost::intrusive_ptr<MDRequestImpl>&)+0xd30) [0x55914445fdbe]
 5: (Server::dispatch_client_request(boost::intrusive_ptr<MDRequestImpl>&)+0xbe4) [0x55914447259a]
 6: (Server::handle_client_request(boost::intrusive_ptr<MClientRequest const> const&)+0xd5f) [0x55914447360b]
 7: (Server::dispatch(boost::intrusive_ptr<Message const> const&)+0xa4) [0x5591444769d8]
 8: (MDSRank::handle_message(boost::intrusive_ptr<Message const> const&)+0x40f) [0x5591443ea4df]
 9: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x217) [0x5591443ec303]
 10: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x1c3) [0x5591443ecc0b]
 11: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x1cc) [0x5591443dcc9e]
 12: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0xa4) [0x7ff0a4fd545a]
 13: (DispatchQueue::entry()+0x3e1) [0x7ff0a4fd1aa1]
 14: (DispatchQueue::DispatchThread::entry()+0xd) [0x7ff0a5072967]
 15: (Thread::entry_wrapper()+0x3f) [0x7ff0a4e8e411]
 16: (Thread::_entry_func(void*)+0x9) [0x7ff0a4e8e429]
 17: /lib/x86_64-linux-gnu/libc.so.6(+0x87b27) [0x7ff0a4239b27]
 18: /lib/x86_64-linux-gnu/libc.so.6(+0x10a78c) [0x7ff0a42bc78c]
Although the backtrace is different, I think the underlying issue is the same for both crashes (rstat/fragstat mismatch).
#3 Updated by Venky Shankar over 1 year ago
- Related to Bug #57087: qa: test_fragmented_injection (tasks.cephfs.test_data_scan.TestDataScan) failure added
#4 Updated by Patrick Donnelly over 1 year ago
I think that assert is expected when using cephfs-data-scan. You need to run scrub and disable those configs before turning the configs back on.
#5 Updated by Venky Shankar over 1 year ago
Patrick Donnelly wrote:
I think that assert is expected when using cephfs-data-scan. You need to run scrub and disable those configs before turning the configs back on.
Really? Running scrub will ensure that rbytes in the inode and the dirfrag are matched. The document (metadata recovery) does say "After recovery, some recovered directories will have incorrect statistics", but, we do not mention to run scrub before using the file system or the MDS would crash.
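For reference, the post-recovery repair being discussed would presumably be a recursive repair scrub from the root, per the disaster-recovery docs. A sketch, with the fs name ("cephfs") as an assumption; this requires an online MDS:

```shell
# Recalculate and repair recursive stats after cephfs-data-scan
# recovery (fs name "cephfs" is an assumption).
ceph tell mds.cephfs:0 scrub start / recursive,repair

# Poll until the scrub completes.
ceph tell mds.cephfs:0 scrub status
```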
#6 Updated by Patrick Donnelly over 1 year ago
Venky Shankar wrote:
Patrick Donnelly wrote:
I think that assert is expected when using cephfs-data-scan. You need to run scrub and disable those configs before turning the configs back on.
Really? Running scrub will ensure that rbytes in the inode and the dirfrag are matched. The document (metadata recovery) does say "After recovery, some recovered directories will have incorrect statistics", but, we do not mention to run scrub before using the file system or the MDS would crash.
The MDS won't crash if the configs are off (the default). We turn them on globally in QA.
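The configs referred to here appear to be the MDS scatter-stat verification options, which default to false and are enabled in QA. A sketch of inspecting and disabling them (option names assumed from the assert above and QA overrides):

```shell
# Both default to false outside QA; the assert at MDCache.cc:2361
# is gated on mds_verify_scatter.
ceph config get mds mds_verify_scatter
ceph config get mds mds_debug_scatterstat

# Turn them off so recovered-but-unscrubbed stats don't assert:
ceph config set mds mds_verify_scatter false
ceph config set mds mds_debug_scatterstat false
```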
#7 Updated by Venky Shankar over 1 year ago
Patrick Donnelly wrote:
Venky Shankar wrote:
Patrick Donnelly wrote:
I think that assert is expected when using cephfs-data-scan. You need to run scrub and disable those configs before turning the configs back on.
Really? Running scrub will ensure that rbytes in the inode and the dirfrag are matched. The document (metadata recovery) does say "After recovery, some recovered directories will have incorrect statistics", but, we do not mention to run scrub before using the file system or the MDS would crash.
The MDS won't crash if the configs are off (the default). We turn them on globally in QA.
The config (mds_verify_scatter) is off - https://tracker.ceph.com/issues/55537#note-2 is from a vstart cluster.
#8 Updated by Patrick Donnelly 6 months ago
- Target version deleted (v18.0.0)