Bug #55537

mds: crash during fs:upgrade test

Added by Venky Shankar almost 2 years ago. Updated 6 months ago.

Status:
Triaged
Priority:
Normal
Assignee:
Category:
Correctness/Safety
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
quincy, pacific
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS, tools
Labels (FS):
crash
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

- https://pulpito.ceph.com/vshankar-2022-05-02_16:58:59-fs-wip-vshankar-testing1-20220502-201957-testing-default-smithi/6818727/
- https://pulpito.ceph.com/vshankar-2022-05-02_16:58:59-fs-wip-vshankar-testing1-20220502-201957-testing-default-smithi/6818695/

Backtrace:

2022-05-02T17:21:18.407 INFO:tasks.ceph.mds.b.smithi007.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-11928-gaf5caccc/rpm/el8/BUILD/ceph-17.0.0-11928-gaf5caccc/src/mds/CDir.cc: In function 'bool CDir::check_rstats(bool)' thread 7f251efed700 time 2022-05-02T17:21:18.405408+0000
2022-05-02T17:21:18.407 INFO:tasks.ceph.mds.b.smithi007.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-11928-gaf5caccc/rpm/el8/BUILD/ceph-17.0.0-11928-gaf5caccc/src/mds/CDir.cc: 294: FAILED ceph_assert(frag_info.nsubdirs == fnode->fragstat.nsubdirs)
2022-05-02T17:21:18.408 INFO:tasks.ceph.mds.b.smithi007.stderr: ceph version 17.0.0-11928-gaf5caccc (af5caccc3a270ee8c7fe4530fa943e48e8028552) quincy (dev)
2022-05-02T17:21:18.408 INFO:tasks.ceph.mds.b.smithi007.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x7f252e8851c4]
2022-05-02T17:21:18.408 INFO:tasks.ceph.mds.b.smithi007.stderr: 2: /usr/lib64/ceph/libceph-common.so.2(+0x2853e5) [0x7f252e8853e5]
2022-05-02T17:21:18.409 INFO:tasks.ceph.mds.b.smithi007.stderr: 3: (CDir::check_rstats(bool)+0x17dd) [0x55618923a5dd]
2022-05-02T17:21:18.409 INFO:tasks.ceph.mds.b.smithi007.stderr: 4: (MDCache::predirty_journal_parents(boost::intrusive_ptr<MutationImpl>, EMetaBlob*, CInode*, CDir*, int, int, snapid_t)+0x12a9) [0x5561890de789]
2022-05-02T17:21:18.409 INFO:tasks.ceph.mds.b.smithi007.stderr: 5: (MDCache::_create_system_file(CDir*, std::basic_string_view<char, std::char_traits<char> >, CInode*, MDSContext*)+0x64e) [0x5561890e011e]
2022-05-02T17:21:18.409 INFO:tasks.ceph.mds.b.smithi007.stderr: 6: (MDCache::populate_mydir()+0x846) [0x5561891232c6]
2022-05-02T17:21:18.409 INFO:tasks.ceph.mds.b.smithi007.stderr: 7: (MDCache::open_root()+0xd8) [0x556189123608]
2022-05-02T17:21:18.409 INFO:tasks.ceph.mds.b.smithi007.stderr: 8: (C_MDS_RetryOpenRoot::finish(int)+0x27) [0x556189192df7]
2022-05-02T17:21:18.410 INFO:tasks.ceph.mds.b.smithi007.stderr: 9: (MDSContext::complete(int)+0x203) [0x5561892fbc03]
2022-05-02T17:21:18.410 INFO:tasks.ceph.mds.b.smithi007.stderr: 10: (MDSRank::_advance_queues()+0x84) [0x556188fda764]
2022-05-02T17:21:18.410 INFO:tasks.ceph.mds.b.smithi007.stderr: 11: (MDSRank::ProgressThread::entry()+0xc5) [0x556188fdae95]
2022-05-02T17:21:18.410 INFO:tasks.ceph.mds.b.smithi007.stderr: 12: /lib64/libpthread.so.0(+0x81cf) [0x7f252d0081cf]
2022-05-02T17:21:18.410 INFO:tasks.ceph.mds.b.smithi007.stderr: 13: clone()

Related issues

Related to CephFS - Bug #57087: qa: test_fragmented_injection (tasks.cephfs.test_data_scan.TestDataScan) failure Pending Backport

History

#1 Updated by Venky Shankar almost 2 years ago

  • Status changed from New to Triaged
  • Assignee set to Ramana Raja

#2 Updated by Venky Shankar over 1 year ago

  • Assignee changed from Ramana Raja to Venky Shankar
  • Component(FS) tools added
  • Labels (FS) crash added

This is pretty easily reproducible with the following steps:

mkdir -p /mnt/cephfs/dirx/diry
cp /etc/hosts /mnt/cephfs/dirx/diry

Flush the journal so that the omap entries get written:

ceph tell mds.<> flush journal

Get the inode number of /mnt/cephfs/dirx/diry - say it is 0x10000000001.
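
For illustration, one way to read that inode number from the client and convert it to the hex form used in object names (a quick sketch, assuming the mount point from the steps above):

ino=$(stat -c %i /mnt/cephfs/dirx/diry)   # inode number in decimal
printf '0x%x\n' "$ino"                    # hex form, e.g. 0x10000000001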

Shut down all MDSs:

ceph fs fail <fs>

Delete the directory object:

rados -p <meta-pool> rm 10000000001.00000000
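
As an aside, the dirfrag object name is just the directory's inode number in hex followed by the frag id, which is 00000000 for an unfragmented directory; a quick illustration:

printf '%x.%08x\n' 0x10000000001 0    # -> 10000000001.00000000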

(The file `hosts` is unavailable at this point.) Start the metadata recovery procedure:

cephfs-data-scan scan_extents <data-pool>
cephfs-data-scan scan_inodes <data-pool>
cephfs-data-scan scan_links

The directory object gets recovered, which can be verified with:

rados -p <meta-pool> listomapvals 10000000001.00000000

Bring the file system back online:

ceph fs set <fs> joinable true

(The file `hosts` is available again under /mnt/cephfs/dirx/diry.) Create another file under /mnt/cephfs/dirx/diry:

cp /etc/resolv.conf /mnt/cephfs/dirx/diry

The MDS crashes with the following backtrace:

    -8> 2022-12-14T23:21:38.559-0500 7ff09e6d0640 10 mds.0.cache project_rstat_frag_to_inode [2,head]
    -7> 2022-12-14T23:21:38.559-0500 7ff09e6d0640 20 mds.0.cache   frag           rstat n(v0 rc2022-12-14T23:21:38.563794-0500 1=1+0)
    -6> 2022-12-14T23:21:38.559-0500 7ff09e6d0640 20 mds.0.cache   frag accounted_rstat n()
    -5> 2022-12-14T23:21:38.559-0500 7ff09e6d0640 20 mds.0.cache                  delta n(v0 rc2022-12-14T23:21:38.563794-0500 1=1+0)
    -4> 2022-12-14T23:21:38.559-0500 7ff09e6d0640 20 mds.0.cache  projecting to [2,head] n(v0 rc2022-12-14T23:15:09.271528-0500 b260 2=1+1)
    -3> 2022-12-14T23:21:38.559-0500 7ff09e6d0640 20 mds.0.cache         result [2,head] n(v0 rc2022-12-14T23:21:38.563794-0500 b260 3=2+1)
    -2> 2022-12-14T23:21:38.559-0500 7ff09e6d0640 -1 log_channel(cluster) log [ERR] : unmatched rstat rbytes on single dirfrag 0x10000000001, inode has n(v0 rc2022-12-14T23:21:38.563794-0500 b260 3=2+1), dirfrag has n(v0 rc2022-12-14T23:21:38.563794-0500 1=1+0)
    -1> 2022-12-14T23:21:38.575-0500 7ff09e6d0640 -1 /work/ceph/src/mds/MDCache.cc: In function 'void MDCache::predirty_journal_parents(MutationRef, EMetaBlob*, CInode*, CDir*, int, int, snapid_t)' thread 7ff09e6d0640 time 2022-12-14T23:21:38.566190-0500
/work/ceph/src/mds/MDCache.cc: 2361: FAILED ceph_assert(!"unmatched rstat rbytes" == g_conf()->mds_verify_scatter)

 ceph version 18.0.0-758-g7502beb21a9 (7502beb21a916a23dbf8a0812132964fb441ea3a) reef (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x125) [0x7ff0a4eaab08]
 2: (ceph::register_assert_context(ceph::common::CephContext*)+0) [0x7ff0a4eaad0f]
 3: (MDCache::predirty_journal_parents(boost::intrusive_ptr<MutationImpl>, EMetaBlob*, CInode*, CDir*, int, int, snapid_t)+0x1b06) [0x5591444c4608]
 4: (Server::handle_client_openc(boost::intrusive_ptr<MDRequestImpl>&)+0xd30) [0x55914445fdbe]
 5: (Server::dispatch_client_request(boost::intrusive_ptr<MDRequestImpl>&)+0xbe4) [0x55914447259a]
 6: (Server::handle_client_request(boost::intrusive_ptr<MClientRequest const> const&)+0xd5f) [0x55914447360b]
 7: (Server::dispatch(boost::intrusive_ptr<Message const> const&)+0xa4) [0x5591444769d8]
 8: (MDSRank::handle_message(boost::intrusive_ptr<Message const> const&)+0x40f) [0x5591443ea4df]
 9: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x217) [0x5591443ec303]
 10: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x1c3) [0x5591443ecc0b]
 11: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x1cc) [0x5591443dcc9e]
 12: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0xa4) [0x7ff0a4fd545a]
 13: (DispatchQueue::entry()+0x3e1) [0x7ff0a4fd1aa1]
 14: (DispatchQueue::DispatchThread::entry()+0xd) [0x7ff0a5072967]
 15: (Thread::entry_wrapper()+0x3f) [0x7ff0a4e8e411]
 16: (Thread::_entry_func(void*)+0x9) [0x7ff0a4e8e429]
 17: /lib/x86_64-linux-gnu/libc.so.6(+0x87b27) [0x7ff0a4239b27]
 18: /lib/x86_64-linux-gnu/libc.so.6(+0x10a78c) [0x7ff0a42bc78c]

Although the backtrace is different, I think the underlying issue is the same for both crashes (rstat/fragstat mismatch).

#3 Updated by Venky Shankar over 1 year ago

  • Related to Bug #57087: qa: test_fragmented_injection (tasks.cephfs.test_data_scan.TestDataScan) failure added

#4 Updated by Patrick Donnelly over 1 year ago

I think that assert is expected when using cephfs-data-scan. You need to run scrub and disable those configs before turning the configs back on.
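
For reference, a minimal sketch of that sequence (assuming the configs in question are mds_verify_scatter and mds_debug_scatterstat, which QA turns on, and that <fs> has a single active rank 0):

ceph config set mds mds_verify_scatter false
ceph config set mds mds_debug_scatterstat false
ceph tell mds.<fs>:0 scrub start / recursive,repair
ceph tell mds.<fs>:0 scrub status          # wait until the scrub finishes
ceph config set mds mds_verify_scatter true
ceph config set mds mds_debug_scatterstat true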

#5 Updated by Venky Shankar over 1 year ago

Patrick Donnelly wrote:

I think that assert is expected when using cephfs-data-scan. You need to run scrub and disable those configs before turning the configs back on.

Really? Running a scrub will ensure that the rbytes in the inode and the dirfrag match. The documentation (metadata recovery) does say "After recovery, some recovered directories will have incorrect statistics", but we do not mention that a scrub must be run before using the file system, otherwise the MDS would crash.

#6 Updated by Patrick Donnelly over 1 year ago

Venky Shankar wrote:

Patrick Donnelly wrote:

I think that assert is expected when using cephfs-data-scan. You need to run scrub and disable those configs before turning the configs back on.

Really? Running a scrub will ensure that the rbytes in the inode and the dirfrag match. The documentation (metadata recovery) does say "After recovery, some recovered directories will have incorrect statistics", but we do not mention that a scrub must be run before using the file system, otherwise the MDS would crash.

The MDS won't crash if the configs are off (the default). We turn them on globally in QA.

#7 Updated by Venky Shankar over 1 year ago

Patrick Donnelly wrote:

Venky Shankar wrote:

Patrick Donnelly wrote:

I think that assert is expected when using cephfs-data-scan. You need to run scrub and disable those configs before turning the configs back on.

Really? Running a scrub will ensure that the rbytes in the inode and the dirfrag match. The documentation (metadata recovery) does say "After recovery, some recovered directories will have incorrect statistics", but we do not mention that a scrub must be run before using the file system, otherwise the MDS would crash.

The MDS won't crash if the configs are off (the default). We turn them on globally in QA.

The config (mds_verify_scatter) is off - https://tracker.ceph.com/issues/55537#note-2 is from a vstart cluster.
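
For completeness, one way to confirm the value on a running MDS (a sketch; mds.a is just the example daemon name from a vstart cluster):

ceph daemon mds.a config get mds_verify_scatter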

#8 Updated by Patrick Donnelly 6 months ago

  • Target version deleted (v18.0.0)
