Bug #61407

open

mds: abort on CInode::verify_dirfrags

Added by Patrick Donnelly 11 months ago. Updated 17 days ago.

Status:
New
Priority:
Urgent
Category:
-
Target version:
% Done:

0%

Source:
Q/A
Tags:
Backport:
reef,quincy,pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
crash
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

   -18> 2023-05-24T05:09:18.013+0000 7f44141f6700 14 mds.1.cache.ino(0x10000002fe5) close_dirfrag 101010011*
   -17> 2023-05-24T05:09:18.013+0000 7f44141f6700 12 mds.1.cache.dir(0x10000002fe5.101010011*) remove_null_dentries [dir 0x10000002fe5.101010011* ~mds3/stray9/10000002fe5/ [2,head] rep@3.1 state=0 f(v72) n(v77) hs=0+0,ss=0+0 0x5637cc364480]
   -16> 2023-05-24T05:09:18.013+0000 7f44141f6700  0 mds.1.cache.ino(0x10000002fe5) have open dirfrag 101010100* but not leaf in fragtree_t(*^2 00*^3 00000*^1 00001*^1 000011*^1 0000110*^2 00010*^1 00011*^1 00100*^1 00101*^1 00110*^1 001101*^1 0011011*^1 00110111*^1 01*^3 01000*^1 010000*^1 0100000*^1 01000001*^2 0100000100*^1 0100000101*^1 0100000111*^1 01111*^1 011111*^2 01111111*^2 0111111100*^1 0111111111*^1 10*^3 10000*^1 10010*^1 100100*^2 10010001*^1 10010010*^1 100101*^2 10010100*^1 10010110*^1 10010111*^1 10011*^1 10100*^1 10101*^1 101010*^1 1010101*^1 10101011*^1 10110*^2 1011000*^2 1011001*^2 1011011*^1 10110110*^1 11*^2 1101*^1 11010*^2 1101000*^1 1101010*^1 1101011*^1 11011*^2 1101100*^1 1101110*^1 1101111*^1 1110*^1 1111*^1 11110*^2 1111000*^1 1111001*^1 1111011*^1): [dir 0x10000002fe5.101010100* ~mds3/stray9/10000002fe5/ [2,head] rep@3.1 state=0 f(v72) n(v77) hs=0+0,ss=0+0 0x5637cb3a6900]
   -15> 2023-05-24T05:09:18.013+0000 7f44141f6700  0 mds.1.cache.ino(0x10000002fe5) have open dirfrag 101010101* but not leaf in fragtree_t(*^2 00*^3 00000*^1 00001*^1 000011*^1 0000110*^2 00010*^1 00011*^1 00100*^1 00101*^1 00110*^1 001101*^1 0011011*^1 00110111*^1 01*^3 01000*^1 010000*^1 0100000*^1 01000001*^2 0100000100*^1 0100000101*^1 0100000111*^1 01111*^1 011111*^2 01111111*^2 0111111100*^1 0111111111*^1 10*^3 10000*^1 10010*^1 100100*^2 10010001*^1 10010010*^1 100101*^2 10010100*^1 10010110*^1 10010111*^1 10011*^1 10100*^1 10101*^1 101010*^1 1010101*^1 10101011*^1 10110*^2 1011000*^2 1011001*^2 1011011*^1 10110110*^1 11*^2 1101*^1 11010*^2 1101000*^1 1101010*^1 1101011*^1 11011*^2 1101100*^1 1101110*^1 1101111*^1 1110*^1 1111*^1 11110*^2 1111000*^1 1111001*^1 1111011*^1): [dir 0x10000002fe5.101010101* ~mds3/stray9/10000002fe5/ [2,head] rep@3.1 state=0 f(v72) n(v77) hs=0+0,ss=0+0 0x5637cba5c000]
   -14> 2023-05-24T05:09:18.307+0000 7f44171fc700  1 -- [v2:172.21.15.38:6834/668858652,v1:172.21.15.38:6835/668858652] <== mds.0 v2:172.21.15.38:6836/3553899173 23500 ==== mdsping v1 ==== 8+0+0 (crc 0 0 0) 0x5637c958c640 con 0x5637c642c400
   -13> 2023-05-24T05:09:18.307+0000 7f44171fc700 20 set_next_seq: mds.metrics: current sequence number 629, setting next sequence number 630
   -12> 2023-05-24T05:09:18.377+0000 7f44171fc700  1 -- [v2:172.21.15.38:6834/668858652,v1:172.21.15.38:6835/668858652] >> [v2:172.21.15.77:6834/3403643132,v1:172.21.15.77:6836/3403643132] conn(0x5637c5445400 msgr2=0x5637c640d700 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_bulk peer close file descriptor 35
   -11> 2023-05-24T05:09:18.377+0000 7f44171fc700  1 -- [v2:172.21.15.38:6834/668858652,v1:172.21.15.38:6835/668858652] >> [v2:172.21.15.77:6834/3403643132,v1:172.21.15.77:6836/3403643132] conn(0x5637c5445400 msgr2=0x5637c640d700 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_until read failed
   -10> 2023-05-24T05:09:18.377+0000 7f44171fc700  1 --2- [v2:172.21.15.38:6834/668858652,v1:172.21.15.38:6835/668858652] >> [v2:172.21.15.77:6834/3403643132,v1:172.21.15.77:6836/3403643132] conn(0x5637c5445400 0x5637c640d700 crc :-1 s=READY pgs=753 cs=108 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).handle_read_frame_preamble_main read frame preamble failed r=-1
    -9> 2023-05-24T05:09:18.377+0000 7f44171fc700  1 --2- [v2:172.21.15.38:6834/668858652,v1:172.21.15.38:6835/668858652] >> [v2:172.21.15.77:6834/3403643132,v1:172.21.15.77:6836/3403643132] conn(0x5637c5445400 0x5637c640d700 unknown :-1 s=READY pgs=753 cs=108 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0)._fault with nothing to send, going to standby
    -8> 2023-05-24T05:09:18.381+0000 7f44169fb700  1 -- [v2:172.21.15.38:6834/668858652,v1:172.21.15.38:6835/668858652] >> [v2:172.21.15.77:6835/1221529923,v1:172.21.15.77:6837/1221529923] conn(0x5637c5445000 msgr2=0x5637c54df080 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_bulk peer close file descriptor 33
    -7> 2023-05-24T05:09:18.381+0000 7f44169fb700  1 -- [v2:172.21.15.38:6834/668858652,v1:172.21.15.38:6835/668858652] >> [v2:172.21.15.77:6835/1221529923,v1:172.21.15.77:6837/1221529923] conn(0x5637c5445000 msgr2=0x5637c54df080 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_until read failed
    -6> 2023-05-24T05:09:18.381+0000 7f44169fb700  1 --2- [v2:172.21.15.38:6834/668858652,v1:172.21.15.38:6835/668858652] >> [v2:172.21.15.77:6835/1221529923,v1:172.21.15.77:6837/1221529923] conn(0x5637c5445000 0x5637c54df080 unknown :-1 s=READY pgs=914 cs=126 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).handle_read_frame_preamble_main read frame preamble failed r=-1
    -5> 2023-05-24T05:09:18.381+0000 7f44169fb700  1 --2- [v2:172.21.15.38:6834/668858652,v1:172.21.15.38:6835/668858652] >> [v2:172.21.15.77:6835/1221529923,v1:172.21.15.77:6837/1221529923] conn(0x5637c5445000 0x5637c54df080 unknown :-1 s=READY pgs=914 cs=126 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0)._fault with nothing to send, going to standby
    -4> 2023-05-24T05:09:18.517+0000 7f44131f4700 10 monclient: tick
    -3> 2023-05-24T05:09:18.517+0000 7f44131f4700 10 monclient: _check_auth_tickets
    -2> 2023-05-24T05:09:18.517+0000 7f44131f4700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2023-05-24T05:08:48.519327+0000)
    -1> 2023-05-24T05:09:18.554+0000 7f44141f6700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.0.0-4167-gfa0e62c4/rpm/el8/BUILD/ceph-18.0.0-4167-gfa0e62c4/src/mds/CInode.cc: In function 'void CInode::verify_dirfrags()' thread 7f44141f6700 time 2023-05-24T05:09:18.014914+0000
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.0.0-4167-gfa0e62c4/rpm/el8/BUILD/ceph-18.0.0-4167-gfa0e62c4/src/mds/CInode.cc: 751: FAILED ceph_assert(!bad)

 ceph version 18.0.0-4167-gfa0e62c4 (fa0e62c4a1d8e4a737d9cbe50224f70009b79b28) reef (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x7f44246a9dbf]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x2a9f85) [0x7f44246a9f85]
 3: (CInode::verify_dirfrags()+0x4a4) [0x5637c30ff2a4]
 4: (MDCache::handle_fragment_notify(boost::intrusive_ptr<MMDSFragmentNotify const> const&)+0x28c) [0x5637c2f7eeec]
 5: (MDCache::dispatch(boost::intrusive_ptr<Message const> const&)+0x3a4) [0x5637c2fc2824]
 6: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x5a3) [0x5637c2df4613]
 7: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x5c) [0x5637c2df4c2c]
 8: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x1bf) [0x5637c2ddee6f]
 9: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x478) [0x7f4424922fc8]
 10: (DispatchQueue::entry()+0x50f) [0x7f442492016f]
 11: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f44249e64c1]
 12: /lib64/libpthread.so.0(+0x81cf) [0x7f4422a081cf]
 13: clone()

From: /teuthology/pdonnell-2023-05-23_18:20:18-fs-wip-pdonnell-testing-20230523.134409-distro-default-smithi/7284327/remote/smithi038/log/ceph-mds.c.log.gz
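
For readers unfamiliar with the check: the failed assertion is a consistency check between the inode's open dirfrags and its fragment tree. Every open dirfrag must be a leaf of the fragtree; any stale dirfrag (like the "101010100*" and "101010101*" entries above) sets a bad flag, and ceph_assert(!bad) then aborts. Below is a minimal, self-contained sketch of that kind of check. The names and types are simplified stand-ins, not the actual CInode/fragtree_t code.

    #include <cassert>
    #include <iostream>
    #include <set>
    #include <string>
    #include <vector>

    int main() {
      // Leaves of the inode's fragment tree, written as bit-prefixes in the
      // same style as the "101010100*" entries in the log above.
      std::set<std::string> fragtree_leaves = {"0*", "100*", "1010*", "1011*", "11*"};

      // Dirfrags currently open in cache on this MDS. "101*" was split into
      // "1010*"/"1011*" in the fragtree, so an open "101*" dirfrag is stale.
      std::vector<std::string> open_dirfrags = {"0*", "100*", "101*", "11*"};

      bool bad = false;
      for (const auto& df : open_dirfrags) {
        if (fragtree_leaves.count(df) == 0) {
          // Mirrors the "have open dirfrag ... but not leaf in fragtree_t(...)"
          // messages in the log above.
          std::cerr << "have open dirfrag " << df << " but not leaf in fragtree\n";
          bad = true;
        }
      }

      // Equivalent of the failed "ceph_assert(!bad)": any stale open dirfrag aborts.
      assert(!bad);
      return 0;
    }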

Actions #1

Updated by Milind Changire 11 months ago

  • Assignee set to Milind Changire
Actions #2

Updated by Patrick Donnelly 11 months ago

Suggest you grep for this assert in older teuthology runs of the `fs` suite to locate any missed past occurrences. That might help indicate whether this is a regression or not, or if it's just absurdly rare.

IIRC, this is a max_mds==5 cluster (as evidenced by the type of message the abort is from), so this appears to be some odd inconsistency. Pay special attention to the job yaml configuration as that may be related (some config change, for example, may cause this).

Actions #3

Updated by Milind Changire 6 months ago

The "mds debug frag: true" flag causes the verify_dirfrags() to run and sometimes cause an assertion failure which is the main reason for this ticket.

I guess this config flag wouldn't be turned on in production, so everything is fine otherwise?
Since the assertion failure happens in the handle_fragment_notify() path, this implies that it is the replica MDS on which the assertion fails. A small illustrative sketch of the gating is below.
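
To illustrate the gating: the expensive verification only runs when the debug option is enabled, so a production cluster with the default setting never reaches the assertion. The sketch below is a simplified standalone stand-in; the flag name mirrors "mds debug frag", but the handler and types are illustrative, not the real MDCache/CInode code.

    #include <cassert>
    #include <iostream>

    struct Config {
      // Stand-in for the "mds debug frag" option discussed above; off by default.
      bool mds_debug_frag = false;
    };

    struct Inode {
      bool dirfrags_consistent = true;  // toy stand-in for the real cache state
      void verify_dirfrags() const {
        // In the real MDS this walks the open dirfrags against the fragtree;
        // here we just assert on a flag to show where the abort would happen.
        assert(dirfrags_consistent);
      }
    };

    // Toy stand-in for the replica-side fragment-notify handler: the
    // verification only runs when the debug flag is set.
    void handle_fragment_notify(const Config& conf, Inode& diri) {
      // ... apply the fragment update received from the auth MDS ...
      if (conf.mds_debug_frag)
        diri.verify_dirfrags();
    }

    int main() {
      Config conf;  // mds_debug_frag=false: check skipped, as in production
      Inode diri;
      handle_fragment_notify(conf, diri);
      std::cout << "verification skipped with mds_debug_frag=false\n";
      return 0;
    }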

Actions #4

Updated by Venky Shankar 6 months ago

Milind Changire wrote:

The "mds debug frag: true" flag causes the verify_dirfrags() to run and sometimes cause an assertion failure which is the main reason for this ticket.

I guess this config flag wouldn't be turned on in production, so everything is fine otherwise?

Sounds right (the config is disabled by default).

Since the assertion failure happens in the handle_fragment_notify() path, this implies that it is the replica MDS on which the assertion fails.

That's correct.

Actions #5

Updated by Milind Changire 17 days ago

Old logs have been purged.
Deferring until a fresh QA report is available.
