Project

General

Profile

Actions

Bug #48517

closed

mds: "CDir.cc: 1530: FAILED ceph_assert(!is_complete())"

Added by Patrick Donnelly over 3 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
Correctness/Safety
Target version:
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
crash, qa-failure
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2020-12-09T04:21:21.274+0000 7fea84b9c700 10 mds.0.openfiles prefetch_inodes
2020-12-09T04:21:21.274+0000 7fea84b9c700 10 mds.0.openfiles _prefetch_inodes state 1
2020-12-09T04:21:21.274+0000 7fea84b9c700 10 mds.0.openfiles _prefetch_dirfrags
2020-12-09T04:21:21.274+0000 7fea84b9c700 10 mds.0.cache.dir(0x1) fetch on [dir 0x1 / [2,head] auth v=488 cv=0/0 dir_auth=0 state=1610874880 f(v0 m2020-12-09T04:13:41.581699+0000 1=0+1) n(v33 rc2020-12-09T04:20:35.914083+0000 b51685884 8790=7766+1024) hs=1+0,ss=0+0 dirty=1 | child=1 subtree=1 dirty=1 0x556e1f0f1180]
2020-12-09T04:21:21.274+0000 7fea84b9c700 10 mds.0.cache.dir(0x1) auth_pin by 0x556e1f0f1180 on [dir 0x1 / [2,head] auth v=488 cv=0/0 dir_auth=0 ap=1+0 state=1610874880 f(v0 m2020-12-09T04:13:41.581699+0000 1=0+1) n(v33 rc2020-12-09T04:20:35.914083+0000 b51685884 8790=7766+1024) hs=1+0,ss=0+0 dirty=1 | child=1 subtree=1 dirty=1 waiter=1 authpin=1 0x556e1f0f1180] count now 1
2020-12-09T04:21:21.274+0000 7fea84b9c700  1 -- [v2:172.21.15.73:6834/1493964564,v1:172.21.15.73:6835/1493964564] --> [v2:172.21.15.73:6824/16384,v1:172.21.15.73:6825/16384] -- osd_op(unknown.0.63:50 3.7 3:ff5b34d6:::1.00000000:head [omap-get-header,omap-get-vals in=16b,getxattr parent in=6b] snapc 0=[] ondisk+read+known_if_redirected+full_force e36) v8 -- 0x556e236b43c0 con 0x556e1f0f8000
2020-12-09T04:21:21.274+0000 7fea84b9c700 10 mds.0.cache.dir(0x1000000171d.0*) fetch on [dir 0x1000000171d.0* ~mds0/stray7/1000000171d/ [2,head] auth v=851 cv=0/0 state=1610612736 f()/f(v0 m2020-12-09T04:20:31.086183+0000) n(v5)/n(v5 rc2020-12-09T04:20:31.086183+0000) hs=0+244,ss=0+0 dirty=244 | child=1 dirty=1 0x556e241c9b00]
2020-12-09T04:21:21.274+0000 7fea84b9c700  7 mds.0.cache.dir(0x1000000171d.0*) fetch dirfrag for unlinked directory, mark complete
2020-12-09T04:21:21.274+0000 7fea84b9c700 10 mds.0.cache.dir(0x1000000171d.0*) fetch on [dir 0x1000000171d.0* ~mds0/stray7/1000000171d/ [2,head] auth v=851 cv=0/0 state=1610612737|complete f()/f(v0 m2020-12-09T04:20:31.086183+0000) n(v5)/n(v5 rc2020-12-09T04:20:31.086183+0000) hs=0+244,ss=0+0 dirty=244 | child=1 dirty=1 0x556e241c9b00]
2020-12-09T04:21:21.274+0000 7fea87ba2700  1 -- [v2:172.21.15.73:6834/1493964564,v1:172.21.15.73:6835/1493964564] <== osd.3 v2:172.21.15.73:6824/16384 3 ==== osd_op_reply(50 1.00000000 [omap-get-header out=274b,omap-get-vals out=5b,getxattr] v0'0 uv1 ondisk = 0) v8 ==== 238+0+279 (crc 0 0 0) 0x556e22af7cc0 con 0x556e1f0f8000
2020-12-09T04:21:21.274+0000 7fea7eb90700 10 MDSIOContextBase::complete: 21C_IO_Dir_OMAP_Fetched
2020-12-09T04:21:21.278+0000 7fea84b9c700 -1 /build/ceph-16.0.0-7967-ge4d2d676/src/mds/CDir.cc: In function 'void CDir::fetch(MDSContext*, std::string_view, bool)' thread 7fea84b9c700 time 2020-12-09T04:21:21.280738+0000
/build/ceph-16.0.0-7967-ge4d2d676/src/mds/CDir.cc: 1530: FAILED ceph_assert(!is_complete())

 ceph version 16.0.0-7967-ge4d2d676 (e4d2d67692ca960325f1d91d5ef9332c7cb9f37d) pacific (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14b) [0x7fea8cea1e2d]
 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x7fea8cea2008]
 3: (CDir::fetch(MDSContext*, std::basic_string_view<char, std::char_traits<char> >, bool)+0xc29) [0x556e1ddae5b9]
 4: (CDir::fetch(MDSContext*, bool)+0x3a) [0x556e1ddae6da]
 5: (OpenFileTable::_prefetch_dirfrags()+0x4cc) [0x556e1de766dc]
 6: (OpenFileTable::_open_ino_finish(inodeno_t, int)+0x246) [0x556e1de77506]
 7: (OpenFileTable::_prefetch_inodes()+0x280) [0x556e1de75db0]
 8: (OpenFileTable::prefetch_inodes()+0xf8) [0x556e1de77728]
 9: (MDCache::process_imported_caps()+0x184) [0x556e1dcb2634]
 10: (MDCache::rejoin_start(MDSContext*)+0x82) [0x556e1dcb40b2]
 11: (MDSRank::rejoin_start()+0x162) [0x556e1db5aba2]
...

From: /ceph/teuthology-archive/pdonnell-2020-12-09_02:22:27-fs-wip-pdonnell-testing-20201207.220433-distro-basic-smithi/5694731/remote/smithi073/log/ceph-mds.a.log.gz

It seems we're doing two fetches on the same dirfrag. It might be we need to make this a std::set:

https://github.com/ceph/ceph/blob/1c6af5eb33f9dc172b3dd186cc0c567521fbfadd/src/mds/OpenFileTable.cc#L1056

however I'm not yet certain it's not a bug that we're adding the same dirfrag to the vector twice.

Actions #1

Updated by Patrick Donnelly over 3 years ago

  • Status changed from New to In Progress
  • Assignee set to Patrick Donnelly
  • Pull request ID set to 38512
Actions #2

Updated by Xiubo Li over 3 years ago

  • Assignee changed from Patrick Donnelly to Xiubo Li

As we talked last week, I will take this and will figure out why the same CDir was fetched twice.
Thanks.

Actions #3

Updated by Xiubo Li over 3 years ago

The MDS is doing the merge:

10100002:2020-12-09T04:21:19.478+0000 7fea7d38d700 10 mds.0.cache.dir(0x1000000171d.0*) merge [0x556e23daa000,0x556e23daa480,0x556e23daa900,0x556e23daad80]
10100003:2020-12-09T04:21:19.478+0000 7fea7d38d700 10 mds.0.cache.dir(0x1000000171d.0*) setting dir_auth=-1,-2 from -1,-2 on [dir 0x1000000171d.0* /client.0/tmp/fsstress/ltp-full-20091231/testcases/open_posix_testsuite/conformance/interfaces/sigaction/ [2,head] auth v=0 cv=0/0 state=1073741824 f() n() hs=0+0,ss=0+0 0x556e241c9b00]
10100004:2020-12-09T04:21:19.478+0000 7fea7d38d700 15 mds.0.cache.ino(0x1000000171d) maybe_export_pin update=0 [inode 0x1000000171d [...2,head] /client.0/tmp/fsstress/ltp-full-20091231/testcases/open_posix_testsuite/conformance/interfaces/sigaction/ auth fragtree_t(*^1 1*^2) v8892 DIRTYPARENT f(v0 m2020-12-09T04:20:35.050101+0000 24=23+1) n(v5 rc2020-12-09T04:20:35.050101+0000 b110802 45=43+2) (inest sync dirty) (iversion lock) | dirtyscattered=1 dirfrag=8 dirtyparent=1 dirty=1 0x556e23d2f700]
10100005:2020-12-09T04:21:19.478+0000 7fea7d38d700 10 mds.0.cache.dir(0x1000000171d.0*)  subfrag 000* [dir 0x1000000171d.000* /client.0/tmp/fsstress/ltp-full-20091231/testcases/open_posix_testsuite/conformance/interfaces/sigaction/ [2,head] auth v=705 cv=0/0 state=1610612737|complete f(v0 m2020-12-09T04:20:30.058203+0000) n(v5 rc2020-12-09T04:20:30.058203+0000) hs=0+55,ss=0+0 dirty=55 | child=1 dirty=1 0x556e23daa000]
...
10100062:2020-12-09T04:21:19.478+0000 7fea7d38d700 14 mds.0.cache.ino(0x1000000171d) close_dirfrag 000*
10100063:2020-12-09T04:21:19.478+0000 7fea7d38d700 12 mds.0.cache.dir(0x1000000171d.000*) remove_null_dentries [dir 0x1000000171d.000* /client.0/tmp/fsstress/ltp-full-20091231/testcases/open_posix_testsuite/conformance/interfaces/sigaction/ [2,head] auth v=705 cv=0/0 state=1073741825|complete f(v0 m2020-12-09T04:20:30.058203+0000) n(v5 rc2020-12-09T04:20:30.058203+0000) hs=0+0,ss=0+0 dirty=55 0x556e23daa000]
10100064:2020-12-09T04:21:19.478+0000 7fea7d38d700 10 mds.0.cache.dir(0x1000000171d.0*)  subfrag 001* [dir 0x1000000171d.001* /client.0/tmp/fsstress/ltp-full-20091231/testcases/open_posix_testsuite/conformance/interfaces/sigaction/ [2,head] auth v=746 cv=0/0 state=1610612737|complete f(v0 m2020-12-09T04:20:30.190201+0000) n(v5 rc2020-12-09T04:20:30.190201+0000) hs=0+58,ss=0+0 dirty=58 | child=1 dirty=1 0x556e23daa480]
....
10100124:2020-12-09T04:21:19.478+0000 7fea7d38d700 14 mds.0.cache.ino(0x1000000171d) close_dirfrag 001*
10100125:2020-12-09T04:21:19.478+0000 7fea7d38d700 12 mds.0.cache.dir(0x1000000171d.001*) remove_null_dentries [dir 0x1000000171d.001* /client.0/tmp/fsstress/ltp-full-20091231/testcases/open_posix_testsuite/conformance/interfaces/sigaction/ [2,head] auth v=746 cv=0/0 state=1073741825|complete f(v0 m2020-12-09T04:20:30.190201+0000) n(v5 rc2020-12-09T04:20:30.190201+0000) hs=0+0,ss=0+0 dirty=58 0x556e23daa480]
10100126:2020-12-09T04:21:19.478+0000 7fea7d38d700 10 mds.0.cache.dir(0x1000000171d.0*)  subfrag 010* [dir 0x1000000171d.010* /client.0/tmp/fsstress/ltp-full-20091231/testcases/open_posix_testsuite/conformance/interfaces/sigaction/ [2,head] auth v=743 cv=0/0 state=1610612737|complete f(v0 m2020-12-09T04:20:30.318198+0000) n(v5 rc2020-12-09T04:20:30.318198+0000) hs=0+56,ss=0+0 dirty=56 | child=1 dirty=1 0x556e23daa900]
...
10100184:2020-12-09T04:21:19.478+0000 7fea7d38d700 14 mds.0.cache.ino(0x1000000171d) close_dirfrag 010*
10100185:2020-12-09T04:21:19.478+0000 7fea7d38d700 12 mds.0.cache.dir(0x1000000171d.010*) remove_null_dentries [dir 0x1000000171d.010* /client.0/tmp/fsstress/ltp-full-20091231/testcases/open_posix_testsuite/conformance/interfaces/sigaction/ [2,head] auth v=743 cv=0/0 state=1073741825|complete f(v0 m2020-12-09T04:20:30.318198+0000) n(v5 rc2020-12-09T04:20:30.318198+0000) hs=0+0,ss=0+0 dirty=56 0x556e23daa900]
10100186:2020-12-09T04:21:19.478+0000 7fea7d38d700 10 mds.0.cache.dir(0x1000000171d.0*)  subfrag 011* [dir 0x1000000171d.011* /client.0/tmp/fsstress/ltp-full-20091231/testcases/open_posix_testsuite/conformance/interfaces/sigaction/ [2,head] auth v=851 cv=0/0 state=1610612737|complete f(v0 m2020-12-09T04:20:31.086183+0000) n(v5 rc2020-12-09T04:20:31.086183+0000) hs=0+75,ss=0+0 dirty=75 | child=1 dirty=1 0x556e23daad80]
...
10100263:2020-12-09T04:21:19.478+0000 7fea7d38d700 14 mds.0.cache.ino(0x1000000171d) close_dirfrag 011*
10100264:2020-12-09T04:21:19.478+0000 7fea7d38d700 12 mds.0.cache.dir(0x1000000171d.011*) remove_null_dentries [dir 0x1000000171d.011* /client.0/tmp/fsstress/ltp-full-20091231/testcases/open_posix_testsuite/conformance/interfaces/sigaction/ [2,head] auth v=851 cv=0/0 state=1073741825|complete f(v0 m2020-12-09T04:20:31.086183+0000) n(v5 rc2020-12-09T04:20:31.086183+0000) hs=0+0,ss=0+0 dirty=75 0x556e23daad80]
10100266:2020-12-09T04:21:19.478+0000 7fea7d38d700 10 mds.0.journal EMetaBlob.replay dir 0x1000000171d.0*
10100267:2020-12-09T04:21:19.478+0000 7fea7d38d700 10 mds.0.journal EMetaBlob.replay updated dir [dir 0x1000000171d.0* /client.0/tmp/fsstress/ltp-full-20091231/testcases/open_posix_testsuite/conformance/interfaces/sigaction/ [2,head] auth v=851 cv=0/0 state=1610612736 f()/f(v0 m2020-12-09T04:20:31.086183+0000) n(v5)/n(v5 rc2020-12-09T04:20:31.086183+0000) hs=0+244,ss=0+0 dirty=244 | child=1 dirty=1 0x556e241c9b00]

After the subdirs of [0x1000000171d.000*, 0x1000000171d.001*, 0x1000000171d.010*, 0x1000000171d.011*] successfuly merged into [0x1000000171d.0*], they still remain in the loaded_anchor_map.

1067     for (auto& fg: anchor.frags) {
1068       CDir *dir = diri->get_dirfrag(fg);  
1069       if (dir) {
1070         if (dir->is_auth() && !dir->is_complete())
1071           fetch_queue.insert(dir);
1072       } else {
1073         frag_vec_t leaves;
1074         diri->dirfragtree.get_leaves_under(fg, leaves);
1075         if (leaves.empty())
1076           leaves.push_back(diri->dirfragtree[fg.value()]);                                                                           
1077         for (auto& leaf : leaves) {
1078           if (diri->is_auth()) {
1079             dir = diri->get_or_open_dirfrag(mdcache, leaf);   
1080           } else {
1081             dir = diri->get_dirfrag(leaf);      
1082           }
1083           if (dir && dir->is_auth() && !dir->is_complete())
1084             fetch_queue.push_back(dir);            
1085         }
1086       }
1087     }

Dues to the Line#1076 will always return and push the parent frag "0*" into the "leaves", when traversing the first 4 frags of [000*, 001*, 010*, 011*, 100*, 101*, 110*, 111*], which will lead to the (dir 0x1000000171d.0*) into fetch_queue vector 4 times.

Actions #4

Updated by Xiubo Li over 3 years ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID changed from 38512 to 38579
Actions #5

Updated by Patrick Donnelly over 3 years ago

  • Status changed from Fix Under Review to Resolved
Actions

Also available in: Atom PDF