Project

General

Profile

Actions

Bug #22374

closed

luminous: mds: SimpleLock::num_rdlock overloaded

Added by Xuehan Xu over 6 years ago. Updated over 6 years ago.

Status:
Duplicate
Priority:
Normal
Assignee:
Category:
Correctness/Safety
Target version:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
fs
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Recently, when doing massive directory delete test, both active mds and standby mds aborted and can't be started again. The error log is as follows:

@ 6> 2017-12-04 17:59:45.509030 7ff4bb7fb700 7 mds.0.locker rdlock_finish on (ifile sync) on [inode 0x10004cec485 [...2,head] ~mds0/stray5/10004cec485/ auth v13769303 ap=1+0 dirtyparent f() n(v0 1=0+1) (isnap sync r=1) (inest lock) (iversion lock) caps={27077199=p/@0},l=27077199 | request=1 lock=1 caps=1 dirtyparent=1 dirty=1 authpin=1 0x7ff2d220b200]
5> 2017-12-04 17:59:45.509042 7ff4bb7fb700 7 mds.0.locker file_eval wanted= loner_wanted= other_wanted= filelock=(ifile sync) on [inode 0x10004cec485 [...2,head] ~mds0/stray5/10004cec485/ auth v13769303 ap=1+0 dirtyparent f() n(v0 1=0+1) (isnap sync r=1) (inest lock) (iversion lock) caps={27077199=p/@0},l=27077199 | request=1 lock=1 caps=1 dirtyparent=1 dirty=1 authpin=1 0x7ff2d220b200]
4> 2017-12-04 17:59:45.509051 7ff4bb7fb700 7 mds.0.locker file_eval stable, bump to loner (ifile sync) on [inode 0x10004cec485 [...2,head] ~mds0/stray5/10004cec485/ auth v13769303 ap=1+0 dirtyparent f() n(v0 1=0+1) (isnap sync r=1) (inest lock) (iversion lock) caps={27077199=p/@0},l=27077199 | request=1 lock=1 caps=1 dirtyparent=1 dirty=1 authpin=1 0x7ff2d220b200]
3> 2017-12-04 17:59:45.509059 7ff4bb7fb700 7 mds.0.locker file_excl (ifile sync) on [inode 0x10004cec485 [...2,head] ~mds0/stray5/10004cec485/ auth v13769303 ap=1+0 dirtyparent f() n(v0 1=0+1) (isnap sync r=1) (inest lock) (iversion lock) caps={27077199=p/@0},l=27077199 | request=1 lock=1 caps=1 dirtyparent=1 dirty=1 authpin=1 0x7ff2d220b200]
2> 2017-12-04 17:59:45.509073 7ff4bb7fb700 7 mds.0.locker rdlock_finish on (isnap sync) on [inode 0x10004cec485 [...2,head] ~mds0/stray5/10004cec485/ auth v13769303 ap=1+0 dirtyparent f() n(v0 1=0+1) (inest lock) (ifile excl) (iversion lock) caps={27077199=p/@0},l=27077199 | request=1 lock=0 caps=1 dirtyparent=1 dirty=1 authpin=1 0x7ff2d220b200]
1> 2017-12-04 17:59:45.509087 7ff4bb7fb700 10 mds.0.locker simple_eval (isnap sync) on [inode 0x10004cec485 [...2,head] ~mds0/stray5/10004cec485/ auth v13769303 ap=1+0 dirtyparent f() n(v0 1=0+1) (inest lock) (ifile excl) (iversion lock) caps={27077199=p/@0},l=27077199 | request=1 lock=0 caps=1 dirtyparent=1 dirty=1 authpin=1 0x7ff2d220b200]
0> 2017-12-04 17:59:45.514015 7ff4bb7fb700 -1 /home/xuxuehan/rpmbuild/BUILD/ceph-12.2.1-26-g9915a2f/src/mds/SimpleLock.h: In function 'int SimpleLock::put_rdlock()' thread 7ff4bb7fb700 time 2017-12-04 17:59:45.509128
/home/xuxuehan/rpmbuild/BUILD/ceph-12.2.1-26-g9915a2f/src/mds/SimpleLock.h: 450: FAILED assert(num_rdlock>0)

ceph version 12.2.1-26-g9915a2f (9915a2f600788c37fb42f39a766ce93b9ae4d816) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x7ff4d334c940]
2: (()+0x295722) [0x7ff4d300d722]
3: (Locker::rdlock_finish(SimpleLock*, MutationImpl*, bool*)+0x23c) [0x7ff4d31b2d2c]
4: (Locker::_drop_rdlocks(MutationImpl*, std::set<CInode*, std::less<CInode*>, std::allocator<CInode*> >)+0x64) [0x7ff4d31b2dc4]
5: (Locker::drop_locks(MutationImpl
, std::set<CInode*, std::less<CInode*>, std::allocator<CInode*> >)+0x84) [0x7ff4d31b6e84]
6: (MDCache::request_cleanup(boost::intrusive_ptr<MDRequestImpl>&)+0x11a) [0x7ff4d310528a]
7: (MDCache::request_finish(boost::intrusive_ptr<MDRequestImpl>&)+0x17b) [0x7ff4d31056eb]
8: (Server::reply_client_request(boost::intrusive_ptr<MDRequestImpl>&, MClientReply
)+0x4e4) [0x7ff4d30816f4]
9: (Server::respond_to_request(boost::intrusive_ptr<MDRequestImpl>&, int)+0x459) [0x7ff4d3082279]
10: (Server::_unlink_local_finish(boost::intrusive_ptr<MDRequestImpl>&, CDentry*, CDentry*, unsigned long)+0x2ae) [0x7ff4d308ca7e]
@

We debugged the coredump, and found that the num_rdlock of "snaplock" of parent directories' inodes is -32617:

$2 = {_vptr.SimpleLock = 0x7ff4d3ae2c50 <vtable for SimpleLock+16>, type = 0x7ff4d3b0ae60 <CInode::snaplock_type>, static WAIT_RD = 1, static WAIT_WR = 2, static WAIT_XLOCK = 4, static WAIT_STABLE = 4, static WAIT_REMOTEXLOCK = 8, static WAIT_BITS = 4,
static WAIT_ALL = 15, parent = 0x7ff4b7c5a400, state = 1, num_rdlock = -32617, num_client_lease = 0, Python Exception <type 'exceptions.TypeError'> expected string or buffer:
_unstable = , static empty_gather_set = {_M_t = {
_M_impl = {<std::allocator<std::_Rb_tree_node<int> >> = {<__gnu_cxx::new_allocator<std::_Rb_tree_node<int> >> = {<No data fields>}, <No data fields>}, _M_key_compare = {<std::binary_function<int, int, bool>> = {<No data fields>}, <No data fields>}, _M_header = {
_M_color = std::_S_red, _M_parent = 0x0, _M_left = 0x7ff4d3b0a9a8 <SimpleLock::empty_gather_set+8>, _M_right = 0x7ff4d3b0a9a8 <SimpleLock::empty_gather_set+8>}, _M_node_count = 0}}}}

We read the source code and found that num_rdlock can only be subtracted by put_rdlock which would first test the assertion "num_rdlock > 0" and wouldn't let it be subtracted down to -32617, so we thought that this could be due to overload of the field. num_rdlock's type is "__s16", and we were trying to delete 7 million files through 70 concurrent clients. So we modified the field's type to "__s64", the error didn't show up anymore.


Related issues 1 (0 open1 closed)

Is duplicate of CephFS - Bug #21843: mds: preserve order of requests during recovery of multimds clusterResolved10/19/2017

Actions
Actions

Also available in: Atom PDF