Bug #22374
closedluminous: mds: SimpleLock::num_rdlock overloaded
0%
Description
Recently, when doing massive directory delete test, both active mds and standby mds aborted and can't be started again. The error log is as follows:
@ 6> 2017-12-04 17:59:45.509030 7ff4bb7fb700 7 mds.0.locker rdlock_finish on (ifile sync) on [inode 0x10004cec485 [...2,head] ~mds0/stray5/10004cec485/ auth v13769303 ap=1+0 dirtyparent f() n(v0 1=0+1) (isnap sync r=1) (inest lock) (iversion lock) caps={27077199=p/@0},l=27077199 | request=1 lock=1 caps=1 dirtyparent=1 dirty=1 authpin=1 0x7ff2d220b200]
5> 2017-12-04 17:59:45.509042 7ff4bb7fb700 7 mds.0.locker file_eval wanted= loner_wanted= other_wanted= filelock=(ifile sync) on [inode 0x10004cec485 [...2,head] ~mds0/stray5/10004cec485/ auth v13769303 ap=1+0 dirtyparent f() n(v0 1=0+1) (isnap sync r=1) (inest lock) (iversion lock) caps={27077199=p/@0},l=27077199 | request=1 lock=1 caps=1 dirtyparent=1 dirty=1 authpin=1 0x7ff2d220b200]
4> 2017-12-04 17:59:45.509051 7ff4bb7fb700 7 mds.0.locker file_eval stable, bump to loner (ifile sync) on [inode 0x10004cec485 [...2,head] ~mds0/stray5/10004cec485/ auth v13769303 ap=1+0 dirtyparent f() n(v0 1=0+1) (isnap sync r=1) (inest lock) (iversion lock) caps={27077199=p/@0},l=27077199 | request=1 lock=1 caps=1 dirtyparent=1 dirty=1 authpin=1 0x7ff2d220b200]
3> 2017-12-04 17:59:45.509059 7ff4bb7fb700 7 mds.0.locker file_excl (ifile sync) on [inode 0x10004cec485 [...2,head] ~mds0/stray5/10004cec485/ auth v13769303 ap=1+0 dirtyparent f() n(v0 1=0+1) (isnap sync r=1) (inest lock) (iversion lock) caps={27077199=p/@0},l=27077199 | request=1 lock=1 caps=1 dirtyparent=1 dirty=1 authpin=1 0x7ff2d220b200]
2> 2017-12-04 17:59:45.509073 7ff4bb7fb700 7 mds.0.locker rdlock_finish on (isnap sync) on [inode 0x10004cec485 [...2,head] ~mds0/stray5/10004cec485/ auth v13769303 ap=1+0 dirtyparent f() n(v0 1=0+1) (inest lock) (ifile excl) (iversion lock) caps={27077199=p/@0},l=27077199 | request=1 lock=0 caps=1 dirtyparent=1 dirty=1 authpin=1 0x7ff2d220b200]
1> 2017-12-04 17:59:45.509087 7ff4bb7fb700 10 mds.0.locker simple_eval (isnap sync) on [inode 0x10004cec485 [...2,head] ~mds0/stray5/10004cec485/ auth v13769303 ap=1+0 dirtyparent f() n(v0 1=0+1) (inest lock) (ifile excl) (iversion lock) caps={27077199=p/@0},l=27077199 | request=1 lock=0 caps=1 dirtyparent=1 dirty=1 authpin=1 0x7ff2d220b200]
0> 2017-12-04 17:59:45.514015 7ff4bb7fb700 -1 /home/xuxuehan/rpmbuild/BUILD/ceph-12.2.1-26-g9915a2f/src/mds/SimpleLock.h: In function 'int SimpleLock::put_rdlock()' thread 7ff4bb7fb700 time 2017-12-04 17:59:45.509128
/home/xuxuehan/rpmbuild/BUILD/ceph-12.2.1-26-g9915a2f/src/mds/SimpleLock.h: 450: FAILED assert(num_rdlock>0)
ceph version 12.2.1-26-g9915a2f (9915a2f600788c37fb42f39a766ce93b9ae4d816) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x7ff4d334c940]
2: (()+0x295722) [0x7ff4d300d722]
3: (Locker::rdlock_finish(SimpleLock*, MutationImpl*, bool*)+0x23c) [0x7ff4d31b2d2c]
4: (Locker::_drop_rdlocks(MutationImpl*, std::set<CInode*, std::less<CInode*>, std::allocator<CInode*> >)+0x64) [0x7ff4d31b2dc4]
5: (Locker::drop_locks(MutationImpl, std::set<CInode*, std::less<CInode*>, std::allocator<CInode*> >)+0x84) [0x7ff4d31b6e84]
6: (MDCache::request_cleanup(boost::intrusive_ptr<MDRequestImpl>&)+0x11a) [0x7ff4d310528a]
7: (MDCache::request_finish(boost::intrusive_ptr<MDRequestImpl>&)+0x17b) [0x7ff4d31056eb]
8: (Server::reply_client_request(boost::intrusive_ptr<MDRequestImpl>&, MClientReply)+0x4e4) [0x7ff4d30816f4]
9: (Server::respond_to_request(boost::intrusive_ptr<MDRequestImpl>&, int)+0x459) [0x7ff4d3082279]
10: (Server::_unlink_local_finish(boost::intrusive_ptr<MDRequestImpl>&, CDentry*, CDentry*, unsigned long)+0x2ae) [0x7ff4d308ca7e]
@
We debugged the coredump, and found that the num_rdlock of "snaplock" of parent directories' inodes is -32617:
$2 = {_vptr.SimpleLock = 0x7ff4d3ae2c50 <vtable for SimpleLock+16>, type = 0x7ff4d3b0ae60 <CInode::snaplock_type>, static WAIT_RD = 1, static WAIT_WR = 2, static WAIT_XLOCK = 4, static WAIT_STABLE = 4, static WAIT_REMOTEXLOCK = 8, static WAIT_BITS = 4,
static WAIT_ALL = 15, parent = 0x7ff4b7c5a400, state = 1, num_rdlock = -32617, num_client_lease = 0, Python Exception <type 'exceptions.TypeError'> expected string or buffer:
_unstable = , static empty_gather_set = {_M_t = {
_M_impl = {<std::allocator<std::_Rb_tree_node<int> >> = {<__gnu_cxx::new_allocator<std::_Rb_tree_node<int> >> = {<No data fields>}, <No data fields>}, _M_key_compare = {<std::binary_function<int, int, bool>> = {<No data fields>}, <No data fields>}, _M_header = {
_M_color = std::_S_red, _M_parent = 0x0, _M_left = 0x7ff4d3b0a9a8 <SimpleLock::empty_gather_set+8>, _M_right = 0x7ff4d3b0a9a8 <SimpleLock::empty_gather_set+8>}, _M_node_count = 0}}}}
We read the source code and found that num_rdlock can only be subtracted by put_rdlock which would first test the assertion "num_rdlock > 0" and wouldn't let it be subtracted down to -32617, so we thought that this could be due to overload of the field. num_rdlock's type is "__s16", and we were trying to delete 7 million files through 70 concurrent clients. So we modified the field's type to "__s64", the error didn't show up anymore.
Updated by Xuehan Xu over 6 years ago
Sorry, I misedited the log:
@ 6> 2017-12-04 17:59:45.509030 7ff4bb7fb700 7 mds.0.locker rdlock_finish on (ifile sync) on [inode 0x10004cec485 [...2,head] ~mds0/stray5/10004cec485/ auth v13769303 ap=1+0 dirtyparent f() n(v0 1=0+1) (isnap sync r=1) (inest lock) (iversion lock) caps={27077199=p/@0},l=27077199 | request=1 lock=1 caps=1 dirtyparent=1 dirty=1 authpin=1 0x7ff2d220b200]
5> 2017-12-04 17:59:45.509042 7ff4bb7fb700 7 mds.0.locker file_eval wanted= loner_wanted= other_wanted= filelock=(ifile sync) on [inode 0x10004cec485 [...2,head] ~mds0/stray5/10004cec485/ auth v13769303 ap=1+0 dirtyparent f() n(v0 1=0+1) (isnap sync r=1) (inest lock) (iversion lock) caps={27077199=p/@0},l=27077199 | request=1 lock=1 caps=1 dirtyparent=1 dirty=1 authpin=1 0x7ff2d220b200]
4> 2017-12-04 17:59:45.509051 7ff4bb7fb700 7 mds.0.locker file_eval stable, bump to loner (ifile sync) on [inode 0x10004cec485 [...2,head] ~mds0/stray5/10004cec485/ auth v13769303 ap=1+0 dirtyparent f() n(v0 1=0+1) (isnap sync r=1) (inest lock) (iversion lock) caps={27077199=p/@0},l=27077199 | request=1 lock=1 caps=1 dirtyparent=1 dirty=1 authpin=1 0x7ff2d220b200]
3> 2017-12-04 17:59:45.509059 7ff4bb7fb700 7 mds.0.locker file_excl (ifile sync) on [inode 0x10004cec485 [...2,head] ~mds0/stray5/10004cec485/ auth v13769303 ap=1+0 dirtyparent f() n(v0 1=0+1) (isnap sync r=1) (inest lock) (iversion lock) caps={27077199=p/@0},l=27077199 | request=1 lock=1 caps=1 dirtyparent=1 dirty=1 authpin=1 0x7ff2d220b200]
2> 2017-12-04 17:59:45.509073 7ff4bb7fb700 7 mds.0.locker rdlock_finish on (isnap sync) on [inode 0x10004cec485 [...2,head] ~mds0/stray5/10004cec485/ auth v13769303 ap=1+0 dirtyparent f() n(v0 1=0+1) (inest lock) (ifile excl) (iversion lock) caps={27077199=p/@0},l=27077199 | request=1 lock=0 caps=1 dirtyparent=1 dirty=1 authpin=1 0x7ff2d220b200]
1> 2017-12-04 17:59:45.509087 7ff4bb7fb700 10 mds.0.locker simple_eval (isnap sync) on [inode 0x10004cec485 [...2,head] ~mds0/stray5/10004cec485/ auth v13769303 ap=1+0 dirtyparent f() n(v0 1=0+1) (inest lock) (ifile excl) (iversion lock) caps={27077199=p/@0},l=27077199 | request=1 lock=0 caps=1 dirtyparent=1 dirty=1 authpin=1 0x7ff2d220b200]
0> 2017-12-04 17:59:45.514015 7ff4bb7fb700 -1 /home/xuxuehan/rpmbuild/BUILD/ceph-12.2.1-26-g9915a2f/src/mds/SimpleLock.h: In function 'int SimpleLock::put_rdlock()' thread 7ff4bb7fb700 time 2017-12-04 17:59:45.509128
/home/xuxuehan/rpmbuild/BUILD/ceph-12.2.1-26-g9915a2f/src/mds/SimpleLock.h: 450: FAILED assert(num_rdlock>0)
ceph version 12.2.1-26-g9915a2f (9915a2f600788c37fb42f39a766ce93b9ae4d816) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x7ff4d334c940]
2: (()+0x295722) [0x7ff4d300d722]
3: (Locker::rdlock_finish(SimpleLock*, MutationImpl*, bool*)+0x23c) [0x7ff4d31b2d2c]
4: (Locker::_drop_rdlocks(MutationImpl*, std::set<CInode*, std::less<CInode*>, std::allocator<CInode*> >)+0x64) [0x7ff4d31b2dc4]
5: (Locker::drop_locks(MutationImpl, std::set<CInode*, std::less<CInode*>, std::allocator<CInode*> >)+0x84) [0x7ff4d31b6e84]
6: (MDCache::request_cleanup(boost::intrusive_ptr<MDRequestImpl>&)+0x11a) [0x7ff4d310528a]
7: (MDCache::request_finish(boost::intrusive_ptr<MDRequestImpl>&)+0x17b) [0x7ff4d31056eb]
8: (Server::reply_client_request(boost::intrusive_ptr<MDRequestImpl>&, MClientReply)+0x4e4) [0x7ff4d30816f4]
9: (Server::respond_to_request(boost::intrusive_ptr<MDRequestImpl>&, int)+0x459) [0x7ff4d3082279]
10: (Server::_unlink_local_finish(boost::intrusive_ptr<MDRequestImpl>&, CDentry*, CDentry*, unsigned long)+0x2ae) [0x7ff4d308ca7e]
@
Updated by Xuehan Xu over 6 years ago
@ 6> 2017-12-04 17:59:45.509030 7ff4bb7fb700 7 mds.0.locker rdlock_finish on (ifile sync) on [inode 0x10004cec485 [...2,head] ~mds0/stray5/10004cec485/ auth v13769303 ap=1+0 dirtyparent f() n(v0 1=0+1) (isnap sync r=1) (inest lock) (iversion lock) caps={27077199=p/-@0},l=27077199 | request=1 lock=1 caps=1 dirtyparent=1 dirty=1 authpin=1 0x7ff2d220b200]
5> 2017-12-04 17:59:45.509042 7ff4bb7fb700 7 mds.0.locker file_eval wanted= loner_wanted= other_wanted= filelock=(ifile sync) on [inode 0x10004cec485 [...2,head] ~mds0/stray5/10004cec485/ auth v13769303 ap=1+0 dirtyparent f() n(v0 1=0+1) (isnap sync r=1) (inest lock) (iversion lock) caps={27077199=p/-@0},l=27077199 | request=1 lock=1 caps=1 dirtyparent=1 dirty=1 authpin=1 0x7ff2d220b200]
4> 2017-12-04 17:59:45.509051 7ff4bb7fb700 7 mds.0.locker file_eval stable, bump to loner (ifile sync) on [inode 0x10004cec485 [...2,head] ~mds0/stray5/10004cec485/ auth v13769303 ap=1+0 dirtyparent f() n(v0 1=0+1) (isnap sync r=1) (inest lock) (iversion lock) caps={27077199=p/-@0},l=27077199 | request=1 lock=1 caps=1 dirtyparent=1 dirty=1 authpin=1 0x7ff2d220b200]
3> 2017-12-04 17:59:45.509059 7ff4bb7fb700 7 mds.0.locker file_excl (ifile sync) on [inode 0x10004cec485 [...2,head] ~mds0/stray5/10004cec485/ auth v13769303 ap=1+0 dirtyparent f() n(v0 1=0+1) (isnap sync r=1) (inest lock) (iversion lock) caps={27077199=p/-@0},l=27077199 | request=1 lock=1 caps=1 dirtyparent=1 dirty=1 authpin=1 0x7ff2d220b200]
2> 2017-12-04 17:59:45.509073 7ff4bb7fb700 7 mds.0.locker rdlock_finish on (isnap sync) on [inode 0x10004cec485 [...2,head] ~mds0/stray5/10004cec485/ auth v13769303 ap=1+0 dirtyparent f() n(v0 1=0+1) (inest lock) (ifile excl) (iversion lock) caps={27077199=p/-@0},l=27077199 | request=1 lock=0 caps=1 dirtyparent=1 dirty=1 authpin=1 0x7ff2d220b200]
1> 2017-12-04 17:59:45.509087 7ff4bb7fb700 10 mds.0.locker simple_eval (isnap sync) on [inode 0x10004cec485 [...2,head] ~mds0/stray5/10004cec485/ auth v13769303 ap=1+0 dirtyparent f() n(v0 1=0+1) (inest lock) (ifile excl) (iversion lock) caps={27077199=p/-@0},l=27077199 | request=1 lock=0 caps=1 dirtyparent=1 dirty=1 authpin=1 0x7ff2d220b200]
0> 2017-12-04 17:59:45.514015 7ff4bb7fb700 -1 /home/xuxuehan/rpmbuild/BUILD/ceph-12.2.1-26-g9915a2f/src/mds/SimpleLock.h: In function 'int SimpleLock::put_rdlock()' thread 7ff4bb7fb700 time 2017-12-04 17:59:45.509128
/home/xuxuehan/rpmbuild/BUILD/ceph-12.2.1-26-g9915a2f/src/mds/SimpleLock.h: 450: FAILED assert(num_rdlock>0)
ceph version 12.2.1-26-g9915a2f (9915a2f600788c37fb42f39a766ce93b9ae4d816) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x7ff4d334c940]
2: (()+0x295722) [0x7ff4d300d722]
3: (Locker::rdlock_finish(SimpleLock*, MutationImpl*, bool*)+0x23c) [0x7ff4d31b2d2c]
4: (Locker::_drop_rdlocks(MutationImpl*, std::set<CInode*, std::less<CInode*>, std::allocator<CInode*> >)+0x64) [0x7ff4d31b2dc4]
5: (Locker::drop_locks(MutationImpl, std::set<CInode*, std::less<CInode*>, std::allocator<CInode*> >)+0x84) [0x7ff4d31b6e84]
6: (MDCache::request_cleanup(boost::intrusive_ptr<MDRequestImpl>&)+0x11a) [0x7ff4d310528a]
7: (MDCache::request_finish(boost::intrusive_ptr<MDRequestImpl>&)+0x17b) [0x7ff4d31056eb]
8: (Server::reply_client_request(boost::intrusive_ptr<MDRequestImpl>&, MClientReply)+0x4e4) [0x7ff4d30816f4]
9: (Server::respond_to_request(boost::intrusive_ptr<MDRequestImpl>&, int)+0x459) [0x7ff4d3082279]
10: (Server::_unlink_local_finish(boost::intrusive_ptr<MDRequestImpl>&, CDentry*, CDentry*, unsigned long)+0x2ae) [0x7ff4d308ca7e]
@
Updated by Xuehan Xu over 6 years ago
-6> 2017-12-04 17:59:45.509030 7ff4bb7fb700 7 mds.0.locker rdlock_finish on (ifile sync) on [inode 0x10004cec485 [...2,head] ~mds0/stray5/10004cec485/ auth v13769303 ap=1+0 dirtyparent f() n(v0 1=0+1) (isnap sync r=1) (inest lock) (iversion lock) caps={27077199=p/-@0},l=27077199 | request=1 lock=1 caps=1 dirtyparent=1 dirty=1 authpin=1 0x7ff2d220b200] -5> 2017-12-04 17:59:45.509042 7ff4bb7fb700 7 mds.0.locker file_eval wanted= loner_wanted= other_wanted= filelock=(ifile sync) on [inode 0x10004cec485 [...2,head] ~mds0/stray5/10004cec485/ auth v13769303 ap=1+0 dirtyparent f() n(v0 1=0+1) (isnap sync r=1) (inest lock) (iversion lock) caps={27077199=p/-@0},l=27077199 | request=1 lock=1 caps=1 dirtyparent=1 dirty=1 authpin=1 0x7ff2d220b200] -4> 2017-12-04 17:59:45.509051 7ff4bb7fb700 7 mds.0.locker file_eval stable, bump to loner (ifile sync) on [inode 0x10004cec485 [...2,head] ~mds0/stray5/10004cec485/ auth v13769303 ap=1+0 dirtyparent f() n(v0 1=0+1) (isnap sync r=1) (inest lock) (iversion lock) caps={27077199=p/-@0},l=27077199 | request=1 lock=1 caps=1 dirtyparent=1 dirty=1 authpin=1 0x7ff2d220b200] -3> 2017-12-04 17:59:45.509059 7ff4bb7fb700 7 mds.0.locker file_excl (ifile sync) on [inode 0x10004cec485 [...2,head] ~mds0/stray5/10004cec485/ auth v13769303 ap=1+0 dirtyparent f() n(v0 1=0+1) (isnap sync r=1) (inest lock) (iversion lock) caps={27077199=p/-@0},l=27077199 | request=1 lock=1 caps=1 dirtyparent=1 dirty=1 authpin=1 0x7ff2d220b200] -2> 2017-12-04 17:59:45.509073 7ff4bb7fb700 7 mds.0.locker rdlock_finish on (isnap sync) on [inode 0x10004cec485 [...2,head] ~mds0/stray5/10004cec485/ auth v13769303 ap=1+0 dirtyparent f() n(v0 1=0+1) (inest lock) (ifile excl) (iversion lock) caps={27077199=p/-@0},l=27077199 | request=1 lock=0 caps=1 dirtyparent=1 dirty=1 authpin=1 0x7ff2d220b200] -1> 2017-12-04 17:59:45.509087 7ff4bb7fb700 10 mds.0.locker simple_eval (isnap sync) on [inode 0x10004cec485 [...2,head] ~mds0/stray5/10004cec485/ auth v13769303 ap=1+0 dirtyparent f() n(v0 1=0+1) (inest lock) (ifile excl) (iversion lock) caps={27077199=p/-@0},l=27077199 | request=1 lock=0 caps=1 dirtyparent=1 dirty=1 authpin=1 0x7ff2d220b200] 0> 2017-12-04 17:59:45.514015 7ff4bb7fb700 -1 /home/xuxuehan/rpmbuild/BUILD/ceph-12.2.1-26-g9915a2f/src/mds/SimpleLock.h: In function 'int SimpleLock::put_rdlock()' thread 7ff4bb7fb700 time 2017-12-04 17:59:45.509128 /home/xuxuehan/rpmbuild/BUILD/ceph-12.2.1-26-g9915a2f/src/mds/SimpleLock.h: 450: FAILED assert(num_rdlock>0) ceph version 12.2.1-26-g9915a2f (9915a2f600788c37fb42f39a766ce93b9ae4d816) luminous (stable) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x7ff4d334c940] 2: (()+0x295722) [0x7ff4d300d722] 3: (Locker::rdlock_finish(SimpleLock*, MutationImpl*, bool*)+0x23c) [0x7ff4d31b2d2c] 4: (Locker::_drop_rdlocks(MutationImpl*, std::set<CInode*, std::less<CInode*>, std::allocator<CInode*> >*)+0x64) [0x7ff4d31b2dc4] 5: (Locker::drop_locks(MutationImpl*, std::set<CInode*, std::less<CInode*>, std::allocator<CInode*> >*)+0x84) [0x7ff4d31b6e84] 6: (MDCache::request_cleanup(boost::intrusive_ptr<MDRequestImpl>&)+0x11a) [0x7ff4d310528a] 7: (MDCache::request_finish(boost::intrusive_ptr<MDRequestImpl>&)+0x17b) [0x7ff4d31056eb] 8: (Server::reply_client_request(boost::intrusive_ptr<MDRequestImpl>&, MClientReply*)+0x4e4) [0x7ff4d30816f4] 9: (Server::respond_to_request(boost::intrusive_ptr<MDRequestImpl>&, int)+0x459) [0x7ff4d3082279] 10: (Server::_unlink_local_finish(boost::intrusive_ptr<MDRequestImpl>&, CDentry*, CDentry*, unsigned long)+0x2ae) [0x7ff4d308ca7e]
Updated by Xuehan Xu over 6 years ago
I just submitted a PR for this: https://github.com/ceph/ceph/pull/19442
Updated by John Spray over 6 years ago
- Project changed from Ceph to CephFS
- Subject changed from mds SimpleLock::num_rdlock overloaded to mds: SimpleLock::num_rdlock overloaded
- Category set to Correctness/Safety
- Status changed from New to Fix Under Review
Updated by Patrick Donnelly over 6 years ago
- Subject changed from mds: SimpleLock::num_rdlock overloaded to luminous: mds: SimpleLock::num_rdlock overloaded
- Target version set to v12.2.3
This is also fixed in master already by one of Zheng's commits. We need to link to the commit in master where this is fixed even if the commits are not the same.
Updated by Nathan Cutler over 6 years ago
- Is duplicate of Bug #21843: mds: preserve order of requests during recovery of multimds cluster added
Updated by Nathan Cutler over 6 years ago
- Status changed from Fix Under Review to Duplicate