Bug #11541
closed
MDS is crashed (mds/CDir.cc: 1391: FAILED assert(!is_complete()))
Description
After updating from 0.82 to 0.94.1, the MDS crashed:
root@virt-master:~# /usr/bin/ceph-mds -i virt-master --debug_ms 9 --debug_mds 9 --pid-file /var/run/ceph/mds.virt-master.pid -c /etc/ceph/ceph.conf --cluster ceph -f > mds.virt-master.log
2015-05-06 13:13:07.076817 7f9e13a85780 -1 mds.-1.0 log_to_monitors {default=true}
mds/CDir.cc: In function 'void CDir::fetch(MDSInternalContextBase*, const string&, bool)' thread 7f9e0a48f700 time 2015-05-06 13:13:11.230459
mds/CDir.cc: 1391: FAILED assert(!is_complete())
 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x72) [0xa38d72]
 2: (CDir::fetch(MDSInternalContextBase*, std::string const&, bool)+0x7fc) [0x86434c]
 3: (CDir::fetch(MDSInternalContextBase*, bool)+0x2b) [0x8643db]
 4: (MDCache::open_undef_inodes_dirfrags()+0x333) [0x74aec3]
 5: (MDCache::rejoin_gather_finish()+0x6c) [0x79745c]
 6: (MDSInternalContextBase::complete(int)+0x15b) [0x8be9bb]
 7: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::delete_me()+0x16) [0x6ba0b6]
 8: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::sub_finish(MDSInternalContextBase*, int)+0x265) [0x6c0bd5]
 9: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::C_GatherSub::finish(int)+0x12) [0x6c0ce2]
 10: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::C_GatherSub::complete(int)+0x9) [0x6afc59]
 11: (MDS::_advance_queues()+0x112) [0x69e642]
 12: (MDS::ProgressThread::entry()+0x4a) [0x69f12a]
 13: (()+0x6b50) [0x7f9e13410b50]
 14: (clone()+0x6d) [0x7f9e1203095d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2015-05-06 13:13:11.232253 7f9e0a48f700 -1 mds/CDir.cc: In function 'void CDir::fetch(MDSInternalContextBase*, const string&, bool)' thread 7f9e0a48f700 time 2015-05-06 13:13:11.230459
mds/CDir.cc: 1391: FAILED assert(!is_complete())
 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x72) [0xa38d72]
 2: (CDir::fetch(MDSInternalContextBase*, std::string const&, bool)+0x7fc) [0x86434c]
 3: (CDir::fetch(MDSInternalContextBase*, bool)+0x2b) [0x8643db]
 4: (MDCache::open_undef_inodes_dirfrags()+0x333) [0x74aec3]
 5: (MDCache::rejoin_gather_finish()+0x6c) [0x79745c]
 6: (MDSInternalContextBase::complete(int)+0x15b) [0x8be9bb]
 7: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::delete_me()+0x16) [0x6ba0b6]
 8: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::sub_finish(MDSInternalContextBase*, int)+0x265) [0x6c0bd5]
 9: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::C_GatherSub::finish(int)+0x12) [0x6c0ce2]
 10: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::C_GatherSub::complete(int)+0x9) [0x6afc59]
 11: (MDS::_advance_queues()+0x112) [0x69e642]
 12: (MDS::ProgressThread::entry()+0x4a) [0x69f12a]
 13: (()+0x6b50) [0x7f9e13410b50]
 14: (clone()+0x6d) [0x7f9e1203095d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
0> 2015-05-06 13:13:11.232253 7f9e0a48f700 -1 mds/CDir.cc: In function 'void CDir::fetch(MDSInternalContextBase*, const string&, bool)' thread 7f9e0a48f700 time 2015-05-06 13:13:11.230459
mds/CDir.cc: 1391: FAILED assert(!is_complete())
 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x72) [0xa38d72]
 2: (CDir::fetch(MDSInternalContextBase*, std::string const&, bool)+0x7fc) [0x86434c]
 3: (CDir::fetch(MDSInternalContextBase*, bool)+0x2b) [0x8643db]
 4: (MDCache::open_undef_inodes_dirfrags()+0x333) [0x74aec3]
 5: (MDCache::rejoin_gather_finish()+0x6c) [0x79745c]
 6: (MDSInternalContextBase::complete(int)+0x15b) [0x8be9bb]
 7: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::delete_me()+0x16) [0x6ba0b6]
 8: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::sub_finish(MDSInternalContextBase*, int)+0x265) [0x6c0bd5]
 9: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::C_GatherSub::finish(int)+0x12) [0x6c0ce2]
 10: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::C_GatherSub::complete(int)+0x9) [0x6afc59]
 11: (MDS::_advance_queues()+0x112) [0x69e642]
 12: (MDS::ProgressThread::entry()+0x4a) [0x69f12a]
 13: (()+0x6b50) [0x7f9e13410b50]
 14: (clone()+0x6d) [0x7f9e1203095d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
terminate called after throwing an instance of 'ceph::FailedAssertion'
*** Caught signal (Aborted) **
 in thread 7f9e0a48f700
 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: /usr/bin/ceph-mds() [0x9556ec]
 2: (()+0xf0a0) [0x7f9e134190a0]
 3: (gsignal()+0x35) [0x7f9e11f87165]
 4: (abort()+0x180) [0x7f9e11f8a3e0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f9e127dd89d]
 6: (()+0x63996) [0x7f9e127db996]
 7: (()+0x639c3) [0x7f9e127db9c3]
 8: (()+0x63bee) [0x7f9e127dbbee]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x220) [0xa38f20]
 10: (CDir::fetch(MDSInternalContextBase*, std::string const&, bool)+0x7fc) [0x86434c]
 11: (CDir::fetch(MDSInternalContextBase*, bool)+0x2b) [0x8643db]
 12: (MDCache::open_undef_inodes_dirfrags()+0x333) [0x74aec3]
 13: (MDCache::rejoin_gather_finish()+0x6c) [0x79745c]
 14: (MDSInternalContextBase::complete(int)+0x15b) [0x8be9bb]
 15: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::delete_me()+0x16) [0x6ba0b6]
 16: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::sub_finish(MDSInternalContextBase*, int)+0x265) [0x6c0bd5]
 17: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::C_GatherSub::finish(int)+0x12) [0x6c0ce2]
 18: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::C_GatherSub::complete(int)+0x9) [0x6afc59]
 19: (MDS::_advance_queues()+0x112) [0x69e642]
 20: (MDS::ProgressThread::entry()+0x4a) [0x69f12a]
 21: (()+0x6b50) [0x7f9e13410b50]
 22: (clone()+0x6d) [0x7f9e1203095d]
2015-05-06 13:13:11.294984 7f9e0a48f700 -1 *** Caught signal (Aborted) **
 in thread 7f9e0a48f700
 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: /usr/bin/ceph-mds() [0x9556ec]
 2: (()+0xf0a0) [0x7f9e134190a0]
 3: (gsignal()+0x35) [0x7f9e11f87165]
 4: (abort()+0x180) [0x7f9e11f8a3e0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f9e127dd89d]
 6: (()+0x63996) [0x7f9e127db996]
 7: (()+0x639c3) [0x7f9e127db9c3]
 8: (()+0x63bee) [0x7f9e127dbbee]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x220) [0xa38f20]
 10: (CDir::fetch(MDSInternalContextBase*, std::string const&, bool)+0x7fc) [0x86434c]
 11: (CDir::fetch(MDSInternalContextBase*, bool)+0x2b) [0x8643db]
 12: (MDCache::open_undef_inodes_dirfrags()+0x333) [0x74aec3]
 13: (MDCache::rejoin_gather_finish()+0x6c) [0x79745c]
 14: (MDSInternalContextBase::complete(int)+0x15b) [0x8be9bb]
 15: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::delete_me()+0x16) [0x6ba0b6]
 16: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::sub_finish(MDSInternalContextBase*, int)+0x265) [0x6c0bd5]
 17: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::C_GatherSub::finish(int)+0x12) [0x6c0ce2]
 18: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::C_GatherSub::complete(int)+0x9) [0x6afc59]
 19: (MDS::_advance_queues()+0x112) [0x69e642]
 20: (MDS::ProgressThread::entry()+0x4a) [0x69f12a]
 21: (()+0x6b50) [0x7f9e13410b50]
 22: (clone()+0x6d) [0x7f9e1203095d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
0> 2015-05-06 13:13:11.294984 7f9e0a48f700 -1 *** Caught signal (Aborted) **
 in thread 7f9e0a48f700
 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: /usr/bin/ceph-mds() [0x9556ec]
 2: (()+0xf0a0) [0x7f9e134190a0]
 3: (gsignal()+0x35) [0x7f9e11f87165]
 4: (abort()+0x180) [0x7f9e11f8a3e0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f9e127dd89d]
 6: (()+0x63996) [0x7f9e127db996]
 7: (()+0x639c3) [0x7f9e127db9c3]
 8: (()+0x63bee) [0x7f9e127dbbee]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x220) [0xa38f20]
 10: (CDir::fetch(MDSInternalContextBase*, std::string const&, bool)+0x7fc) [0x86434c]
 11: (CDir::fetch(MDSInternalContextBase*, bool)+0x2b) [0x8643db]
 12: (MDCache::open_undef_inodes_dirfrags()+0x333) [0x74aec3]
 13: (MDCache::rejoin_gather_finish()+0x6c) [0x79745c]
 14: (MDSInternalContextBase::complete(int)+0x15b) [0x8be9bb]
 15: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::delete_me()+0x16) [0x6ba0b6]
 16: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::sub_finish(MDSInternalContextBase*, int)+0x265) [0x6c0bd5]
 17: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::C_GatherSub::finish(int)+0x12) [0x6c0ce2]
 18: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::C_GatherSub::complete(int)+0x9) [0x6afc59]
 19: (MDS::_advance_queues()+0x112) [0x69e642]
 20: (MDS::ProgressThread::entry()+0x4a) [0x69f12a]
 21: (()+0x6b50) [0x7f9e13410b50]
 22: (clone()+0x6d) [0x7f9e1203095d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Aborted
See more info in attached logfile. (start cmd: /usr/bin/ceph-mds -i virt-master --debug_ms 9 --debug_mds 9 -d --pid-file /var/run/ceph/mds.virt-master.pid -c /etc/ceph/ceph.conf --cluster ceph > mds.virt-master.log 2>&1 )
Files
Updated by John Spray almost 9 years ago
This may be due to the following change:
commit 818a80736c6b76c031f56708d03c263289686d51
Author: Yan, Zheng <zyan@redhat.com>
Date:   Wed Dec 3 15:32:33 2014 +0800

    mds: drop dirty dentries in deleted directory

    opened dirfrags and null dirty dentries in deleted directory inode
    prevent MDCache::eval_stray() from purging the delete inode. It's safe
    to not commit null dirty dentries in deleted directory to corresponding
    dirfrag objects, because these dirfrag objects will be deleted soon.

    Fixes: #10164
    Signed-off-by: Yan, Zheng <zyan@redhat.com>
Updated by John Spray almost 9 years ago
The contradiction here seems to be that our code wants any unlinked directory (i.e. one in a stray directory) to have no entries, but according to the contents of the cache there are entries. As a random example, take:
dentry [dentry #100/stray4/10000093434/ActionTest.php [2,head] auth NULL (dversion lock) v=83 inode=0 | dirty=1 0x5239640]
I guess we need something during replay or rejoin to apply the try_remove_dentries_for_stray logic to the loaded directories, and clean out any of these dirty dentries before going further.
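To make the proposed cleanup concrete, here is a minimal toy sketch of "clean out the null dirty dentries of an unlinked directory". `ToyDentry`, `ToyDir` and `drop_null_dirty_dentries` are hypothetical stand-ins invented for illustration; they are not the real `CDir`/`CDentry` classes and this is not how `try_remove_dentries_for_stray` is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical, simplified stand-ins for the MDS cache types (not Ceph code).
struct ToyDentry {
  bool is_null = false;   // no inode linked (cf. "NULL ... inode=0" in the dump)
  bool is_dirty = false;  // uncommitted change (cf. "dirty=1" in the dump)
};

struct ToyDir {
  unsigned nlink = 1;                      // link count of the directory inode
  bool has_snaprealm = false;
  std::map<std::string, ToyDentry> items;  // dentries keyed by name

  // Same condition the fetch path checks: nlink == 0 and no snaprealm.
  bool is_unlinked() const { return nlink == 0 && !has_snaprealm; }
};

// Drop null dirty dentries from an unlinked directory.
// Returns the number of dentries removed; linked directories are left alone.
std::size_t drop_null_dirty_dentries(ToyDir& dir) {
  if (!dir.is_unlinked()) return 0;
  std::size_t removed = 0;
  for (auto it = dir.items.begin(); it != dir.items.end();) {
    if (it->second.is_null && it->second.is_dirty) {
      it = dir.items.erase(it);  // erase returns the next valid iterator
      ++removed;
    } else {
      ++it;
    }
  }
  return removed;
}

// Usage: the dentry from the dump above, modeled in the toy types.
inline void example_cleanup() {
  ToyDir stray_dir;
  stray_dir.nlink = 0;
  ToyDentry dn;
  dn.is_null = true;
  dn.is_dirty = true;
  stray_dir.items["ActionTest.php"] = dn;
  assert(drop_null_dirty_dentries(stray_dir) == 1);
  assert(stray_dir.items.empty());
}
```

After such a pass the unlinked directory holds no entries, which is the invariant the code above expects before going further.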
Updated by John Spray almost 9 years ago
Created wip-11541-hammer-workaround branch. Andrey: once it's built you should be able to find some packages on http://ceph.com/gitbuilder.cgi that you can try out (ask on IRC if unsure).
Updated by Andrey Matyashov almost 9 years ago
Hi, I restarted the MDS daemons many times, and it started normally.
Updated by Zheng Yan almost 9 years ago
I think the fix should be:
diff --git a/src/mds/CDir.cc b/src/mds/CDir.cc
index 211b8b0..23a2ff9 100644
--- a/src/mds/CDir.cc
+++ b/src/mds/CDir.cc
@@ -1414,9 +1414,17 @@ void CDir::fetch(MDSInternalContextBase *c, const string& want_dn, bool ignore_a
   // unlinked directory inode shouldn't have any entry
   if (inode->inode.nlink == 0 && !inode->snaprealm) {
     dout(7) << "fetch dirfrag for unlinked directory, mark complete" << dendl;
-    if (get_version() == 0)
+    if (get_version() == 0) {
       set_version(1);
+
+      if (state_test(STATE_REJOINUNDEF)) {
+        assert(cache->mds->is_rejoin());
+        state_clear(STATE_REJOINUNDEF);
+        cache->opened_undef_dirfrag(this);
+      }
+    }
     mark_complete();
+
     if (c)
       cache->mds->queue_waiter(c);
     return;
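The control flow of this patch can be illustrated with a small toy model. This is only a sketch under assumptions: `ToyCache`, `ToyDirfrag` and `fetch_unlinked_shortcut` are hypothetical names, the set of pending dirfrags is an invented simplification, and the reading that a dirfrag left in the rejoin-undef state is what leads to the failed `assert(!is_complete())` is an interpretation of the backtrace, not something the thread states outright.

```cpp
#include <cassert>
#include <set>

// Hypothetical toy model; names echo the patch but this is not Ceph code.
struct ToyCache {
  // Dirfrags that rejoin still expects to be opened (assumed bookkeeping).
  std::set<const void*> rejoin_undef_dirfrags;
  void opened_undef_dirfrag(const void* dir) { rejoin_undef_dirfrags.erase(dir); }
};

struct ToyDirfrag {
  unsigned nlink = 0;          // 0 models an unlinked (stray) directory
  bool has_snaprealm = false;
  unsigned version = 0;
  bool complete = false;
  bool rejoin_undef = false;   // models STATE_REJOINUNDEF
  ToyCache* cache = nullptr;

  // Models the early-return branch of fetch() for unlinked directories.
  // Returns true when the dirfrag was handled without reading from disk.
  bool fetch_unlinked_shortcut() {
    if (nlink != 0 || has_snaprealm) return false;
    if (version == 0) {
      version = 1;
      if (rejoin_undef) {
        // The part the patch adds: clear the flag and tell the cache,
        // so rejoin no longer waits on (or re-fetches) this dirfrag.
        assert(cache != nullptr);
        rejoin_undef = false;
        cache->opened_undef_dirfrag(this);
      }
    }
    complete = true;
    return true;
  }
};

// Usage: an unlinked dirfrag that rejoin still lists as undefined.
inline void example_fetch() {
  ToyCache cache;
  ToyDirfrag dir;
  dir.cache = &cache;
  dir.rejoin_undef = true;
  cache.rejoin_undef_dirfrags.insert(&dir);
  assert(dir.fetch_unlinked_shortcut());
  assert(dir.complete && !dir.rejoin_undef);
  assert(cache.rejoin_undef_dirfrags.empty());
}
```

In this model, once the shortcut has run the dirfrag is both complete and removed from the pending set, so nothing is left to fetch it a second time.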
Updated by John Spray almost 9 years ago
- Status changed from New to Pending Backport
- Backport set to hammer
Needs backport to hammer as that's where the issue appeared.
Updated by Zheng Yan almost 9 years ago
This bug happens only in the multi-MDS case. Do we need to backport multi-MDS fixes?
Updated by Greg Farnum almost 9 years ago
Generally speaking no, but unless it's difficult to backport we might as well do so for things people have actually hit outside the lab.
Updated by Loïc Dachary almost 9 years ago
- Project changed from Ceph to CephFS
- Category deleted (1)
- Target version deleted (v0.94)
- Affected Versions deleted (v0.94)
Updated by Zheng Yan almost 9 years ago
- Status changed from Pending Backport to Resolved