Bug #11541

MDS is crashed (mds/CDir.cc: 1391: FAILED assert(!is_complete()))

Added by Andrey Matyashov over 6 years ago. Updated over 6 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport: hammer
Regression: Yes
Severity: 1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite: upgrade/hammer
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

After updating from 0.82 to 0.94.1, the MDS crashed:

root@virt-master:~# /usr/bin/ceph-mds -i virt-master --debug_ms 9 --debug_mds 9 --pid-file /var/run/ceph/mds.virt-master.pid -c /etc/ceph/ceph.conf --cluster ceph -f > mds.virt-master.log 

2015-05-06 13:13:07.076817 7f9e13a85780 -1 mds.-1.0 log_to_monitors {default=true}
mds/CDir.cc: In function 'void CDir::fetch(MDSInternalContextBase*, const string&, bool)' thread 7f9e0a48f700 time 2015-05-06 13:13:11.230459
mds/CDir.cc: 1391: FAILED assert(!is_complete())
 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x72) [0xa38d72]
 2: (CDir::fetch(MDSInternalContextBase*, std::string const&, bool)+0x7fc) [0x86434c]
 3: (CDir::fetch(MDSInternalContextBase*, bool)+0x2b) [0x8643db]
 4: (MDCache::open_undef_inodes_dirfrags()+0x333) [0x74aec3]
 5: (MDCache::rejoin_gather_finish()+0x6c) [0x79745c]
 6: (MDSInternalContextBase::complete(int)+0x15b) [0x8be9bb]
 7: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::delete_me()+0x16) [0x6ba0b6]
 8: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::sub_finish(MDSInternalContextBase*, int)+0x265) [0x6c0bd5]
 9: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::C_GatherSub::finish(int)+0x12) [0x6c0ce2]
 10: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::C_GatherSub::complete(int)+0x9) [0x6afc59]
 11: (MDS::_advance_queues()+0x112) [0x69e642]
 12: (MDS::ProgressThread::entry()+0x4a) [0x69f12a]
 13: (()+0x6b50) [0x7f9e13410b50]
 14: (clone()+0x6d) [0x7f9e1203095d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2015-05-06 13:13:11.232253 7f9e0a48f700 -1 mds/CDir.cc: In function 'void CDir::fetch(MDSInternalContextBase*, const string&, bool)' thread 7f9e0a48f700 time 2015-05-06 13:13:11.230459
mds/CDir.cc: 1391: FAILED assert(!is_complete())

 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x72) [0xa38d72]
 2: (CDir::fetch(MDSInternalContextBase*, std::string const&, bool)+0x7fc) [0x86434c]
 3: (CDir::fetch(MDSInternalContextBase*, bool)+0x2b) [0x8643db]
 4: (MDCache::open_undef_inodes_dirfrags()+0x333) [0x74aec3]
 5: (MDCache::rejoin_gather_finish()+0x6c) [0x79745c]
 6: (MDSInternalContextBase::complete(int)+0x15b) [0x8be9bb]
 7: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::delete_me()+0x16) [0x6ba0b6]
 8: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::sub_finish(MDSInternalContextBase*, int)+0x265) [0x6c0bd5]
 9: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::C_GatherSub::finish(int)+0x12) [0x6c0ce2]
 10: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::C_GatherSub::complete(int)+0x9) [0x6afc59]
 11: (MDS::_advance_queues()+0x112) [0x69e642]
 12: (MDS::ProgressThread::entry()+0x4a) [0x69f12a]
 13: (()+0x6b50) [0x7f9e13410b50]
 14: (clone()+0x6d) [0x7f9e1203095d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

     0> 2015-05-06 13:13:11.232253 7f9e0a48f700 -1 mds/CDir.cc: In function 'void CDir::fetch(MDSInternalContextBase*, const string&, bool)' thread 7f9e0a48f700 time 2015-05-06 13:13:11.230459
mds/CDir.cc: 1391: FAILED assert(!is_complete())

 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x72) [0xa38d72]
 2: (CDir::fetch(MDSInternalContextBase*, std::string const&, bool)+0x7fc) [0x86434c]
 3: (CDir::fetch(MDSInternalContextBase*, bool)+0x2b) [0x8643db]
 4: (MDCache::open_undef_inodes_dirfrags()+0x333) [0x74aec3]
 5: (MDCache::rejoin_gather_finish()+0x6c) [0x79745c]
 6: (MDSInternalContextBase::complete(int)+0x15b) [0x8be9bb]
 7: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::delete_me()+0x16) [0x6ba0b6]
 8: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::sub_finish(MDSInternalContextBase*, int)+0x265) [0x6c0bd5]
 9: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::C_GatherSub::finish(int)+0x12) [0x6c0ce2]
 10: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::C_GatherSub::complete(int)+0x9) [0x6afc59]
 11: (MDS::_advance_queues()+0x112) [0x69e642]
 12: (MDS::ProgressThread::entry()+0x4a) [0x69f12a]
 13: (()+0x6b50) [0x7f9e13410b50]
 14: (clone()+0x6d) [0x7f9e1203095d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

terminate called after throwing an instance of 'ceph::FailedAssertion'
*** Caught signal (Aborted) **
 in thread 7f9e0a48f700
 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: /usr/bin/ceph-mds() [0x9556ec]
 2: (()+0xf0a0) [0x7f9e134190a0]
 3: (gsignal()+0x35) [0x7f9e11f87165]
 4: (abort()+0x180) [0x7f9e11f8a3e0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f9e127dd89d]
 6: (()+0x63996) [0x7f9e127db996]
 7: (()+0x639c3) [0x7f9e127db9c3]
 8: (()+0x63bee) [0x7f9e127dbbee]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x220) [0xa38f20]
 10: (CDir::fetch(MDSInternalContextBase*, std::string const&, bool)+0x7fc) [0x86434c]
 11: (CDir::fetch(MDSInternalContextBase*, bool)+0x2b) [0x8643db]
 12: (MDCache::open_undef_inodes_dirfrags()+0x333) [0x74aec3]
 13: (MDCache::rejoin_gather_finish()+0x6c) [0x79745c]
 14: (MDSInternalContextBase::complete(int)+0x15b) [0x8be9bb]
 15: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::delete_me()+0x16) [0x6ba0b6]
 16: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::sub_finish(MDSInternalContextBase*, int)+0x265) [0x6c0bd5]
 17: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::C_GatherSub::finish(int)+0x12) [0x6c0ce2]
 18: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::C_GatherSub::complete(int)+0x9) [0x6afc59]
 19: (MDS::_advance_queues()+0x112) [0x69e642]
 20: (MDS::ProgressThread::entry()+0x4a) [0x69f12a]
 21: (()+0x6b50) [0x7f9e13410b50]
 22: (clone()+0x6d) [0x7f9e1203095d]
2015-05-06 13:13:11.294984 7f9e0a48f700 -1 *** Caught signal (Aborted) **
 in thread 7f9e0a48f700

 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: /usr/bin/ceph-mds() [0x9556ec]
 2: (()+0xf0a0) [0x7f9e134190a0]
 3: (gsignal()+0x35) [0x7f9e11f87165]
 4: (abort()+0x180) [0x7f9e11f8a3e0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f9e127dd89d]
 6: (()+0x63996) [0x7f9e127db996]
 7: (()+0x639c3) [0x7f9e127db9c3]
 8: (()+0x63bee) [0x7f9e127dbbee]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x220) [0xa38f20]
 10: (CDir::fetch(MDSInternalContextBase*, std::string const&, bool)+0x7fc) [0x86434c]
 11: (CDir::fetch(MDSInternalContextBase*, bool)+0x2b) [0x8643db]
 12: (MDCache::open_undef_inodes_dirfrags()+0x333) [0x74aec3]
 13: (MDCache::rejoin_gather_finish()+0x6c) [0x79745c]
 14: (MDSInternalContextBase::complete(int)+0x15b) [0x8be9bb]
 15: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::delete_me()+0x16) [0x6ba0b6]
 16: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::sub_finish(MDSInternalContextBase*, int)+0x265) [0x6c0bd5]
 17: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::C_GatherSub::finish(int)+0x12) [0x6c0ce2]
 18: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::C_GatherSub::complete(int)+0x9) [0x6afc59]
 19: (MDS::_advance_queues()+0x112) [0x69e642]
 20: (MDS::ProgressThread::entry()+0x4a) [0x69f12a]
 21: (()+0x6b50) [0x7f9e13410b50]
 22: (clone()+0x6d) [0x7f9e1203095d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

     0> 2015-05-06 13:13:11.294984 7f9e0a48f700 -1 *** Caught signal (Aborted) **
 in thread 7f9e0a48f700

 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: /usr/bin/ceph-mds() [0x9556ec]
 2: (()+0xf0a0) [0x7f9e134190a0]
 3: (gsignal()+0x35) [0x7f9e11f87165]
 4: (abort()+0x180) [0x7f9e11f8a3e0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f9e127dd89d]
 6: (()+0x63996) [0x7f9e127db996]
 7: (()+0x639c3) [0x7f9e127db9c3]
 8: (()+0x63bee) [0x7f9e127dbbee]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x220) [0xa38f20]
 10: (CDir::fetch(MDSInternalContextBase*, std::string const&, bool)+0x7fc) [0x86434c]
 11: (CDir::fetch(MDSInternalContextBase*, bool)+0x2b) [0x8643db]
 12: (MDCache::open_undef_inodes_dirfrags()+0x333) [0x74aec3]
 13: (MDCache::rejoin_gather_finish()+0x6c) [0x79745c]
 14: (MDSInternalContextBase::complete(int)+0x15b) [0x8be9bb]
 15: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::delete_me()+0x16) [0x6ba0b6]
 16: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::sub_finish(MDSInternalContextBase*, int)+0x265) [0x6c0bd5]
 17: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::C_GatherSub::finish(int)+0x12) [0x6c0ce2]
 18: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::C_GatherSub::complete(int)+0x9) [0x6afc59]
 19: (MDS::_advance_queues()+0x112) [0x69e642]
 20: (MDS::ProgressThread::entry()+0x4a) [0x69f12a]
 21: (()+0x6b50) [0x7f9e13410b50]
 22: (clone()+0x6d) [0x7f9e1203095d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Aborted

See the attached log file for more information (start command: /usr/bin/ceph-mds -i virt-master --debug_ms 9 --debug_mds 9 -d --pid-file /var/run/ceph/mds.virt-master.pid -c /etc/ceph/ceph.conf --cluster ceph > mds.virt-master.log 2>&1 ).

mds.virt-master.log.bz2 (319 KB) Andrey Matyashov, 05/06/2015 10:28 AM


Related issues

Copied to CephFS - Backport #11737: MDS is crashed (mds/CDir.cc: 1391: FAILED assert(!is_complete())) Resolved

Associated revisions

Revision ab1e5394 (diff)
Added by Yan, Zheng over 6 years ago

mds: clear CDir::STATE_REJOINUNDEF after fetching dirfrag

Fixes: #11541
Signed-off-by: Yan, Zheng <>

Revision d723e115 (diff)
Added by Yan, Zheng over 6 years ago

mds: clear CDir::STATE_REJOINUNDEF after fetching dirfrag

Fixes: #11541
Signed-off-by: Yan, Zheng <>
(cherry picked from commit ab1e5394dc778f6799472bd79a4d9ba7197107c2)

History

#1 Updated by John Spray over 6 years ago

This may be due to the following change:

commit 818a80736c6b76c031f56708d03c263289686d51
Author: Yan, Zheng <zyan@redhat.com>
Date:   Wed Dec 3 15:32:33 2014 +0800

    mds: drop dirty dentries in deleted directory

    opened dirfrags and null dirty dentries in deleted directory inode
    prevent MDCache::eval_stray() from purging the delete inode.

    It's safe to not commit null dirty dentries in deleted directory to
    corresponding dirfrag objects, because these dirfrag objects will be
    deleted soon.

    Fixes: #10164
    Signed-off-by: Yan, Zheng <zyan@redhat.com>

#2 Updated by John Spray over 6 years ago

The contradiction here seems to be that our code expects any unlinked directory (i.e. one in a stray directory) to have no entries, but according to the contents of the cache there are entries. As a random example, take:

dentry [dentry #100/stray4/10000093434/ActionTest.php [2,head] auth NULL (dversion lock) v=83 inode=0 | dirty=1 0x5239640]

I guess we need something during replay or rejoin to apply the try_remove_dentries_for_stray logic to the loaded directories, and clean out any of these dirty dentries before going further.

#3 Updated by John Spray over 6 years ago

Created the wip-11541-hammer-workaround branch. Andrey: once it's built, you should be able to find packages on http://ceph.com/gitbuilder.cgi that you can try out (ask on IRC if unsure).

#4 Updated by Andrey Matyashov over 6 years ago

Hi, I restarted the MDS daemons many times, and it started normally.

#5 Updated by Zheng Yan over 6 years ago

I think the fix should be:


diff --git a/src/mds/CDir.cc b/src/mds/CDir.cc
index 211b8b0..23a2ff9 100644
--- a/src/mds/CDir.cc
+++ b/src/mds/CDir.cc
@@ -1414,9 +1414,17 @@ void CDir::fetch(MDSInternalContextBase *c, const string& want_dn, bool ignore_a
   // unlinked directory inode shouldn't have any entry
   if (inode->inode.nlink == 0 && !inode->snaprealm) {
     dout(7) << "fetch dirfrag for unlinked directory, mark complete" << dendl;
-    if (get_version() == 0)
+    if (get_version() == 0) {
       set_version(1);
+
+      if (state_test(STATE_REJOINUNDEF)) {
+       assert(cache->mds->is_rejoin());
+       state_clear(STATE_REJOINUNDEF);
+       cache->opened_undef_dirfrag(this);
+      }
+    }
     mark_complete();
+
     if (c)
       cache->mds->queue_waiter(c);
     return;

#6 Updated by Greg Farnum over 6 years ago

  • Assignee set to Zheng Yan

Put it in a PR, please?

#7 Updated by John Spray over 6 years ago

  • Status changed from New to Pending Backport
  • Backport set to hammer

Needs backport to hammer as that's where the issue appeared.

#8 Updated by Zheng Yan over 6 years ago

This bug happens only in the multi-MDS case. Do we need to backport multi-MDS fixes?

#9 Updated by Greg Farnum over 6 years ago

Generally speaking no, but unless it's difficult to backport we might as well do so for things people have actually hit outside the lab.

#10 Updated by Loïc Dachary over 6 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (1)
  • Target version deleted (v0.94)
  • Affected Versions deleted (v0.94)

#11 Updated by Zheng Yan over 6 years ago

  • Status changed from Pending Backport to Resolved
