Bug #385
closed
Failed assertion in Locker::scatter_nudge
Description
I updated issue #312, but Gregory told me that this crash is a different issue.
19:47 < gregaf> wido: your recent MDS crash is actually a different issue from #312, involving the distributed lock manager
19:48 < gregaf> are your MDSes just refusing to come up now, or is your cluster working again?
19:50 < gregaf> and what version of the code were you running when it crashed the first time?
The last log lines:
10.08.27_08:33:54.023625 7f33ea334710 mds0.journal try_to_expire waiting for nest flush on [inode 10000058e7b [...2,head] /static/kernel/linux/kernel/people/lenb/acpi/ auth v358 f(v5 m10.08.06_21:49:46.000183 3=0+3) n(v47 rc10.08.09_13:39:19.000312 b75920353 3260=3160+100) (inest sync dirty) (ifile sync dirty) (iversion lock) | dirtyscattered dirfrag dirty 0x7631e40]
10.08.27_08:33:54.023664 7f33ea334710 mds0.locker scatter_nudge auth, scatter/unscattering (inest sync dirty) on [inode 10000058e7b [...2,head] /static/kernel/linux/kernel/people/lenb/acpi/ auth v358 f(v5 m10.08.06_21:49:46.000183 3=0+3) n(v47 rc10.08.09_13:39:19.000312 b75920353 3260=3160+100) (inest sync dirty) (ifile sync dirty) (iversion lock) | dirtyscattered dirfrag dirty 0x7631e40]
10.08.27_08:33:54.023690 7f33ea334710 mds0.locker simple_lock on (inest sync dirty) on [inode 10000058e7b [...2,head] /static/kernel/linux/kernel/people/lenb/acpi/ auth v358 f(v5 m10.08.06_21:49:46.000183 3=0+3) n(v47 rc10.08.09_13:39:19.000312 b75920353 3260=3160+100) (inest sync dirty) (ifile sync dirty) (iversion lock) | dirtyscattered dirfrag dirty 0x7631e40]
10.08.27_08:33:54.023716 7f33ea334710 mds0.locker scatter_nudge oh, stable again already.
mds/Locker.cc: In function 'void Locker::scatter_nudge(ScatterLock*, Context*, bool)':
mds/Locker.cc:3290: FAILED assert(!c)
 1: (LogSegment::try_to_expire(MDS*)+0x10f0) [0x636770]
 2: (MDLog::try_expire(LogSegment*)+0x1d) [0x62ec2d]
 3: (MDLog::trim(int)+0x628) [0x62f598]
 4: (MDS::tick()+0x552) [0x498372]
 5: (SafeTimer::EventWrapper::finish(int)+0x269) [0x6b27d9]
 6: (Timer::timer_entry()+0x7bc) [0x6b4bac]
 7: (Timer::TimerThread::entry()+0xd) [0x4777cd]
 8: (Thread::_entry_func(void*)+0xa) [0x48a73a]
 9: (()+0x69ca) [0x7f33edc9c9ca]
 10: (clone()+0x6d) [0x7f33ecc546fd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
The cores, binaries and logfiles are uploaded to logger.ceph.widodh.nl:/srv/ceph/issues/mds_crash_locker_scatter_nudge
The timestamps of all the files were preserved.
Updated by Sage Weil over 13 years ago
Wido, can you let me know if this works?
diff --git a/src/mds/CInode.h b/src/mds/CInode.h
index 77a768d..4cbcf05 100644
--- a/src/mds/CInode.h
+++ b/src/mds/CInode.h
@@ -701,7 +701,12 @@ public:
         lock->set_state(LOCK_EXCL);
       else if (issued & CEPH_CAP_GWR)
         lock->set_state(LOCK_MIX);
-      else
+      else if (lock->is_dirty()) {
+        if (is_replicated())
+          lock->set_state(LOCK_MIX);
+        else
+          lock->set_state(LOCK_LOCK);
+      } else
         lock->set_state(LOCK_SYNC);
     } else {
       if (lock->is_xlocked())
Updated by Wido den Hollander over 13 years ago
No, it doesn't.
I had to apply the patch manually. Please confirm that what I did is correct:
if (is_auth()) {
  if (issued & CEPH_CAP_GEXCL)
    lock->set_state(LOCK_EXCL);
  else if (issued & CEPH_CAP_GWR)
    lock->set_state(LOCK_MIX);
  else if (lock->is_dirty()) {
    if (is_replicated())
      lock->set_state(LOCK_MIX);
    else
      lock->set_state(LOCK_LOCK);
  } else
    lock->set_state(LOCK_SYNC);
} else {
  if (lock->is_xlocked())
    lock->set_state(LOCK_LOCK);
  else
    lock->set_state(LOCK_SYNC); // might have been lock, previously
}
The MDS crashed again. I placed the new core dump on logger.ceph.widodh.nl (core.cmds.node13.32754).
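For readers following the branch order in the hand-applied patch, the decision it makes can be sketched as a standalone function. This is only an illustration: `LockState`, `ReplayInputs`, `choose_state`, and the `CAP_*` bit values are hypothetical stand-ins, not the real Ceph types or constants, and the thread goes on to show that this change alone did not stop the crash.

```cpp
#include <cassert>

// Hypothetical stand-ins for the LOCK_* states named in the patch.
enum class LockState { Sync, Lock, Mix, Excl };

// Placeholder bit values; NOT the real CEPH_CAP_GEXCL / CEPH_CAP_GWR values.
constexpr unsigned CAP_GEXCL = 1u << 0;
constexpr unsigned CAP_GWR = 1u << 1;

// Hypothetical bundle of the conditions the patched code inspects.
struct ReplayInputs {
  bool auth;        // is_auth()
  unsigned issued;  // caps currently issued
  bool dirty;       // lock->is_dirty()
  bool replicated;  // is_replicated()
  bool xlocked;     // lock->is_xlocked()
};

// Mirrors the branch order of the patched code: issued caps win first,
// then a dirty lock avoids SYNC, and only a clean lock falls back to SYNC.
LockState choose_state(const ReplayInputs& in) {
  if (in.auth) {
    if (in.issued & CAP_GEXCL) return LockState::Excl;
    if (in.issued & CAP_GWR) return LockState::Mix;
    if (in.dirty) return in.replicated ? LockState::Mix : LockState::Lock;
    return LockState::Sync;
  }
  return in.xlocked ? LockState::Lock : LockState::Sync;
}
```

The new case is the dirty one: before the patch, an auth inode with no EXCL or WR caps issued always ended up in SYNC, even when the lock was still dirty.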
Updated by Sage Weil over 13 years ago
Hi Wido,
Sorry I don't have time to really focus on this (vacation this week), but I pushed something that may take care of it to the mds_replay_lock_states branch. Can you let me know if that does the trick?
commit:0857fecbea00092251d28bc2e7625fd65bea3953
Thanks-
Updated by Wido den Hollander over 13 years ago
I tried this branch today with no luck; both MDSes still crashed.
Uploaded two new core files to logger.ceph.widodh.nl:/srv/ceph/issues/mds_crash_locker_scatter_nudge
Updated by Sage Weil over 13 years ago
- Assignee set to Sage Weil
- Priority changed from Normal to Immediate
Updated by Sage Weil over 13 years ago
Ok this was a case of bad C++ method overloading (parent was const, child was not). Bah. Fixed by commit:ca048fb92c79cab0c0d0e6ee1cee11a037a20931.
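The kind of const mismatch described here can be reproduced in a few lines. The comment does not say which method was involved, so the sketch below uses hypothetical classes (`BaseLock`, `BrokenLock`, `FixedLock`) and a hypothetical method `describe()`; it shows the general C++ pitfall, not the actual Ceph code or the actual fix.

```cpp
#include <cassert>
#include <string>

// Hypothetical base class; names are illustrative, not Ceph's.
struct BaseLock {
  virtual ~BaseLock() = default;
  // The parent declares the virtual method as const.
  virtual std::string describe() const { return "base"; }
};

// The child omits const, so this is a different signature. It hides
// BaseLock::describe() const instead of overriding it.
struct BrokenLock : BaseLock {
  std::string describe() { return "child"; }
};

// Matching the const qualifier restores the override; the `override`
// keyword (C++11) makes the compiler reject a mismatched signature.
struct FixedLock : BaseLock {
  std::string describe() const override { return "child"; }
};

// Calls through a base reference, as generic code typically does.
std::string describe_via_base(const BaseLock& lock) {
  return lock.describe();
}

// Small self-check of the dispatch behaviour described above.
void check_dispatch() {
  BrokenLock broken;
  FixedLock fixed;
  assert(describe_via_base(broken) == "base");   // child version is skipped
  assert(describe_via_base(fixed) == "child");   // override is honoured
}
```

Calling `describe_via_base` on a `BrokenLock` silently runs the parent implementation, which is why this class of bug can survive review; compilers such as GCC and Clang can flag it with `-Woverloaded-virtual`.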
Updated by Sage Weil over 13 years ago
rebased to commit:86986925fc10cf1632df41997d929547866109c5
Updated by John Spray over 7 years ago
- Project changed from Ceph to CephFS
- Category deleted (1)
Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.