Bug #6791 (closed)
mds assert after startup - CDir::commit error (want > committed version)
Description
After upgrading from 0.67 to 0.72 I hit bug #6755.
I repaired the system with ceph_filestore_tool as described in bug #6761.
After that I started the system with wip-6761-emperor. All PGs are active+clean.
However, the mds asserts after startup with the error:
mds/CDir.cc: In function 'void CDir::commit(version_t, Context*, bool)' thread 7fb82cc64700 time 2013-11-16 22:35:56.495266
mds/CDir.cc: 1718: FAILED assert(want > committed_version)
My system runs on Ubuntu 13.04.
Updated by Maros Vegh over 10 years ago
At a higher log level I can see that this happens during "try_to_expire" on a journal LogSegment:
-4> 2013-11-17 11:02:05.053688 7feee2234700 10 mds.0.cache.ino(10002f3a3fd) clear_dirty_parent
-3> 2013-11-17 11:02:05.053693 7feee2234700 10 mds.0.log _maybe_expired segment 3213692504680 2387 events
-2> 2013-11-17 11:02:05.053697 7feee2234700 6 mds.0.journal LogSegment(3213692504680).try_to_expire
-1> 2013-11-17 11:02:05.053707 7feee2234700 10 mds.0.cache.dir(10002f62375) commit want 0 on [dir 10002f62375 /meteo_data/opt/wrf/umbriel/input_arch/2013111506/ [2,head] auth v=453 cv=453/453 state=1073741824 f(v0 m2013-11-15 10:45:41.467229 34=34+0) n(v0 rc2013-11-15 10:45:41.467229 b288063844 34=34+0) hs=34+0,ss=0+0 dirty=18 | child=1 authpin=0 0x4ec4000]
0> 2013-11-17 11:02:05.057561 7feee2234700 -1 mds/CDir.cc: In function 'void CDir::commit(version_t, Context*, bool)' thread 7feee2234700 time 2013-11-17 11:02:05.053725
mds/CDir.cc: 1718: FAILED assert(want > committed_version)
ceph version 0.72-3-g5e1e02c (5e1e02c99b620fa4ffd2b455eb8e005b172fa05c)
1: (CDir::commit(unsigned long, Context*, bool)+0x325) [0x80beb5]
2: (LogSegment::try_to_expire(MDS*, C_GatherBuilder&)+0x214a) [0x6680aa]
3: (MDLog::try_expire(LogSegment*)+0x66) [0x85a476]
4: (MDLog::_maybe_expired(LogSegment*)+0xb2) [0x85b482]
5: (Context::complete(int)+0x9) [0x62bec9]
6: (C_Gather::delete_me()+0x16) [0x62c436]
7: (C_Gather::sub_finish(Context*, int)+0x24d) [0x62fe4d]
8: (C_Gather::C_GatherSub::finish(int)+0x12) [0x62ff52]
9: (Context::complete(int)+0x9) [0x62bec9]
10: (CInode::_stored_backtrace(unsigned long, Context*)+0x8e) [0x81882e]
11: (Context::complete(int)+0x9) [0x62bec9]
12: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0x10e3) [0x87e573]
13: (MDS::handle_core_message(Message*)+0xc77) [0x64d087]
14: (MDS::_dispatch(Message*)+0x33) [0x64d1a3]
15: (MDS::ms_dispatch(Message*)+0xbb) [0x64f03b]
16: (DispatchQueue::entry()+0x4fb) [0xa2002b]
17: (DispatchQueue::DispatchThread::entry()+0xd) [0x946fad]
18: (()+0x7f8e) [0x7feee5d66f8e]
19: (clone()+0x6d) [0x7feee454da0d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
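The log line at -1 shows "commit want 0" on a directory with v=453 and cv=453/453. The sketch below illustrates why that combination could trip the assert. It is not the actual CDir code: `DirVersions`, `effective_want` and `commit_precondition_holds` are hypothetical names, and the rule that a `want` of 0 means "commit up to the current version" is an assumption. Both numbers after cv= are 453, so the committed version is taken to be 453.

```cpp
#include <cassert>
#include <cstdint>

using version_t = std::uint64_t;

// Hypothetical stand-in for the counters printed in the log line
// ("v=453 cv=453/453"); this is not the real CDir class.
struct DirVersions {
  version_t version;            // "v=" in the log
  version_t committed_version;  // taken from "cv=" in the log
};

// Assumption: want == 0 is shorthand for "the current version".
constexpr version_t effective_want(const DirVersions& d, version_t want) {
  return want == 0 ? d.version : want;
}

// Mirrors the failed check: assert(want > committed_version).
constexpr bool commit_precondition_holds(const DirVersions& d, version_t want) {
  return effective_want(d, want) > d.committed_version;
}

// Values from the log: nothing new to commit, so the check fails.
static_assert(!commit_precondition_holds(DirVersions{453, 453}, 0),
              "453 > 453 is false");

// A directory one version ahead of its committed state would pass.
static_assert(commit_precondition_holds(DirVersions{454, 453}, 0),
              "454 > 453 is true");
```

Under these assumptions, the directory is listed as dirty in the log segment (dirty=18 in the log line) while its on-disk version already matches its in-memory version, so there is nothing left to commit when the segment tries to expire.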
Updated by Zheng Yan over 10 years ago
It looks like the FS got corrupted. I suggest copying the data out and re-creating the FS.
Add the following line to ceph.conf; I hope it avoids triggering the assertion.
mds log_max_segments = 100000
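As a sketch of where the line might go, assuming it belongs in the [mds] section of ceph.conf (the section is not stated here, and [global] may work as well):

```
[mds]
    mds log_max_segments = 100000
```

The option presumably raises the number of journal segments the MDS keeps before it starts expiring old ones. A very large value would then postpone the try_to_expire call seen in the backtrace, but would not repair the affected directory. The MDS would need a restart to pick up the change from the file.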
Updated by Maros Vegh over 10 years ago
Thanks for the advice.
The "mds log_max_segments = 100000" setting avoided the assertion.
I'm starting to copy the data out of the FS.