Bug #6791

closed

mds assert after startup - CDir::commit error (want > committed version)

Added by Maros Vegh over 10 years ago. Updated about 10 years ago.

Status: Won't Fix
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Severity: 3 - minor

Description

On upgrading from 0.67 to 0.72 I hit bug #6755.
I repaired the system with ceph_filestore_tool as described in bug #6761.
After that I started the system with wip-6761-emperor. All PGs are active+clean.

However, the MDS asserts after startup with this error:
mds/CDir.cc: In function 'void CDir::commit(version_t, Context*, bool)' thread 7fb82cc64700 time 2013-11-16 22:35:56.495266
mds/CDir.cc: 1718: FAILED assert(want > committed_version)

My system runs on Ubuntu 13.04.


Files

ceph-mds.b.log (2.14 MB), Maros Vegh, 11/16/2013 01:43 PM
ceph-mds.b.log.bug6791.log10.tar.gz (36.2 MB), Maros Vegh, 11/17/2013 04:35 AM
Actions #1

Updated by Maros Vegh over 10 years ago

At a higher log level I can see that this happens during "try_to_expire" on a journal LogSegment:

-4> 2013-11-17 11:02:05.053688 7feee2234700 10 mds.0.cache.ino(10002f3a3fd) clear_dirty_parent
-3> 2013-11-17 11:02:05.053693 7feee2234700 10 mds.0.log _maybe_expired segment 3213692504680 2387 events
-2> 2013-11-17 11:02:05.053697 7feee2234700 6 mds.0.journal LogSegment(3213692504680).try_to_expire
-1> 2013-11-17 11:02:05.053707 7feee2234700 10 mds.0.cache.dir(10002f62375) commit want 0 on [dir 10002f62375 /meteo_data/opt/wrf/umbriel/input_arch/2013111506/ [2,head] auth v=453 cv=453/453 state=1073741824 f(v0 m2013-11-15 10:45:41.467229 34=34+0) n(v0 rc2013-11-15 10:45:41.467229 b288063844 34=34+0) hs=34+0,ss=0+0 dirty=18 | child=1 authpin=0 0x4ec4000]
0> 2013-11-17 11:02:05.057561 7feee2234700 -1 mds/CDir.cc: In function 'void CDir::commit(version_t, Context*, bool)' thread 7feee2234700 time 2013-11-17 11:02:05.053725
mds/CDir.cc: 1718: FAILED assert(want > committed_version)
ceph version 0.72-3-g5e1e02c (5e1e02c99b620fa4ffd2b455eb8e005b172fa05c)
1: (CDir::commit(unsigned long, Context*, bool)+0x325) [0x80beb5]
2: (LogSegment::try_to_expire(MDS*, C_GatherBuilder&)+0x214a) [0x6680aa]
3: (MDLog::try_expire(LogSegment*)+0x66) [0x85a476]
4: (MDLog::_maybe_expired(LogSegment*)+0xb2) [0x85b482]
5: (Context::complete(int)+0x9) [0x62bec9]
6: (C_Gather::delete_me()+0x16) [0x62c436]
7: (C_Gather::sub_finish(Context*, int)+0x24d) [0x62fe4d]
8: (C_Gather::C_GatherSub::finish(int)+0x12) [0x62ff52]
9: (Context::complete(int)+0x9) [0x62bec9]
10: (CInode::_stored_backtrace(unsigned long, Context*)+0x8e) [0x81882e]
11: (Context::complete(int)+0x9) [0x62bec9]
12: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0x10e3) [0x87e573]
13: (MDS::handle_core_message(Message*)+0xc77) [0x64d087]
14: (MDS::_dispatch(Message*)+0x33) [0x64d1a3]
15: (MDS::ms_dispatch(Message*)+0xbb) [0x64f03b]
16: (DispatchQueue::entry()+0x4fb) [0xa2002b]
17: (DispatchQueue::DispatchThread::entry()+0xd) [0x946fad]
18: (()+0x7f8e) [0x7feee5d66f8e]
19: (clone()+0x6d) [0x7feee454da0d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
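
The -1 log entry above shows `commit want 0` on a directory with `v=453 cv=453/453`, immediately followed by the failed `assert(want > committed_version)`. The following is a minimal sketch of how those numbers could trip that check. It assumes, without confirmation from this ticket, that a `want` of 0 is treated as "commit the current version". `DirState`, `commit_precondition_holds` and `demo_from_log` are hypothetical names for illustration, not Ceph code.

```cpp
#include <cassert>
#include <cstdint>

using version_t = std::uint64_t;

// Hypothetical stand-in for the directory state printed in the log line.
struct DirState {
  version_t version;            // "v=453" in the log
  version_t committed_version;  // "cv=453/453" in the log (both values are 453)
};

// Returns true when a commit request would satisfy the condition from the
// failed assertion, `want > committed_version`.
// Assumption: want == 0 means "commit whatever the current version is".
bool commit_precondition_holds(const DirState& dir, version_t want) {
  if (want == 0) {
    want = dir.version;
  }
  return want > dir.committed_version;
}

// Values taken from the log entry: want 0, v=453, cv=453/453.
void demo_from_log() {
  DirState dir{453, 453};
  // 453 > 453 is false, so the precondition does not hold.
  assert(!commit_precondition_holds(dir, 0));
}
```

Under that assumption, the directory is flagged for commit (`dirty=18` in the log) although its version already equals the committed version, so there is nothing newer to write and the check fails.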
Actions #2

Updated by Zheng Yan over 10 years ago

It looks like the FS got corrupted. I suggest copying the data out and re-creating the FS.

Add the following line to ceph.conf; I hope it avoids triggering the assertion.

mds log_max_segments = 100000
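
One plausible reading of this workaround, sketched below, is that the MDS only tries to expire journal segments once the number of live segments exceeds the configured maximum, so a very large limit keeps `try_to_expire` (frame 2 of the backtrace) from being reached. This is an assumption, not something the ticket states. `segments_to_expire` is a hypothetical helper and the limit of 30 in the comments is only an illustrative small value.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, simplified model of a journal trimming decision.
// Assumption: segments become candidates for expiry only when the number of
// live segments exceeds the configured maximum.
std::size_t segments_to_expire(std::size_t live_segments,
                               std::size_t max_segments) {
  return live_segments > max_segments ? live_segments - max_segments : 0;
}

void demo_limits() {
  // With a small limit (30 here, illustrative), 50 live segments leave 20
  // candidates for expiry.
  assert(segments_to_expire(50, 30) == 20);
  // With the limit from the comment above, nothing is expired.
  assert(segments_to_expire(50, 100000) == 0);
}
```

If this reading is right, the setting only sidesteps the failing code path; the journal keeps growing and the underlying metadata problem remains, which fits the advice to copy the data out and re-create the FS.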

Actions #3

Updated by Maros Vegh over 10 years ago

Thanks for the advice.
The "mds log_max_segments = 100000" avoided the assertion.

I'm starting to copy the data out of the FS.

Actions #4

Updated by Loïc Dachary about 10 years ago

  • Project changed from Ceph to CephFS
Actions #5

Updated by Zheng Yan about 10 years ago

  • Status changed from New to Won't Fix