Bug #395: mds: interval_set assert(0) during journal replay - CephFS - Ceph

Bug #395

closed

mds: interval_set assert(0) during journal replay

Added by Sage Weil over 13 years ago. Updated over 7 years ago.

Status: Resolved
Priority: Normal

Description

From ML:

Date: Wed, 8 Sep 2010 14:45:04 +1000
From: Nat N <phenisha@gmail.com>
To: ceph-devel@vger.kernel.org
Subject: MDS crashing

Hi, I am testing the Ceph file system. All has been going OK, but now it
seems my cmds is crashing with the following error:

.... <snip> ...
10.09.08_13:48:40.146886 419dc940 -- 172.17.8.3:6802/8771 <== osd8
172.17.8.11:6800/8930 7 ==== osd_op_reply(28 200.00000ef9 [read
0~4194304] = 0) v1 ==== 98+0+4194304 (1203150032 0 2774819477)
0xa22080
10.09.08_13:48:40.147220 44e45940 mds0.cache creating system inode with ino:100
10.09.08_13:48:41.293977 4333f940 -- 172.17.8.3:6802/8771 --> mon2
172.17.8.4:6789/0 -- mdsbeacon(8900/thorium003 up:replay seq 34 v212)
v1 -- ?+0 0x2145500
10.09.08_13:48:41.295762 419dc940 -- 172.17.8.3:6802/8771 <== mon2
172.17.8.4:6789/0 48 ==== mdsbeacon(8900/thorium003 up:replay seq 34
v212) v2 ==== 112+0+0 (2962285251 0 0) 0x2145500
./include/interval_set.h: In function 'void interval_set<T>::insert(T,
T) [with T = inodeno_t]':
./include/interval_set.h:202: FAILED assert(0)
 1: (EMetaBlob::replay(MDS*, LogSegment*)+0x3f75) [0x691625]
 2: (EUpdate::replay(MDS*)+0x38) [0x694d28]
 3: (MDLog::_replay_thread()+0x68e) [0x68801e]
 4: (MDLog::ReplayThread::entry()+0xd) [0x4bb3cd]
 5: (Thread::_entry_func(void*)+0xa) [0x49c71a]
 6: /lib64/libpthread.so.0 [0x31d960673d]
 7: (clone()+0x6d) [0x31d8ed3d1d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.

I am using the unstable git branch as well as kernel .35, one mds and
3 monitors with around 10 osds

Unfortunately, I do not have access to the core files, but please find
the objdump cmds here:
http://www.geopersonalassistant.com/dump/cmds.dump.gz

Updated by Sage Weil over 13 years ago

The problem was a session close event, followed by an open. The close didn't clear the session state, I believe because the client had already reconnected. This should fix it:

diff --git a/src/mds/journal.cc b/src/mds/journal.cc
index ec2013d..64fc6a3 100644
--- a/src/mds/journal.cc
+++ b/src/mds/journal.cc
@@ -725,6 +725,8 @@ void ESession::replay(MDS *mds)
       Session *session = mds->sessionmap.get_session(client_inst.name);
       if (session->is_closed())
        mds->sessionmap.remove_session(session);
+      else
+       session->clear();    // the client has reconnected; keep the Session, but reset
     }
   }
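
To illustrate why stale session state can trip an overlap assert during replay, here is a minimal, self-contained sketch. It is not Ceph's code: `SimpleIntervalSet` is a simplified stand-in for `interval_set<T>` (it does not merge adjacent ranges), the `Session` struct and its `prealloc_inos` member are hypothetical names, and it throws where the real code hits `assert(0)` so the failure can be observed. It also assumes that the state left behind by the close included an inode range that the later open inserted again; the report above does not confirm which range was involved.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>
#include <stdexcept>

// Simplified stand-in for interval_set<T>: stores disjoint [start, start+len) ranges.
class SimpleIntervalSet {
 public:
  // True if [start, start+len) overlaps any stored range.
  bool intersects(uint64_t start, uint64_t len) const {
    auto next = ranges_.lower_bound(start);  // first range starting at or after start
    if (next != ranges_.end() && next->first < start + len) return true;
    if (next != ranges_.begin()) {
      auto prev = std::prev(next);           // last range starting before start
      if (prev->first + prev->second > start) return true;
    }
    return false;
  }

  // Insert a range. An overlap is a logic error; this sketch throws
  // instead of asserting so callers can detect it.
  void insert(uint64_t start, uint64_t len) {
    assert(len > 0);  // precondition of this sketch only
    if (intersects(start, len))
      throw std::logic_error("interval_set: overlapping insert");
    ranges_[start] = len;
  }

  void clear() { ranges_.clear(); }
  bool empty() const { return ranges_.empty(); }

 private:
  std::map<uint64_t, uint64_t> ranges_;  // start -> length
};

// Hypothetical session with per-client state that must be reset on close.
struct Session {
  SimpleIntervalSet prealloc_inos;
  void clear() { prealloc_inos.clear(); }
};

// Replays open, close, open for one session. Returns false if the second
// open hits the overlap error because the close left old state behind.
bool replay_close_then_open(bool clear_on_close) {
  Session session;
  try {
    session.prealloc_inos.insert(0x1000, 100);  // replayed open
    if (clear_on_close) session.clear();        // replayed close resets state
    session.prealloc_inos.insert(0x1000, 100);  // replayed reopen, same range
  } catch (const std::logic_error&) {
    return false;
  }
  return true;
}
```

Calling `replay_close_then_open(false)` models the pre-fix behavior (the close keeps the session and its state, so the reopen collides), while `replay_close_then_open(true)` models the added `session->clear()` branch.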
