MDS EImport crashing with mds/journal.cc: 2929: FAILED assert(mds->sessionmap.get_version() == cmapv)
I have a tiny Ceph cluster (3× mon, 8× osd, 2× mds) running ceph-mds-10.2.2-2.fc24.x86_64.
Recently, one of the clients using ceph-fuse rebooted while writing to a file on CephFS (the reboot was not Ceph-related).
Since then, CephFS has not been mountable and one MDS is in a constant crash loop.
The relevant part of the log (collected at debug level 20) is:
-2> 2016-08-24 11:34:32.637283 7f2820010700 10 mds.1.cache |____ 1 auth [dir 10000000000 /seurat/ [2,head] auth v=12908 cv=0/0 dir_auth=1 state=1610612736 f(v1 m2016-08-22 13:52:45.153294 5=0+5) n(v1274 rc2016-08-24 08:20:10.353155 b5238983659 7244=7185+59) hs=4+0,ss=0+0 dirty=4 | child=1 subtree=1 dirty=1 0x555590642610]
-1> 2016-08-24 11:34:32.637312 7f2820010700 10 mds.1.journal EImportStart.replay sessionmap 51587 < 51590
0> 2016-08-24 11:34:32.638430 7f2820010700 -1 mds/journal.cc: In function 'virtual void EImportStart::replay(MDSRank*)' thread 7f2820010700 time 2016-08-24 11:34:32.637337
mds/journal.cc: 2929: FAILED assert(mds->sessionmap.get_version() == cmapv)
ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x555583d922a0]
2: (EImportStart::replay(MDSRank*)+0x9e5) [0x555583c6be75]
3: (MDLog::_replay_thread()+0xe73) [0x555583beab23]
4: (MDLog::ReplayThread::entry()+0xd) [0x55558399584d]
5: (()+0x75ca) [0x7f282c00e5ca]
6: (clone()+0x6d) [0x7f282aa4ef6d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
I'm uploading a logfile created with a lower log level (level 20 leaks too much information about our data).
#1 Updated by Greg Farnum about 3 years ago
- Status changed from New to Need More Info
- Source changed from other to Community (user)
It looks like you're running with multiple active MDSes, which is not currently recommended. We saw this in #16043 as well, but in that case a user had tried to use repair tools incorrectly.
In order to diagnose this we would need at least full logs of a replay with full debug settings (including "debug mds log = 20"). You can upload them with ceph-post-file and they'll only be accessible to Ceph devs, if you're willing...
You should be able to recover from this with no/minimal data loss by doing the journal scavenge operations, resetting the journals, and resetting the sessionmap. See http://docs.ceph.com/docs/master/cephfs/disaster-recovery/, and then stick with one active MDS!
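For reference, the recovery sequence described above looks roughly like the following on a Jewel-era cluster. This is a sketch based on the linked disaster-recovery documentation, not a tested procedure for this specific cluster; tool names and flags should be verified against the docs for your exact Ceph version before running anything, and all MDS daemons should be stopped first.

```shell
# Sketch of the recovery steps (Jewel-era tool names assumed; verify
# against http://docs.ceph.com/docs/master/cephfs/disaster-recovery/).

# 1. Scavenge recoverable dentries from the damaged journal back into
#    the metadata pool ("journal scavenge").
cephfs-journal-tool event recover_dentries summary

# 2. Reset (truncate) the journal. Recent uncommitted metadata updates
#    are lost, which is the "minimal data loss" mentioned above.
cephfs-journal-tool journal reset

# 3. Reset the session table, so replay no longer trips the
#    sessionmap-version assert.
cephfs-table-tool all reset session

# 4. Restart a single MDS and keep max_mds at 1 afterwards.
```

The key point is the ordering: dentries must be scavenged before the journal is reset, or the unreplayed events are simply discarded.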
#2 Updated by Tomasz Torcz about 3 years ago
The full log was uploaded with ceph-post-file: 610fd186-9150-4e6b-8050-37dc314af39b
Before I recover, I'd really like to see this bug fixed. Allowing a client to break the whole FS cluster is bad. I understand this may be one of those "don't use multiple MDSes" situations, but still.
P.S. ceph-post-file uses a DSA key for authorisation. DSA keys have been deprecated by OpenSSH and are not accepted by default, so "PubkeyAcceptedKeyTypes +ssh-dss" needs to be added to the SSH config.
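For anyone hitting the same failure, a minimal ssh_config entry along these lines re-enables DSA keys for just the upload host (the host name is an assumption; check the server your copy of the ceph-post-file script actually connects to):

```
# ~/.ssh/config -- scope the deprecated key type to the upload host only
Host drop.ceph.com
    PubkeyAcceptedKeyTypes +ssh-dss
```

Scoping it under a Host block avoids re-enabling ssh-dss globally.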
#3 Updated by Greg Farnum about 3 years ago
- Project changed from Ceph to fs
- Subject changed from MDS crashing with mds/journal.cc: 2929: FAILED assert(mds->sessionmap.get_version() == cmapv) to MDS EImport crashing with mds/journal.cc: 2929: FAILED assert(mds->sessionmap.get_version() == cmapv)
- Category set to 90
- Status changed from Need More Info to Verified
- Component(FS) MDS added
It's not super-likely that the rebooting client actually caused this problem. If it did, it was only incidental, and it's definitely a problem exclusive to multi-MDS systems. No guarantees on how long it will take to diagnose and fix.
Thanks for flagging the ceph-post-file SSH issue; made a ticket: http://tracker.ceph.com/issues/17137
#7 Updated by Zheng Yan over 2 years ago
- Status changed from Verified to Can't reproduce
It's likely fixed by https://github.com/ceph/ceph/commit/a49726e10ef23be124d92872470fd258a1938d9e