
Bug #17113

MDS EImport crashing with mds/journal.cc: 2929: FAILED assert(mds->sessionmap.get_version() == cmapv)

Added by Tomasz Torcz about 3 years ago. Updated 6 months ago.

Status: Can't reproduce
Priority: High
Assignee: -
Category: -
Target version: -
Start date: 08/24/2016
Due date: -
% Done: 0%
Source: Community (user)
Tags: -
Backport: -
Regression: No
Severity: 2 - major
Reviewed: -
Affected Versions: -
ceph-qa-suite: fs
Component(FS): MDS
Labels (FS): multimds
Pull request ID: -

Description

I have a tiny Ceph cluster (3×mon, 8×osd, 2×mds) running ceph-mds-10.2.2-2.fc24.x86_64.
Recently, one of the clients using ceph-fuse rebooted while writing to a file on CephFS (the reboot was not Ceph-related).
Since then, CephFS is not mountable and one MDS is in a constant crash loop.

The relevant part of the log (collected at debug level 20) is:

-2> 2016-08-24 11:34:32.637283 7f2820010700 10 mds.1.cache   |____ 1    auth [dir 10000000000 /seurat/ [2,head] auth v=12908 cv=0/0 dir_auth=1 state=1610612736 f(v1 m2016-08-22 13:52:45.153294 5=0+5) n(v1274 rc2016-08-24 08:20:10.353155 b5238983659 7244=7185+59) hs=4+0,ss=0+0 dirty=4 | child=1 subtree=1 dirty=1 0x555590642610]
-1> 2016-08-24 11:34:32.637312 7f2820010700 10 mds.1.journal EImportStart.replay sessionmap 51587 < 51590
0> 2016-08-24 11:34:32.638430 7f2820010700 -1 mds/journal.cc: In function 'virtual void EImportStart::replay(MDSRank*)' thread 7f2820010700 time 2016-08-24 11:34:32.637337
mds/journal.cc: 2929: FAILED assert(mds->sessionmap.get_version() == cmapv)
ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x555583d922a0]
2: (EImportStart::replay(MDSRank*)+0x9e5) [0x555583c6be75]
3: (MDLog::_replay_thread()+0xe73) [0x555583beab23]
4: (MDLog::ReplayThread::entry()+0xd) [0x55558399584d]
5: (()+0x75ca) [0x7f282c00e5ca]
6: (clone()+0x6d) [0x7f282aa4ef6d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

I'm uploading a logfile created with a lower log level (level 20 leaks too much about our data).

ceph-mds.switcheroo.log (127 KB) Tomasz Torcz, 08/24/2016 09:44 AM

History

#1 Updated by Greg Farnum about 3 years ago

  • Status changed from New to Need More Info
  • Source changed from other to Community (user)

It looks like you're running with multiple active MDSes, which is not currently recommended. We saw this in #16043 as well, but in that case a user had tried to use repair tools incorrectly.

In order to diagnose this we would need at least full logs of a replay with full debug settings (including "debug mds log = 20"). You can upload them with ceph-post-file and they'll only be accessible to Ceph devs, if you're willing.
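For reference, the debug settings Greg mentions would go in the [mds] section of ceph.conf on the crashing MDS; a minimal sketch (the "debug mds log = 20" value comes from the comment above, the other two are commonly paired with it and are an assumption here):

```ini
[mds]
  debug mds = 20
  debug mds log = 20
  debug ms = 1
```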

You should be able to recover from this with no/minimal data loss by doing the journal scavenge operations, resetting the journals, and resetting the sessionmap. See http://docs.ceph.com/docs/master/cephfs/disaster-recovery/, and then stick with one active MDS!
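The scavenge/reset sequence Greg describes roughly corresponds to the following commands from the CephFS disaster-recovery tooling of that era (10.2.x). This is a sketch, not a verified recovery script for this cluster; these operations discard journal data, so run them only against an offline filesystem and after reading the linked docs:

```shell
# 1. Scavenge recoverable metadata from the damaged journal
cephfs-journal-tool event recover_dentries summary

# 2. Reset the journal (discards any unreplayed events)
cephfs-journal-tool journal reset

# 3. Reset the session table, clearing the mismatched sessionmap
cephfs-table-tool all reset session

# 4. Drop back to a single active MDS before restarting, per Greg's advice
ceph mds set max_mds 1
```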

#2 Updated by Tomasz Torcz about 3 years ago

Full log was uploaded ceph-post-file: 610fd186-9150-4e6b-8050-37dc314af39b

Before I recover, I'd really like to see this bug fixed. Allowing a client to break the whole FS cluster is bad. I understand this may be one of those "don't use multiple MDSs" situations, but still.

P.S. ceph-post-file uses a DSA key for authentication. DSA keys have been deprecated in OpenSSH and are not accepted by default, so "PubkeyAcceptedKeyTypes +ssh-dss" needs to be added to the SSH config.
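The workaround Tomasz describes would look like this as an ~/.ssh/config entry (the host name drop.ceph.com is an assumption about where ceph-post-file uploads):

```
Host drop.ceph.com
    PubkeyAcceptedKeyTypes +ssh-dss
```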

#3 Updated by Greg Farnum about 3 years ago

  • Project changed from Ceph to fs
  • Subject changed from MDS crashing with mds/journal.cc: 2929: FAILED assert(mds->sessionmap.get_version() == cmapv) to MDS EImport crashing with mds/journal.cc: 2929: FAILED assert(mds->sessionmap.get_version() == cmapv)
  • Category set to 90
  • Status changed from Need More Info to Verified
  • Component(FS) MDS added

It's not super-likely the rebooting client actually caused this problem. If it did, it was only incidental, and it's definitely a problem exclusive to multi-MDS systems. No guarantees how long it will take until we diagnose and fix it.

Thanks for the note about the ceph-post-file SSH issue; I've made a ticket: http://tracker.ceph.com/issues/17137

#4 Updated by Tomasz Torcz about 3 years ago

Will the full logs be enough to diagnose this?
I'd like to start recovering this cluster, but if you need me to run additional debugging I will wait.

#5 Updated by Greg Farnum about 3 years ago

I think the logs you've provided should be enough. Thanks!

#6 Updated by John Spray almost 3 years ago

  • Priority changed from Normal to High
  • Target version set to v12.0.0

#7 Updated by Zheng Yan over 2 years ago

  • Status changed from Verified to Can't reproduce

#8 Updated by Patrick Donnelly 6 months ago

  • Category deleted (90)
  • Labels (FS) multimds added
