Bug #23250

mds: crash during replay: interval_set.h: 396: FAILED assert(p->first > start+len)

Added by Christoffer Lilja about 6 years ago. Updated over 5 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite: fs
Component(FS): MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

MDS crash during replay
Full log attached.

starting mds.orbit at -
/build/ceph-12.2.4/src/include/interval_set.h: In function 'void interval_set<T>::insert(T, T, T*, T*) [with T = inodeno_t]' thread 7fba87588700 time 2018-03-06 18:47:24.258340
/build/ceph-12.2.4/src/include/interval_set.h: 396: FAILED assert(p->first > start+len)
 ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x55ebf9502942]
 2: (InoTable::replay_release_ids(interval_set<inodeno_t>&)+0x9f7) [0x55ebf94051a7]
 3: (ESession::replay(MDSRank*)+0x3f0) [0x55ebf9491020]
 4: (MDLog::_replay_thread()+0xc6b) [0x55ebf94554bb]
 5: (MDLog::ReplayThread::entry()+0xd) [0x55ebf91d0fcd]
 6: (()+0x76ba) [0x7fba9488a6ba]
 7: (clone()+0x6d) [0x7fba938f641d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2018-03-06 18:47:24.259376 7fba87588700 -1 /build/ceph-12.2.4/src/include/interval_set.h: In function 'void interval_set<T>::insert(T, T, T*, T*) [with T = inodeno_t]' thread 7fba87588700 time 2018-03-06 18:47:24.258340
/build/ceph-12.2.4/src/include/interval_set.h: 396: FAILED assert(p->first > start+len)

 ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x55ebf9502942]
 2: (InoTable::replay_release_ids(interval_set<inodeno_t>&)+0x9f7) [0x55ebf94051a7]
 3: (ESession::replay(MDSRank*)+0x3f0) [0x55ebf9491020]
 4: (MDLog::_replay_thread()+0xc6b) [0x55ebf94554bb]
 5: (MDLog::ReplayThread::entry()+0xd) [0x55ebf91d0fcd]
 6: (()+0x76ba) [0x7fba9488a6ba]
 7: (clone()+0x6d) [0x7fba938f641d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

     0> 2018-03-06 18:47:24.259376 7fba87588700 -1 /build/ceph-12.2.4/src/include/interval_set.h: In function 'void interval_set<T>::insert(T, T, T*, T*) [with T = inodeno_t]' thread 7fba87588700 time 2018-03-06 18:47:24.258340
/build/ceph-12.2.4/src/include/interval_set.h: 396: FAILED assert(p->first > start+len)

 ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x55ebf9502942]
 2: (InoTable::replay_release_ids(interval_set<inodeno_t>&)+0x9f7) [0x55ebf94051a7]
 3: (ESession::replay(MDSRank*)+0x3f0) [0x55ebf9491020]
 4: (MDLog::_replay_thread()+0xc6b) [0x55ebf94554bb]
 5: (MDLog::ReplayThread::entry()+0xd) [0x55ebf91d0fcd]
 6: (()+0x76ba) [0x7fba9488a6ba]
 7: (clone()+0x6d) [0x7fba938f641d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

*** Caught signal (Aborted) **
 in thread 7fba87588700 thread_name:md_log_replay
 ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
 1: (()+0x5ab254) [0x55ebf94bc254]
 2: (()+0x11390) [0x7fba94894390]
 3: (gsignal()+0x38) [0x7fba93824428]
 4: (abort()+0x16a) [0x7fba9382602a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) [0x55ebf9502ace]
 6: (InoTable::replay_release_ids(interval_set<inodeno_t>&)+0x9f7) [0x55ebf94051a7]
 7: (ESession::replay(MDSRank*)+0x3f0) [0x55ebf9491020]
 8: (MDLog::_replay_thread()+0xc6b) [0x55ebf94554bb]
 9: (MDLog::ReplayThread::entry()+0xd) [0x55ebf91d0fcd]
 10: (()+0x76ba) [0x7fba9488a6ba]
 11: (clone()+0x6d) [0x7fba938f641d]
2018-03-06 18:47:24.261559 7fba87588700 -1 *** Caught signal (Aborted) **
 in thread 7fba87588700 thread_name:md_log_replay

 ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
 1: (()+0x5ab254) [0x55ebf94bc254]
 2: (()+0x11390) [0x7fba94894390]
 3: (gsignal()+0x38) [0x7fba93824428]
 4: (abort()+0x16a) [0x7fba9382602a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) [0x55ebf9502ace]
 6: (InoTable::replay_release_ids(interval_set<inodeno_t>&)+0x9f7) [0x55ebf94051a7]
 7: (ESession::replay(MDSRank*)+0x3f0) [0x55ebf9491020]
 8: (MDLog::_replay_thread()+0xc6b) [0x55ebf94554bb]
 9: (MDLog::ReplayThread::entry()+0xd) [0x55ebf91d0fcd]
 10: (()+0x76ba) [0x7fba9488a6ba]
 11: (clone()+0x6d) [0x7fba938f641d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

     0> 2018-03-06 18:47:24.261559 7fba87588700 -1 *** Caught signal (Aborted) **
 in thread 7fba87588700 thread_name:md_log_replay

 ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
 1: (()+0x5ab254) [0x55ebf94bc254]
 2: (()+0x11390) [0x7fba94894390]
 3: (gsignal()+0x38) [0x7fba93824428]
 4: (abort()+0x16a) [0x7fba9382602a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) [0x55ebf9502ace]
 6: (InoTable::replay_release_ids(interval_set<inodeno_t>&)+0x9f7) [0x55ebf94051a7]
 7: (ESession::replay(MDSRank*)+0x3f0) [0x55ebf9491020]
 8: (MDLog::_replay_thread()+0xc6b) [0x55ebf94554bb]
 9: (MDLog::ReplayThread::entry()+0xd) [0x55ebf91d0fcd]
 10: (()+0x76ba) [0x7fba9488a6ba]
 11: (clone()+0x6d) [0x7fba938f641d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Aborted (core dumped)
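
For readers unfamiliar with the assertion in the trace above: the failing check sits in interval_set<T>::insert, reached from InoTable::replay_release_ids. The following is a minimal Python sketch of the general idea, not Ceph's implementation. It assumes that an interval set stores disjoint ranges and refuses to insert a range that overlaps one it already holds. The class IntervalSet, its methods, and the inode numbers are hypothetical, and the merging of adjacent ranges that a real interval set would do is left out.

```python
import bisect


class IntervalSet:
    """Toy set of disjoint half-open intervals [start, start + length)."""

    def __init__(self):
        self._starts = []    # interval starts, kept sorted
        self._length = {}    # start -> length

    def insert(self, start, length):
        i = bisect.bisect_left(self._starts, start)
        # The neighbour on the left must end at or before the new start.
        if i > 0:
            prev = self._starts[i - 1]
            if prev + self._length[prev] > start:
                raise AssertionError("overlaps the preceding interval")
        # The neighbour at or to the right must begin at or after the new end.
        if i < len(self._starts):
            nxt = self._starts[i]
            if nxt < start + length:
                raise AssertionError("overlaps the following interval")
        self._starts.insert(i, start)
        self._length[start] = length

    def intervals(self):
        return [(s, self._length[s]) for s in self._starts]


free_inos = IntervalSet()
free_inos.insert(1000, 100)      # hypothetical: inode numbers 1000..1099 are free
free_inos.insert(2000, 10)       # disjoint range: accepted

try:
    free_inos.insert(1000, 10)   # "release" numbers that are already free
except AssertionError as err:
    print("insert rejected:", err)
```

In this sketch the third insert is rejected because the range is already present. That is one way an overlap check of this kind can fire; the thread does not establish what actually led to the overlapping insert during replay.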

ceph-mds.orbit.log.gz (312 KB) Christoffer Lilja, 03/06/2018 06:17 PM

History

#1 Updated by Christoffer Lilja about 6 years ago

Here is a log captured with "debug mds = 20" enabled.

Because the file is large, I am sharing it through Google Drive:
https://drive.google.com/open?id=1tdj8cblEjzqhM51Dgv3MKMo3ZEn_ftmS

#2 Updated by Christoffer Lilja about 6 years ago

New link to the MDS log; the previous one didn't work for some reason:
https://drive.google.com/open?id=1S1aAbst5yGIBpbUAG1IfkoFTPSeSzvbV

#3 Updated by Sage Weil about 6 years ago

  • Project changed from bluestore to CephFS

#4 Updated by Patrick Donnelly about 6 years ago

  • Subject changed from MDS crash during replay to mds: crash during replay: interval_set.h: 396: FAILED assert(p->first > start+len)
  • Description updated (diff)
  • Source set to Community (user)
  • Component(FS) MDS added

#5 Updated by Christoffer Lilja about 6 years ago

I managed to get past this crash this way:
  • cephfs-journal-tool journal export backup.bin (back up first, of course, even if it's just a test)
  • cephfs-journal-tool event recover_dentries summary
  • cephfs-journal-tool journal reset
  • cephfs-table-tool all reset session
  • ceph fs reset cephfs --yes-i-really-mean-it

(taken directly from http://docs.ceph.com/docs/luminous/cephfs/disaster-recovery/)
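
The sequence above could be wrapped in a small dry-run script so the order is fixed and nothing runs by accident. This is only a sketch: the commands are copied verbatim from the list, while the function run_recovery and its execute flag are hypothetical names introduced here. By default it only prints the steps. With execute=True it runs them in order and stops at the first command that exits with an error.

```python
import shlex
import subprocess

# Commands exactly as listed in the comment; "cephfs" is the file system name used there.
RECOVERY_STEPS = [
    "cephfs-journal-tool journal export backup.bin",
    "cephfs-journal-tool event recover_dentries summary",
    "cephfs-journal-tool journal reset",
    "cephfs-table-tool all reset session",
    "ceph fs reset cephfs --yes-i-really-mean-it",
]


def run_recovery(steps=RECOVERY_STEPS, execute=False):
    """Print each step; run it only when execute=True, stopping at the first failure."""
    planned = []
    for step in steps:
        argv = shlex.split(step)
        planned.append(argv)
        print("+", step)
        if execute:
            # check=True raises CalledProcessError on a non-zero exit status,
            # so later (destructive) steps do not run after a failed one.
            subprocess.run(argv, check=True)
    return planned


if __name__ == "__main__":
    run_recovery()  # dry run: prints the commands without touching the cluster
```

Keeping the journal export as the first step means the destructive resets are never reached if the backup fails.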

This was only a test before I scrap my old Ceph setup and restore all the files from backup.
I'm not claiming that nothing is lost or corrupted, but I can now start my MDS servers and mount the CephFS filesystem.

I hope this helps someone out there.

#6 Updated by Patrick Donnelly about 6 years ago

  • Assignee set to Zheng Yan

#7 Updated by Zheng Yan about 6 years ago

Looks like InoTable::repair is buggy (it shouldn't increase the inotable version without submitting a log event). Did you run a scrub before this crash happened?

#8 Updated by Christoffer Lilja about 6 years ago

I haven't run any MDS scrub; I never found out how to do that properly. I did run a PG scrub of all metadata PGs, though.

#9 Updated by Zheng Yan about 6 years ago

No, PG scrub has nothing to do with metadata scrub. No idea what caused the corruption.

#10 Updated by Zheng Yan almost 6 years ago

  • Status changed from New to Need More Info

#11 Updated by Zheng Yan over 5 years ago

  • Status changed from Need More Info to Closed
