Bug #23250
mds: crash during replay: interval_set.h: 396: FAILED assert(p->first > start+len)
Description
MDS crash during replay
Full log attached.
starting mds.orbit at -
/build/ceph-12.2.4/src/include/interval_set.h: In function 'void interval_set<T>::insert(T, T, T*, T*) [with T = inodeno_t]' thread 7fba87588700 time 2018-03-06 18:47:24.258340
/build/ceph-12.2.4/src/include/interval_set.h: 396: FAILED assert(p->first > start+len)
 ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x55ebf9502942]
 2: (InoTable::replay_release_ids(interval_set<inodeno_t>&)+0x9f7) [0x55ebf94051a7]
 3: (ESession::replay(MDSRank*)+0x3f0) [0x55ebf9491020]
 4: (MDLog::_replay_thread()+0xc6b) [0x55ebf94554bb]
 5: (MDLog::ReplayThread::entry()+0xd) [0x55ebf91d0fcd]
 6: (()+0x76ba) [0x7fba9488a6ba]
 7: (clone()+0x6d) [0x7fba938f641d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2018-03-06 18:47:24.259376 7fba87588700 -1 /build/ceph-12.2.4/src/include/interval_set.h: In function 'void interval_set<T>::insert(T, T, T*, T*) [with T = inodeno_t]' thread 7fba87588700 time 2018-03-06 18:47:24.258340
/build/ceph-12.2.4/src/include/interval_set.h: 396: FAILED assert(p->first > start+len)
 ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x55ebf9502942]
 2: (InoTable::replay_release_ids(interval_set<inodeno_t>&)+0x9f7) [0x55ebf94051a7]
 3: (ESession::replay(MDSRank*)+0x3f0) [0x55ebf9491020]
 4: (MDLog::_replay_thread()+0xc6b) [0x55ebf94554bb]
 5: (MDLog::ReplayThread::entry()+0xd) [0x55ebf91d0fcd]
 6: (()+0x76ba) [0x7fba9488a6ba]
 7: (clone()+0x6d) [0x7fba938f641d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
     0> 2018-03-06 18:47:24.259376 7fba87588700 -1 /build/ceph-12.2.4/src/include/interval_set.h: In function 'void interval_set<T>::insert(T, T, T*, T*) [with T = inodeno_t]' thread 7fba87588700 time 2018-03-06 18:47:24.258340
/build/ceph-12.2.4/src/include/interval_set.h: 396: FAILED assert(p->first > start+len)
 ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x55ebf9502942]
 2: (InoTable::replay_release_ids(interval_set<inodeno_t>&)+0x9f7) [0x55ebf94051a7]
 3: (ESession::replay(MDSRank*)+0x3f0) [0x55ebf9491020]
 4: (MDLog::_replay_thread()+0xc6b) [0x55ebf94554bb]
 5: (MDLog::ReplayThread::entry()+0xd) [0x55ebf91d0fcd]
 6: (()+0x76ba) [0x7fba9488a6ba]
 7: (clone()+0x6d) [0x7fba938f641d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
*** Caught signal (Aborted) **
 in thread 7fba87588700 thread_name:md_log_replay
 ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
 1: (()+0x5ab254) [0x55ebf94bc254]
 2: (()+0x11390) [0x7fba94894390]
 3: (gsignal()+0x38) [0x7fba93824428]
 4: (abort()+0x16a) [0x7fba9382602a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) [0x55ebf9502ace]
 6: (InoTable::replay_release_ids(interval_set<inodeno_t>&)+0x9f7) [0x55ebf94051a7]
 7: (ESession::replay(MDSRank*)+0x3f0) [0x55ebf9491020]
 8: (MDLog::_replay_thread()+0xc6b) [0x55ebf94554bb]
 9: (MDLog::ReplayThread::entry()+0xd) [0x55ebf91d0fcd]
 10: (()+0x76ba) [0x7fba9488a6ba]
 11: (clone()+0x6d) [0x7fba938f641d]
2018-03-06 18:47:24.261559 7fba87588700 -1 *** Caught signal (Aborted) **
 in thread 7fba87588700 thread_name:md_log_replay
 ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
 1: (()+0x5ab254) [0x55ebf94bc254]
 2: (()+0x11390) [0x7fba94894390]
 3: (gsignal()+0x38) [0x7fba93824428]
 4: (abort()+0x16a) [0x7fba9382602a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) [0x55ebf9502ace]
 6: (InoTable::replay_release_ids(interval_set<inodeno_t>&)+0x9f7) [0x55ebf94051a7]
 7: (ESession::replay(MDSRank*)+0x3f0) [0x55ebf9491020]
 8: (MDLog::_replay_thread()+0xc6b) [0x55ebf94554bb]
 9: (MDLog::ReplayThread::entry()+0xd) [0x55ebf91d0fcd]
 10: (()+0x76ba) [0x7fba9488a6ba]
 11: (clone()+0x6d) [0x7fba938f641d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
     0> 2018-03-06 18:47:24.261559 7fba87588700 -1 *** Caught signal (Aborted) **
 in thread 7fba87588700 thread_name:md_log_replay
 ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
 1: (()+0x5ab254) [0x55ebf94bc254]
 2: (()+0x11390) [0x7fba94894390]
 3: (gsignal()+0x38) [0x7fba93824428]
 4: (abort()+0x16a) [0x7fba9382602a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) [0x55ebf9502ace]
 6: (InoTable::replay_release_ids(interval_set<inodeno_t>&)+0x9f7) [0x55ebf94051a7]
 7: (ESession::replay(MDSRank*)+0x3f0) [0x55ebf9491020]
 8: (MDLog::_replay_thread()+0xc6b) [0x55ebf94554bb]
 9: (MDLog::ReplayThread::entry()+0xd) [0x55ebf91d0fcd]
 10: (()+0x76ba) [0x7fba9488a6ba]
 11: (clone()+0x6d) [0x7fba938f641d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Aborted (core dumped)
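For readers unfamiliar with the failing check: the backtrace shows InoTable::replay_release_ids() inserting a range of inode numbers into an interval_set, and the assert firing inside insert(). The sketch below is a toy model in Python, not Ceph's implementation; the class IntervalSet and its methods are hypothetical names. It only illustrates the general idea that an interval set refuses to insert a range overlapping one it already holds, which is presumably what happens if replay tries to release inode numbers that are already marked free.

```python
import bisect


class IntervalSet:
    """Toy set of non-overlapping [start, start+length) ranges.

    Hypothetical illustration only; this is not Ceph's interval_set.
    """

    def __init__(self):
        self._starts = []  # sorted list of interval start values
        self._lens = {}    # start -> length

    def insert(self, start, length):
        assert length > 0, "empty interval"
        end = start + length
        i = bisect.bisect_right(self._starts, start)
        left = self._starts[i - 1] if i > 0 else None
        right = self._starts[i] if i < len(self._starts) else None

        # The new range must not overlap either neighbour.
        if left is not None:
            assert left + self._lens[left] <= start, "overlaps an existing interval"
        if right is not None:
            assert right >= end, "overlaps an existing interval"

        # Absorb a right neighbour that begins exactly where we end.
        if right is not None and right == end:
            end = right + self._lens.pop(right)
            self._starts.pop(i)

        # Extend a left neighbour that ends exactly where we begin,
        # otherwise record a new interval.
        if left is not None and left + self._lens[left] == start:
            self._lens[left] = end - left
        else:
            self._starts.insert(i, start)
            self._lens[start] = end - start

    def intervals(self):
        return [(s, self._lens[s]) for s in self._starts]


free = IntervalSet()
free.insert(100, 10)      # inode numbers 100..109 are free
free.insert(110, 5)       # adjacent range merges with the previous one
print(free.intervals())   # [(100, 15)]

try:
    free.insert(105, 2)   # "releasing" ids that are already free
except AssertionError as exc:
    print("insert refused:", exc)
```

In this model the overlap check runs before any state is modified, so a refused insert leaves the set unchanged.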
History
#1 Updated by Christoffer Lilja about 6 years ago
Here is a log with "debug mds = 20" enabled.
Because of its size, I am sharing it through my Google Drive:
https://drive.google.com/open?id=1tdj8cblEjzqhM51Dgv3MKMo3ZEn_ftmS
#2 Updated by Christoffer Lilja about 6 years ago
New link to the MDS log; the other one didn't work for some reason:
https://drive.google.com/open?id=1S1aAbst5yGIBpbUAG1IfkoFTPSeSzvbV
#3 Updated by Sage Weil about 6 years ago
- Project changed from bluestore to CephFS
#4 Updated by Patrick Donnelly about 6 years ago
- Subject changed from MDS crash during replay to mds: crash during replay: interval_set.h: 396: FAILED assert(p->first > start+len)
- Description updated (diff)
- Source set to Community (user)
- Component(FS) MDS added
#5 Updated by Christoffer Lilja about 6 years ago
- cephfs-journal-tool journal export backup.bin (of course take a backup, even if it's just a test)
- cephfs-journal-tool event recover_dentries summary
- cephfs-journal-tool journal reset
- cephfs-table-tool all reset session
- ceph fs reset cephfs --yes-i-really-mean-it
(taken directly from http://docs.ceph.com/docs/luminous/cephfs/disaster-recovery/)
This was only a test before I scrapped my old Ceph setup and restored all the files from backup.
I'm not saying that nothing is lost or corrupt here, but I can now start my MDS servers and mount the CephFS filesystem.
I hope this helps someone out there.
#6 Updated by Patrick Donnelly about 6 years ago
- Assignee set to Zheng Yan
#7 Updated by Zheng Yan about 6 years ago
Looks like InoTable::repair is buggy (it shouldn't increase the inotable version without submitting a log event). Did you run a scrub before this crash happened?
#8 Updated by Christoffer Lilja about 6 years ago
I haven't run any MDS scrub; I never found out how to do that properly. I did run a PG scrub of all metadata PGs, though.
#9 Updated by Zheng Yan about 6 years ago
No, PG scrub has nothing to do with metadata scrub. No idea what caused the corruption.
#10 Updated by Zheng Yan almost 6 years ago
- Status changed from New to Need More Info
#11 Updated by Zheng Yan over 5 years ago
- Status changed from Need More Info to Closed