Bug #8811 (closed)
Journal corruption during upgrade to 0.82 with standby-replay daemons
Description
Two different ceph-users reports of hitting this issue on v0.82:
    0> 2014-07-09 23:21:43.385274 7fb7f7b83700 -1 mds/MDLog.cc: In function 'void MDLog::_replay_thread()' thread 7fb7f7b83700 time 2014-07-09 23:21:43.383304
    mds/MDLog.cc: 815: FAILED assert(journaler->is_readable())
    ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
    1: (MDLog::_replay_thread()+0x197b) [0x85a3cb]
    2: (MDLog::ReplayThread::entry()+0xd) [0x66466d]
    3: (()+0x8062) [0x7fb7ffda1062]
    4: (clone()+0x6d) [0x7fb7feb35a3d]
    NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

    0> 2014-07-10 11:35:36.107022 7f45f7c57700 -1 mds/MDLog.cc: In function 'void MDLog::_replay_thread()' thread 7f45f7c57700 time 2014-07-10 11:35:36.103147
    mds/MDLog.cc: 815: FAILED assert(journaler->is_readable())
    ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
    1: (MDLog::_replay_thread()+0x197b) [0x85a3cb]
    2: (MDLog::ReplayThread::entry()+0xd) [0x66466d]
    3: (()+0x6b50) [0x7f45ffdd7b50]
    4: (clone()+0x6d) [0x7f45fec000ed]
    NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
I went over the code a little bit and it looks good to me, but we just made the JournalStream changes so I'm sure that's the issue. For context, this MDLog assert follows a loop that waits until the Journaler is readable, so it appears to be changing its mind... presumably we're incorrectly manipulating the read_buf in some way?
Updated by John Spray almost 10 years ago
Hmmm. Aside from is_readable() giving inconsistent results, seems like this could happen if there was a bug that caused read_pos to get ahead of write_pos, because the check at the top of the _replay_thread loop is for get_read_pos() < get_write_pos(), but the check right before the assertion is for ==.
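To make the gap between the two checks concrete, here is a minimal sketch of the control flow as described in this comment. It is not the actual MDLog or Journaler code: `ToyJournaler`, `ReplayStep`, and `classify` are hypothetical names invented for illustration, and the `readable` flag merely stands in for whatever the real `is_readable()` computes. The assumed shape is a wait loop guarded by `get_read_pos() < get_write_pos()`, then an end-of-journal check using `==`, then the assertion.

```cpp
#include <cassert>
#include <cstdint>

// Toy stand-in for the journaler state; NOT the real Ceph Journaler class.
struct ToyJournaler {
  uint64_t read_pos;
  uint64_t write_pos;
  bool readable;  // hypothetical flag standing in for is_readable()

  bool is_readable() const { return readable; }
  uint64_t get_read_pos() const { return read_pos; }
  uint64_t get_write_pos() const { return write_pos; }
};

// What one pass of the (assumed) replay loop would do for a given state.
enum class ReplayStep { WouldWait, Finished, ReadEntry, AssertFails };

ReplayStep classify(const ToyJournaler& j) {
  // Top of the loop: keep waiting only while read_pos < write_pos.
  if (!j.is_readable() && j.get_read_pos() < j.get_write_pos())
    return ReplayStep::WouldWait;

  // Check right before the assertion: end of journal only if positions are equal.
  if (!j.is_readable() && j.get_read_pos() == j.get_write_pos())
    return ReplayStep::Finished;

  // Anything left reaches assert(journaler->is_readable()).
  return j.is_readable() ? ReplayStep::ReadEntry : ReplayStep::AssertFails;
}
```

In this model, a state with `read_pos` greater than `write_pos` and nothing readable (for example `ToyJournaler{30, 20, false}`) passes through both checks and lands on `AssertFails`, which is the scenario suggested above.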
Updated by John Spray almost 10 years ago
- Status changed from New to In Progress
This may be the result of a bug in the journal reformatting that occurs during upgrade, affecting systems using standby-replay MDS daemons. Journal corruption can occur when both an active and a standby-replay daemon attempt to do the rewrite at the same time.
Updated by John Spray almost 10 years ago
- Status changed from In Progress to Fix Under Review
Updated by John Spray almost 10 years ago
- Subject changed from MDLog::is_readable() assert to Journal corruption during upgrade to 0.82 with standby-replay daemons
Updated by Greg Farnum over 9 years ago
- Status changed from Fix Under Review to Resolved
This got fixed 11 days ago, but was never marked closed. Merged in commit:b9463e3497cc1f2a1bab0838430a4402d8c88af0