Bug #8811

closed

Journal corruption during upgrade to 0.82 with standby-replay daemons

Added by Greg Farnum almost 10 years ago. Updated almost 8 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Two different ceph-users reports of hitting this issue on v0.82:

0> 2014-07-09 23:21:43.385274 7fb7f7b83700 -1 mds/MDLog.cc: In function 'void MDLog::_replay_thread()' thread 7fb7f7b83700 time 2014-07-09 23:21:43.383304
mds/MDLog.cc: 815: FAILED assert(journaler->is_readable())

 ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
 1: (MDLog::_replay_thread()+0x197b) [0x85a3cb]
 2: (MDLog::ReplayThread::entry()+0xd) [0x66466d]
 3: (()+0x8062) [0x7fb7ffda1062]
 4: (clone()+0x6d) [0x7fb7feb35a3d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

     0> 2014-07-10 11:35:36.107022 7f45f7c57700 -1 mds/MDLog.cc: In function 'void MDLog::_replay_thread()' thread 7f45f7c57700 time 2014-07-10 11:35:36.103147
mds/MDLog.cc: 815: FAILED assert(journaler->is_readable())

 ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
 1: (MDLog::_replay_thread()+0x197b) [0x85a3cb]
 2: (MDLog::ReplayThread::entry()+0xd) [0x66466d]
 3: (()+0x6b50) [0x7f45ffdd7b50]
 4: (clone()+0x6d) [0x7f45fec000ed]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

I went over the code a little bit and it looks good to me, but we just made the JournalStream changes so I'm sure that's the issue. For context, this MDLog assert follows a loop that waits until the Journaler is readable, so is_readable() appears to be changing its answer between the loop and the assert. Presumably we're incorrectly manipulating the read_buf in some way?

Actions #1

Updated by John Spray almost 10 years ago

Hmmm. Aside from is_readable() giving inconsistent results, it seems like this could also happen if a bug caused read_pos to get ahead of write_pos: the check at the top of the _replay_thread loop is for get_read_pos() < get_write_pos(), but the check right before the assertion is for ==. A read_pos greater than write_pos would fall through both checks and hit the assert.
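
To make the two checks described above concrete, here is a minimal C++ sketch of the decision the replay loop makes on each pass. It is a toy model built only from the description in this comment, not the MDLog code: `ToyJournal`, `Outcome`, and `next_step` are hypothetical names, and the `readable` field stands in for whatever is_readable() would return.

```cpp
#include <cstdint>

// Hypothetical stand-in for the Journaler state discussed above.
struct ToyJournal {
  uint64_t read_pos;
  uint64_t write_pos;
  bool readable;  // what is_readable() would report
};

enum class Outcome { KeepWaiting, CleanExit, ReplayEntry, AssertFails };

// Models one pass of the replay loop as described in this comment:
// the wait condition uses '<', the exit condition uses '=='.
Outcome next_step(const ToyJournal& j) {
  if (!j.readable && j.read_pos < j.write_pos)
    return Outcome::KeepWaiting;   // wait loop keeps spinning
  if (!j.readable && j.read_pos == j.write_pos)
    return Outcome::CleanExit;     // caught up with the writer
  if (j.readable)
    return Outcome::ReplayEntry;   // assert(is_readable()) holds
  // Only reachable when nothing is readable and read_pos > write_pos.
  return Outcome::AssertFails;
}
```

In this model the only state that reaches the assertion with nothing readable is read_pos > write_pos, which matches the gap between the `<` and `==` checks.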

Actions #2

Updated by John Spray almost 10 years ago

  • Status changed from New to In Progress

This may be the result of a bug in the journal reformatting that occurs during upgrade, affecting systems using standby-replay MDS daemons. Journal corruption can occur when both an active and a standby-replay daemon attempt to do the rewrite at the same time.
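
As a rough illustration of why two simultaneous rewriters are a problem, here is a deterministic C++ toy. Everything in it is an assumption for illustration: `Role`, `may_rewrite`, and `rewrite_journal` are hypothetical names, the journal is just a list of strings, and the role guard is only one plausible way to avoid the race. This ticket does not describe what the actual fix does.

```cpp
#include <string>
#include <vector>

enum class Role { Active, StandbyReplay };

// Hypothetical policy: only the active daemon reformats the journal.
bool may_rewrite(Role r) { return r == Role::Active; }

// Toy rewrite: every participating daemon appends each old entry in the
// new format to the same output, taking turns entry by entry.
std::vector<std::string> rewrite_journal(
    const std::vector<std::string>& old_entries,
    const std::vector<Role>& daemons,
    bool guarded) {
  std::vector<std::string> out;
  for (const auto& entry : old_entries) {
    for (Role r : daemons) {
      if (!guarded || may_rewrite(r)) {
        out.push_back("v2:" + entry);
      }
    }
  }
  return out;
}
```

With an active and a standby-replay daemon both rewriting, the unguarded run writes every entry twice, so the result no longer matches the original journal; with the guard only the active daemon writes and the output has one entry per original entry.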

Actions #3

Updated by John Spray almost 10 years ago

  • Status changed from In Progress to Fix Under Review

Actions #5

Updated by John Spray almost 10 years ago

  • Subject changed from MDLog::is_readable() assert to Journal corruption during upgrade to 0.82 with standby-replay daemons

Actions #6

Updated by Greg Farnum over 9 years ago

  • Status changed from Fix Under Review to Resolved

This got fixed 11 days ago, but was never marked closed. Merged in commit:b9463e3497cc1f2a1bab0838430a4402d8c88af0

Actions #7

Updated by Greg Farnum almost 8 years ago

  • Component(FS) MDS added