Bug #8811: Journal corruption during upgrade to 0.82 with standby-replay daemons - CephFS - Ceph

closed

Added by Greg Farnum almost 10 years ago. Updated almost 8 years ago.

Status: Resolved
Priority: Urgent
Assignee: John Spray
Source: Community (user)
Severity: 3 - minor
Component(FS): MDS

Description

Two different ceph-users reports of hitting this issue on v0.82:

     0> 2014-07-09 23:21:43.385274 7fb7f7b83700 -1 mds/MDLog.cc: In function 'void MDLog::_replay_thread()' thread 7fb7f7b83700 time 2014-07-09 23:21:43.383304
mds/MDLog.cc: 815: FAILED assert(journaler->is_readable())

 ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
 1: (MDLog::_replay_thread()+0x197b) [0x85a3cb]
 2: (MDLog::ReplayThread::entry()+0xd) [0x66466d]
 3: (()+0x8062) [0x7fb7ffda1062]
 4: (clone()+0x6d) [0x7fb7feb35a3d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

     0> 2014-07-10 11:35:36.107022 7f45f7c57700 -1 mds/MDLog.cc: In function 'void MDLog::_replay_thread()' thread 7f45f7c57700 time 2014-07-10 11:35:36.103147
mds/MDLog.cc: 815: FAILED assert(journaler->is_readable())

 ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
 1: (MDLog::_replay_thread()+0x197b) [0x85a3cb]
 2: (MDLog::ReplayThread::entry()+0xd) [0x66466d]
 3: (()+0x6b50) [0x7f45ffdd7b50]
 4: (clone()+0x6d) [0x7f45fec000ed]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

I went over the code a little bit and it looks good to me, but we just made the JournalStream changes, so I'm sure that's where the issue is. For context, this MDLog assert follows a loop that waits until the Journaler is readable, so is_readable() appears to be changing its mind between the loop and the assert. Presumably we're incorrectly manipulating the read_buf in some way?

Updated by John Spray almost 10 years ago

Hmmm. Aside from is_readable() giving inconsistent results, it seems like this could also happen if a bug caused read_pos to get ahead of write_pos: the check at the top of the _replay_thread loop is for get_read_pos() < get_write_pos(), but the check right before the assertion is for ==, so a state where read_pos > write_pos is covered by neither.
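
To make the gap between the two checks concrete, here is a minimal toy model of the control flow as described in this comment. It is not the code in mds/MDLog.cc: `Outcome` and `classify` are hypothetical names, and the assumption that both position checks only matter while the journaler is not readable is inferred from the description above.

```cpp
#include <cstdint>

// Hypothetical outcomes for one pass through the replay loop.
enum class Outcome { Wait, CleanEnd, Replay, AssertFails };

// Toy classification of a journaler state (assumed shape, not Ceph's API).
Outcome classify(bool readable, uint64_t read_pos, uint64_t write_pos) {
  if (readable)
    return Outcome::Replay;       // assert(is_readable()) would pass
  if (read_pos < write_pos)
    return Outcome::Wait;         // top-of-loop check: keep waiting for data
  if (read_pos == write_pos)
    return Outcome::CleanEnd;     // check before the assertion: end of journal
  return Outcome::AssertFails;    // read_pos > write_pos: neither check applies
}
```

In this model, `classify(false, 30, 20)` is the only unreadable state that reaches the assertion, which matches the read_pos-ahead-of-write_pos scenario suggested above.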

Updated by John Spray almost 10 years ago

Status changed from New to In Progress

This may be the result of a bug in the journal reformatting that occurs during upgrade, affecting systems using standby-replay MDS daemons. Journal corruption can occur when both an active and a standby-replay daemon attempt to do the rewrite at the same time.
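
One way to avoid this kind of race, sketched here purely as an illustration and not necessarily the fix that was adopted, is to let only the active daemon perform the rewrite. `Role` and `may_rewrite_journal` are hypothetical names that do not come from the Ceph source.

```cpp
// Hypothetical daemon roles relevant to this bug.
enum class Role { Active, StandbyReplay };

// Assumed policy: only the active daemon may reformat an old-format journal;
// a standby-replay daemon never rewrites it.
bool may_rewrite_journal(Role role, bool journal_is_old_format) {
  return role == Role::Active && journal_is_old_format;
}
```

Under this policy there is at most one writer during the upgrade, so two daemons cannot rewrite the journal at the same time.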
