Bug #9482


osd/PGLog.cc: 544: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)

Added by Samuel Just over 9 years ago. Updated over 9 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:
0%

Source:
Q/A
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Backfill peers are not setting info.last_epoch_started, allowing subsequent primaries to erroneously conclude that it is 0.



Actions #1

Updated by Samuel Just over 9 years ago

  • Priority changed from Normal to Urgent

Actually, I'm not sure that's right. Thinking.

Actions #2

Updated by Samuel Just over 9 years ago

Yeah, pretty sure that's right: even if we only find backfill peers, we want to let those determine the min acceptable last_update.

Actions #3

Updated by Samuel Just over 9 years ago

  • Status changed from New to Fix Under Review
  • Assignee set to Samuel Just

Can't set info.last_epoch_started there; going to just use history.last_epoch_started as a lower bound.

Actions #4

Updated by Sage Weil over 9 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Source changed from other to Q/A
Actions #5

Updated by Aaron T over 9 years ago

sjust believes I may have hit the same bug, running 0.80.5. Attached is the log from an OSD with settings:

[osd]
debug osd = 20
debug filestore = 20
debug ms = 1

I'm intending to try building the wip-log-crash-firefly branch to see if I can get any OSDs running again, per sjust's recommendation in IRC.

Actions #6

Updated by Aaron T over 9 years ago

Captured debug log with wip-log-crash-firefly branch and attached.

Actions #7

Updated by Samuel Just over 9 years ago

wip-sam-testing-firefly

Actions #8

Updated by Samuel Just over 9 years ago

I'd need the corresponding logs from osd.5 to be sure, but I believe the problem is that osd.5, due to 9497 and this bug, incorrectly allowed itself to create a new, empty copy of pg 0.9, which is still around even though you've installed the (hopefully) fixed version of the code. You'll need to rename osd.5's copy of 0.9 out of the way and restart osd.5 as well as osd.17. If I'm correct, there should be no objects in the pg 0.9 directory on osd.5.

In case you find other instances of this problem, here is how I worked out that the problem pg is 0.9 on osd 5:

If you go back in the log to the first instance of the assert, you'll find:
2014-09-18 17:53:36.568610 7f65af7fe700 -1 osd/PGLog.cc: In function 'void PGLog::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)' thread 7f65af7fe700 time 2014-09-18 17:53:36.566235
osd/PGLog.cc: 544: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)

From this, we know that thread 7f65af7fe700 asserted. If we follow that thread back in the log, we find:

2014-09-18 17:53:36.566232 7f65af7fe700 10 merge_log log((0'0,0'0], crt=0'0) from osd.5 into log((8718'164941,9381'168561], crt=9381'168561)

as the last thing it did. Note that it is merging log (0'0,0'0], which is the log for an empty pg, into (8718'164941,9381'168561], which would be a log for a very not empty pg! There is no way the empty log would be authoritative given that we have that very not empty log, hence the assert.

If we go back one more line in that thread's output, we see

2014-09-18 17:53:36.566218 7f65af7fe700 10 osd.17 pg_epoch: 9670 pg[0.9( v 9381'168561 (8718'164941,9381'168561] lb 9ec18e09/100000222a8.00000198/head//0 local-les=0 n=24426 ec=1 les/c 9578/9002 9652/9663/9627) [17,11,5]/[5,11] r=-1 lpr=9663 pi=9001-9662/20 crt=9381'168561 lcod 0'0 remapped NOTIFY] state<Started/Stray>: got info+log from osd.5 0.9( empty local-les=9670 n=0 ec=1 les/c 9578/9002 9652/9663/9627) log((0'0,0'0], crt=0'0)

which tells us that we are dealing with pg 0.9 and a log sent from osd.5.

Note that messing with osd stores is always a bit of a risky endeavor. Make sure to stop the osd before messing with the on-disk pgs. Good luck!

Actions #9

Updated by Aaron T over 9 years ago

Samuel Just wrote:

[...]

Thanks for the pointers. I tried moving 0.9_head out of the way on osd.5, but osd.5 then refused to boot, asserting on a check because the directory didn't exist. I put the empty directory back, restarted all the OSDs, and the cluster eventually got clean in spite of the crash (several assertions and OSD restarts along the way).

I kept all the logs, but they are all many GiB, so if you need any more to help resolve this bug please let me know and I'll try to provide them. (I don't have osd.5's log with debugging from the time of the event we're discussing, however.)

Actions #10

Updated by Samuel Just over 9 years ago

  • Status changed from Pending Backport to Resolved
