Bug #9482


osd/PGLog.cc: 544: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)

Added by Samuel Just over 9 years ago. Updated over 9 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:
0%

Source:
Q/A
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Backfill peers are not setting info.last_epoch_started, allowing subsequent primaries to erroneously conclude that it is 0.



Actions #1

Updated by Samuel Just over 9 years ago

  • Priority changed from Normal to Urgent

Actually, I'm not sure that's right. Thinking.

Actions #2

Updated by Samuel Just over 9 years ago

Yeah, pretty sure that's right: even if we only find backfill peers, we want to let those determine the min acceptable last_update.

Actions #3

Updated by Samuel Just over 9 years ago

  • Status changed from New to Fix Under Review
  • Assignee set to Samuel Just

Can't set info.last_epoch_started there; going to just use history.last_epoch_started as a lower bound.

Actions #4

Updated by Sage Weil over 9 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Source changed from other to Q/A
Actions #5

Updated by Aaron T over 9 years ago

sjust believes I may have hit the same bug, running 0.80.5. Attached is the log from an OSD with settings:

[osd]
debug osd = 20
debug filestore = 20
debug ms = 1

I'm intending to try building the wip-log-crash-firefly branch to see if I can get any OSDs running again, per sjust's recommendation in IRC.

Actions #6

Updated by Aaron T over 9 years ago

Captured debug log with wip-log-crash-firefly branch and attached.

Actions #7

Updated by Samuel Just over 9 years ago

wip-sam-testing-firefly

Actions #8

Updated by Samuel Just over 9 years ago

I'd need the corresponding logs from osd.5 to be sure, but I believe the problem is that osd.5, due to 9497 and this bug, incorrectly allowed itself to create a new, empty copy of pg 0.9, which is still around even though you've installed the (hopefully) fixed version of the code. You'll need to rename osd.5's copy of 0.9 out of the way and restart osd.5 as well as osd.17. If I'm correct, there should be no objects in the pg 0.9 directory on osd.5.

In case you find other instances of this problem, here is how I worked out that the problem pg is 0.9 on osd 5:

If you go back in the log to the first instance of the assert, you'll find:
2014-09-18 17:53:36.568610 7f65af7fe700 -1 osd/PGLog.cc: In function 'void PGLog::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)' thread 7f65af7fe700 time 2014-09-18 17:53:36.566235
osd/PGLog.cc: 544: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)

From this, we know that thread 7f65af7fe700 asserted. If we follow that thread back in the log, we find:

2014-09-18 17:53:36.566232 7f65af7fe700 10 merge_log log((0'0,0'0], crt=0'0) from osd.5 into log((8718'164941,9381'168561], crt=9381'168561)

as the last thing it did. Note that it is merging log (0'0,0'0], which is the log for an empty pg, into (8718'164941,9381'168561], which would be a log for a very not empty pg! There is no way the empty log would be authoritative given that we have that very not empty log, hence the assert.

If we go back one more line in that thread's output, we see

2014-09-18 17:53:36.566218 7f65af7fe700 10 osd.17 pg_epoch: 9670 pg[0.9( v 9381'168561 (8718'164941,9381'168561] lb 9ec18e09/100000222a8.00000198/head//0 local-les=0 n=24426 ec=1 les/c 9578/9002 9652/9663/9627) [17,11,5]/[5,11] r=-1 lpr=9663 pi=9001-9662/20 crt=9381'168561 lcod 0'0 remapped NOTIFY] state<Started/Stray>: got info+log from osd.5 0.9( empty local-les=9670 n=0 ec=1 les/c 9578/9002 9652/9663/9627) log((0'0,0'0], crt=0'0)

which tells us that we are dealing with pg 0.9 and a log sent from osd.5.

Note that messing with osd stores is always a bit of a risky endeavor. Make sure to stop the osd before messing with the on-disk pgs. Good luck!

Actions #9

Updated by Aaron T over 9 years ago

Samuel Just wrote:

[...]

Thanks for the pointers. I tried moving 0.9_head out of the way on osd.5, but osd.5 then refused to boot, asserting on a check because the directory didn't exist. I put the empty directory back, restarted all the OSDs, and the cluster eventually got clean in spite of the crash (several assertions and OSD restarts along the way).

I kept all the logs, but they are all many GiB, so if you need any more to help resolve this bug please let me know and I'll try to provide them. (I don't have osd.5's log with debugging from the time of the event we're discussing, however.)

Actions #10

Updated by Samuel Just over 9 years ago

  • Status changed from Pending Backport to Resolved
