Bug #1502

osd FAILED assert(pg->log.tail <= pg->info.last_complete || pg->log.backlog)

Added by John Leach almost 6 years ago. Updated almost 6 years ago.

Status: Resolved
Priority: High
Assignee: -
Category: OSD
Target version:
Start date: 09/04/2011
Due date:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
Needs Doc: No

Description

I had a 4-OSD cluster. As a test, I killed one cosd process with kill -9. It was detected as failed, the cluster became degraded, and it started rebuilding.

I then added a 5th OSD node to the cluster, at which point (though it's hard to tell exactly when) the whole cluster fell apart. I think a couple of cosd processes crashed, in some cases probably due to huge RAM inflation and the OOM killer. Restarting them gave loads of errors like this:

2011-09-04 18:02:23.916216 7f1d12bc8720 osd2 152 pg[1.45( v 24'57 lc 0'0 (4'55,24'57]+backlog n=47 ec=1 les/c 113/121 149/149/149) [] r=0 (info mismatch, log(4'55,0'0]+backlog) (log bound mismatch, empty) mlcod 0'0 inactive] Got exception 'read_log_error: Could not find hash for hoid 20000000465.00000000/head
' while reading log. Moving corrupted log file to 'corrupt_log_2011-09-04_18:02_1.45' for later analysis.
2011-09-04 18:02:23.968465 7f1d12bc8720 osd2 152 pg[0.6f( v 4'803 lc 0'0 (4'801,4'803]+backlog n=801 ec=1 les/c 128/53 149/149/54) [] r=0 (info mismatch, log(4'801,0'0]+backlog) (log bound mismatch, empty) mlcod 0'0 inactive] Got exception 'read_log_error: Could not find hash for hoid 10000000024.00000000/head

(These servers were not rebooted, so no unflushed writes should have been lost.)

I've attached a log from one of the OSDs that crashed. There is no evidence that this one was OOM-killed, but two others definitely had processes killed by the OOM killer.

This server now crashes with the failed assert every time I start it. It looks like the data on disk is corrupt (but again, there is no evidence of filesystem corruption here).
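For readers unfamiliar with the assert in the title, here is a minimal Python sketch of the condition it checks. It is not the Ceph code: `EVersion` and `log_is_consistent` are hypothetical names, and the sketch assumes that versions printed as `epoch'version` compare first by epoch and then by version. The sample values are taken from the pg 1.45 log line above (`lc 0'0`, log tail `4'55`).

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class EVersion:
    """Hypothetical model of a version printed as epoch'version, e.g. 24'57."""
    epoch: int
    version: int

    @classmethod
    def parse(cls, text):
        # Split "24'57" into its epoch and version parts.
        epoch, version = text.split("'")
        return cls(int(epoch), int(version))


def log_is_consistent(log_tail, last_complete, has_backlog):
    """Mirror of the asserted condition: tail <= last_complete || backlog."""
    return log_tail <= last_complete or has_backlog


tail = EVersion.parse("4'55")
last_complete = EVersion.parse("0'0")

# With a backlog the condition holds even though the tail is past last_complete.
print(log_is_consistent(tail, last_complete, has_backlog=True))
# Without a backlog the same values violate the condition.
print(log_is_consistent(tail, last_complete, has_backlog=False))
```

Under these assumptions, a PG whose log tail is ahead of `last_complete` only passes the check while it still has a backlog, which would be consistent with the assert firing once the log has been moved aside as corrupt.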

The crashes occurred with packages built from git commit e8b12d80b5, but I've since upgraded to commit 933e7945a.

osd.2.log.gz (1.14 MB) John Leach, 09/04/2011 11:26 AM

Associated revisions

Revision 405abf5a (diff)
Added by Samuel Just almost 6 years ago

PG: generate backlog when confronted with corrupt log

Currently we throw out the log and start up anyway. With this change, we
would throw out the log, generate a fresh backlog, and then start up.
That may not be the best possible thing, but it's better than what we
currently do. Indirectly fixes #1502.

Signed-off-by: Samuel Just <>
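The behavior change described in the commit message can be illustrated with a small Python sketch. This is only a model of the policy, not the actual patch: `PGLog`, `ReadLogError`, `load_log`, and the `regenerate_backlog` flag are hypothetical names, and "generating a backlog" is approximated here as listing one entry per object found on disk.

```python
from dataclasses import dataclass, field


class ReadLogError(Exception):
    """Hypothetical stand-in for the read_log_error seen in the OSD log."""


@dataclass
class PGLog:
    """Toy PG log: a list of entries plus a flag saying a backlog exists."""
    entries: list = field(default_factory=list)
    backlog: bool = False


def load_log(read_log, objects_on_disk, regenerate_backlog):
    """Load a PG log, falling back to an empty log if it is corrupt.

    regenerate_backlog=False models the old behavior (discard and start).
    regenerate_backlog=True models the new behavior (discard, rebuild a
    backlog from the objects on disk, then start).
    """
    try:
        return read_log()
    except ReadLogError:
        log = PGLog()  # the corrupt log is thrown out in both cases
        if regenerate_backlog:
            log.entries = sorted(objects_on_disk)
            log.backlog = True
        return log


def corrupt_reader():
    # Simulates a log that cannot be read back from disk.
    raise ReadLogError("simulated corrupt log")


objects = ["20000000465.00000000/head", "10000000024.00000000/head"]
old = load_log(corrupt_reader, objects, regenerate_backlog=False)
new = load_log(corrupt_reader, objects, regenerate_backlog=True)
print(old.backlog, len(old.entries))
print(new.backlog, len(new.entries))
```

In this model the old path starts up with an empty log and no backlog, while the new path starts up with the backlog flag set, which is the state the failing assert accepts.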

History

#1 Updated by Sage Weil almost 6 years ago

  • Priority changed from Normal to High
  • Target version set to v0.36

#2 Updated by Greg Farnum almost 6 years ago

Following master has been more dangerous than normal lately due to the on-disk format change coming up. :/ But I expect Sam will know what to do with this on Tuesday (Labor Day for us tomorrow).

#4 Updated by Sage Weil almost 6 years ago

  • Status changed from New to Resolved
