osd FAILED assert(pg->log.tail <= pg->info.last_complete || pg->log.backlog)
I had a 4-OSD cluster. As a test, I killed one cosd process with kill -9. It was detected as failed, and the cluster became degraded and started rebuilding.
I then added a 5th OSD node to the cluster, at which point (though it's hard to tell exactly when) the whole cluster fell apart. I think a couple of cosd processes crashed (in some cases, I think, due to huge RAM inflation and the OOM killer). Restarting them gave loads of errors like this:
2011-09-04 18:02:23.916216 7f1d12bc8720 osd2 152 pg[1.45( v 24'57 lc 0'0 (4'55,24'57]+backlog n=47 ec=1 les/c 113/121 149/149/149)  r=0 (info mismatch, log(4'55,0'0]+backlog) (log bound mismatch, empty) mlcod 0'0 inactive] Got exception 'read_log_error: Could not find hash for hoid 20000000465.00000000/head ' while reading log. Moving corrupted log file to 'corrupt_log_2011-09-04_18:02_1.45' for later analysis.
2011-09-04 18:02:23.968465 7f1d12bc8720 osd2 152 pg[0.6f( v 4'803 lc 0'0 (4'801,4'803]+backlog n=801 ec=1 les/c 128/53 149/149/54)  r=0 (info mismatch, log(4'801,0'0]+backlog) (log bound mismatch, empty) mlcod 0'0 inactive] Got exception 'read_log_error: Could not find hash for hoid 10000000024.00000000/head
(These servers were not rebooted, so no unflushed writes should have been lost.)
I've attached a log from one of the OSDs that crashed. There is no evidence that this one was OOM-killed, but two others definitely had processes killed by the OOM killer.
This server now crashes with the failed assert every time I start it. It looks like the data on disk is corrupt (but again, there is no evidence of filesystem corruption).
The crashes occurred with packages built from git commit e8b12d80b5, but I've since upgraded to commit 933e7945a.
PG: generate backlog when confronted with corrupt log
Currently we throw out the log and start up anyway. With this change, we
would throw out the log, generate a fresh backlog, and then start up.
That may not be the best possible thing, but it's better than what we
currently do. Indirectly fixes #1502.
Signed-off-by: Samuel Just <firstname.lastname@example.org>
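
For readers following the ticket, the behaviour change described in the commit message can be pictured with a small toy model. This is only a sketch in Python, not the actual Ceph code: the class `PG`, the helper `load_pg`, the `quarantined` attribute, and the way a backlog is represented are all hypothetical names invented for illustration. It assumes a corrupt on-disk log surfaces as an exception (like the `read_log_error` in the log output above) and that a backlog can be rebuilt from the objects the PG actually holds.

```python
class ReadLogError(Exception):
    """Hypothetical stand-in for the read_log_error seen in the OSD log."""


class PG:
    """Toy placement group; every field here is illustrative only."""

    def __init__(self, pgid, on_disk_log):
        self.pgid = pgid
        self.on_disk_log = on_disk_log  # None models a corrupt log file
        self.log = []
        self.backlog = False
        self.quarantined = None  # name the corrupt log was moved to, if any

    def read_log(self):
        if self.on_disk_log is None:
            raise ReadLogError("Could not find hash for hoid")
        self.log = list(self.on_disk_log)

    def generate_backlog(self, objects):
        # Rebuild log entries from the objects actually stored in the PG.
        self.log = [("backlog", name) for name in sorted(objects)]
        self.backlog = True


def load_pg(pg, objects):
    """Load a PG; on a corrupt log, discard it and generate a fresh backlog."""
    try:
        pg.read_log()
    except ReadLogError:
        # Old behaviour (per the commit message): throw out the log and
        # start up anyway. New behaviour: also generate a fresh backlog.
        pg.quarantined = f"corrupt_log_{pg.pgid}"
        pg.log = []
        pg.generate_backlog(objects)
    return pg


if __name__ == "__main__":
    recovered = load_pg(PG("1.45", None), {"obj_b", "obj_a"})
    print(recovered.backlog, recovered.log)
```

In this model, a PG whose log was discarded ends up with `backlog` set rather than with an empty log and no backlog, which is the kind of state the assert in the ticket title checks for.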