Bug #3689
closed
osd: bad peering state machine event with mixed v0.52 and next cluster
% Done: 0%
Source: Community (user)
Description
Reported by mgalkiewicz in #ceph. The crash seen when running v0.55 is at https://gist.github.com/raw/4393494/f3ae88406350b74ac6d608b8b75960f85435e85e/gistfile1.txt; the result is the same on next.
Updated by Maciej Galkiewicz over 11 years ago
Log from the crashing OSD at a higher debug level: https://dl.dropbox.com/u/5820195/ceph-osd.1.log.gz
Updated by Sage Weil over 11 years ago
This looks like a compatibility issue with recovery queueing:
2012-12-28 02:20:34.799511 7f1da135c700 20 osd.1 pg_epoch: 604 pg[167.2( v 558'1628 (547'628,558'1628] lb 48d305c2/rb.0.2acc.5d4adc0a.00000000027d/head//167 local-les=589 n=2 ec=480 les/c 600/600 601/601/527) [2,1]/[2,1,0] r=1 lpr=601 pi=480-600/22 lcod 0'0 remapped NOTIFY] enter Started/ReplicaActive
2012-12-28 02:20:34.799523 7f1da135c700 20 osd.1 pg_epoch: 604 pg[167.2( v 558'1628 (547'628,558'1628] lb 48d305c2/rb.0.2acc.5d4adc0a.00000000027d/head//167 local-les=589 n=2 ec=480 les/c 600/600 601/601/527) [2,1]/[2,1,0] r=1 lpr=601 pi=480-600/22 lcod 0'0 remapped NOTIFY] enter Started/ReplicaActive/RepNotRecovering
2012-12-28 02:20:34.799536 7f1da135c700 10 osd.1 pg_epoch: 604 pg[167.2( v 558'1628 (547'628,558'1628] lb 48d305c2/rb.0.2acc.5d4adc0a.00000000027d/head//167 local-les=589 n=2 ec=480 les/c 600/600 601/601/527) [2,1]/[2,1,0] r=1 lpr=601 pi=480-600/22 lcod 0'0 remapped NOTIFY] state<Started/ReplicaActive>: In ReplicaActive, about to call activate
...
2012-12-28 02:20:50.775168 7f1da135c700 10 osd.1 pg_epoch: 606 pg[167.2( v 558'1628 (547'628,558'1628] lb f45097ea/rb.0.2acc.5d4adc0a.00000000009f/head//167 local-les=604 n=4 ec=480 les/c 604/600 601/601/527) [2,1]/[2,1,0] r=1 lpr=601 pi=480-600/22 luod=0'0 active+remapped] do_backfill pg_backfill(finish 167.2 e 606/606 lb MAX) v1
2012-12-28 02:20:50.775218 7f1da135c700 20 osd.1 pg_epoch: 606 pg[167.2( v 558'1628 (547'628,558'1628] local-les=604 n=5 ec=480 les/c 604/600 601/601/527) [2,1]/[2,1,0] r=1 lpr=601 pi=480-600/22 luod=0'0 active+remapped] write_info bigbl 1398
2012-12-28 02:20:50.775259 7f1da135c700 10 osd.1 pg_epoch: 606 pg[167.2( v 558'1628 (547'628,558'1628] local-les=604 n=5 ec=480 les/c 604/600 601/601/527) [2,1]/[2,1,0] r=1 lpr=601 pi=480-600/22 luod=0'0 active+remapped] handle_peering_event: epoch_sent: 606 epoch_requested: 606 RecoveryDone
2012-12-28 02:20:50.775273 7f1da135c700 20 osd.1 pg_epoch: 606 pg[167.2( v 558'1628 (547'628,558'1628] local-les=604 n=5 ec=480 les/c 604/600 601/601/527) [2,1]/[2,1,0] r=1 lpr=601 pi=480-600/22 luod=0'0 active+remapped] exit Started/ReplicaActive/RepNotRecovering 15.975750 8 0.000437
2012-12-28 02:20:50.775286 7f1da135c700 20 osd.1 pg_epoch: 606 pg[167.2( v 558'1628 (547'628,558'1628] local-les=604 n=5 ec=480 les/c 604/600 601/601/527) [2,1]/[2,1,0] r=1 lpr=601 pi=480-600/22 luod=0'0 active+remapped] exit Started/ReplicaActive 15.975775 0 0.000000
2012-12-28 02:20:50.775296 7f1da135c700 20 osd.1 pg_epoch: 606 pg[167.2( v 558'1628 (547'628,558'1628] local-les=604 n=5 ec=480 les/c 604/600 601/601/527) [2,1]/[2,1,0] r=1 lpr=601 pi=480-600/22 luod=0'0 active+remapped] exit Started 23.020514 0 0.000000
2012-12-28 02:20:50.775304 7f1da135c700 20 osd.1 pg_epoch: 606 pg[167.2( v 558'1628 (547'628,558'1628] local-les=604 n=5 ec=480 les/c 604/600 601/601/527) [2,1]/[2,1,0] r=1 lpr=601 pi=480-600/22 luod=0'0 active+remapped] enter Crashed
presumably because the primary (v0.52) isn't doing the recovery reservation and twiddling the replica's state machine properly.
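For anyone triaging a similar log, the sequence above (replica sits in RepNotRecovering, receives RecoveryDone, then enters Crashed) can be pulled out of a debug log mechanically. The following is a rough sketch only, not a Ceph tool: the regular expressions are assumptions derived solely from the lines quoted in this ticket, the function names `summarize` and `event_before_crash` are hypothetical, and the sample lines are abridged copies of the excerpt with the pg[...] details shortened.

```python
import re

# Assumed patterns, inferred only from the log lines quoted in this ticket.
# A state transition appears right after the closing "]" of the pg[...] block.
TRANSITION = re.compile(r"\] (enter|exit) ([A-Za-z/]+)")
# A peering event name is the last token after the two epoch fields.
EVENT = re.compile(
    r"handle_peering_event: epoch_sent: \d+ epoch_requested: \d+ (\w+)"
)


def summarize(lines):
    """Return (kind, name) tuples for state transitions and peering events."""
    out = []
    for line in lines:
        m = TRANSITION.search(line)
        if m:
            out.append((m.group(1), m.group(2)))
            continue
        m = EVENT.search(line)
        if m:
            out.append(("event", m.group(1)))
    return out


def event_before_crash(lines):
    """Return (last event, last state entered) seen before 'enter Crashed'."""
    last_event = None
    last_state = None
    for kind, name in summarize(lines):
        if kind == "event":
            last_event = name
        elif kind == "enter":
            if name == "Crashed":
                return last_event, last_state
            last_state = name
    return None  # no crash transition found


# Abridged sample: same timestamps and messages as the excerpt, with the
# contents of pg[...] shortened for readability.
SAMPLE = [
    "2012-12-28 02:20:34.799523 7f1da135c700 20 osd.1 pg_epoch: 604 "
    "pg[167.2( remapped NOTIFY] enter Started/ReplicaActive/RepNotRecovering",
    "2012-12-28 02:20:50.775259 7f1da135c700 10 osd.1 pg_epoch: 606 "
    "pg[167.2( active+remapped] handle_peering_event: epoch_sent: 606 "
    "epoch_requested: 606 RecoveryDone",
    "2012-12-28 02:20:50.775273 7f1da135c700 20 osd.1 pg_epoch: 606 "
    "pg[167.2( active+remapped] exit Started/ReplicaActive/RepNotRecovering "
    "15.975750 8 0.000437",
    "2012-12-28 02:20:50.775304 7f1da135c700 20 osd.1 pg_epoch: 606 "
    "pg[167.2( active+remapped] enter Crashed",
]

if __name__ == "__main__":
    print(event_before_crash(SAMPLE))
```

On the abridged sample this reports RecoveryDone as the last event and Started/ReplicaActive/RepNotRecovering as the last state entered before Crashed, matching the reading of the log given above. Other Ceph versions may format these lines differently, so the patterns would need checking against the actual log.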