Bug #3689: osd: bad peering state machine event with mixed v0.52 and next cluster

closed

Added by Sage Weil over 11 years ago. Updated over 11 years ago.

Status:

Resolved

Priority:

Urgent

Assignee:

Sage Weil

Category:

OSD

Source:

Community (user)

Description

Reported by mgalkiewicz in #ceph. The crash seen when running v0.55 is at https://gist.github.com/raw/4393494/f3ae88406350b74ac6d608b8b75960f85435e85e/gistfile1.txt; the result is the same on next.

Updated by Maciej Galkiewicz over 11 years ago

Log from the crashing OSD at a higher debug level: https://dl.dropbox.com/u/5820195/ceph-osd.1.log.gz

Updated by Sage Weil over 11 years ago

This looks like a compatibility issue with recovery queueing:

2012-12-28 02:20:34.799511 7f1da135c700 20 osd.1 pg_epoch: 604 pg[167.2( v 558'1628 (547'628,558'1628] lb 48d305c2/rb.0.2acc.5d4adc0a.00000000027d/head//167 local-les=589 n=2 ec=480 les/c 600/600 601/601/527) [2,1]/[2,1,0] r=1 lpr=601 pi=480-600/22 lcod 0'0 remapped NOTIFY] enter Started/ReplicaActive
2012-12-28 02:20:34.799523 7f1da135c700 20 osd.1 pg_epoch: 604 pg[167.2( v 558'1628 (547'628,558'1628] lb 48d305c2/rb.0.2acc.5d4adc0a.00000000027d/head//167 local-les=589 n=2 ec=480 les/c 600/600 601/601/527) [2,1]/[2,1,0] r=1 lpr=601 pi=480-600/22 lcod 0'0 remapped NOTIFY] enter Started/ReplicaActive/RepNotRecovering
2012-12-28 02:20:34.799536 7f1da135c700 10 osd.1 pg_epoch: 604 pg[167.2( v 558'1628 (547'628,558'1628] lb 48d305c2/rb.0.2acc.5d4adc0a.00000000027d/head//167 local-les=589 n=2 ec=480 les/c 600/600 601/601/527) [2,1]/[2,1,0] r=1 lpr=601 pi=480-600/22 lcod 0'0 remapped NOTIFY] state<Started/ReplicaActive>: In ReplicaActive, about to call activate

...

2012-12-28 02:20:50.775168 7f1da135c700 10 osd.1 pg_epoch: 606 pg[167.2( v 558'1628 (547'628,558'1628] lb f45097ea/rb.0.2acc.5d4adc0a.00000000009f/head//167 local-les=604 n=4 ec=480 les/c 604/600 601/601/527) [2,1]/[2,1,0] r=1 lpr=601 pi=480-600/22 luod=0'0 active+remapped] do_backfill pg_backfill(finish 167.2 e 606/606 lb MAX) v1
2012-12-28 02:20:50.775218 7f1da135c700 20 osd.1 pg_epoch: 606 pg[167.2( v 558'1628 (547'628,558'1628] local-les=604 n=5 ec=480 les/c 604/600 601/601/527) [2,1]/[2,1,0] r=1 lpr=601 pi=480-600/22 luod=0'0 active+remapped] write_info bigbl 1398
2012-12-28 02:20:50.775259 7f1da135c700 10 osd.1 pg_epoch: 606 pg[167.2( v 558'1628 (547'628,558'1628] local-les=604 n=5 ec=480 les/c 604/600 601/601/527) [2,1]/[2,1,0] r=1 lpr=601 pi=480-600/22 luod=0'0 active+remapped] handle_peering_event: epoch_sent: 606 epoch_requested: 606 RecoveryDone
2012-12-28 02:20:50.775273 7f1da135c700 20 osd.1 pg_epoch: 606 pg[167.2( v 558'1628 (547'628,558'1628] local-les=604 n=5 ec=480 les/c 604/600 601/601/527) [2,1]/[2,1,0] r=1 lpr=601 pi=480-600/22 luod=0'0 active+remapped] exit Started/ReplicaActive/RepNotRecovering 15.975750 8 0.000437
2012-12-28 02:20:50.775286 7f1da135c700 20 osd.1 pg_epoch: 606 pg[167.2( v 558'1628 (547'628,558'1628] local-les=604 n=5 ec=480 les/c 604/600 601/601/527) [2,1]/[2,1,0] r=1 lpr=601 pi=480-600/22 luod=0'0 active+remapped] exit Started/ReplicaActive 15.975775 0 0.000000
2012-12-28 02:20:50.775296 7f1da135c700 20 osd.1 pg_epoch: 606 pg[167.2( v 558'1628 (547'628,558'1628] local-les=604 n=5 ec=480 les/c 604/600 601/601/527) [2,1]/[2,1,0] r=1 lpr=601 pi=480-600/22 luod=0'0 active+remapped] exit Started 23.020514 0 0.000000
2012-12-28 02:20:50.775304 7f1da135c700 20 osd.1 pg_epoch: 606 pg[167.2( v 558'1628 (547'628,558'1628] local-les=604 n=5 ec=480 les/c 604/600 601/601/527) [2,1]/[2,1,0] r=1 lpr=601 pi=480-600/22 luod=0'0 active+remapped] enter Crashed

The replica gets RecoveryDone while still in RepNotRecovering and drops into Crashed, presumably because the primary (v0.52) isn't doing the recovery reservation and twiddling the replica's state machine properly.
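
To make the suspected failure mode concrete, here is a toy model of the replica side, not the actual OSD code. Only the names RepNotRecovering, RecoveryDone and Crashed come from the log above; the ReserveRecovery event and the RepRecovering state are assumptions made for illustration, as is the rule that any unhandled event lands in Crashed.

```python
# Toy model of the replica-side state machine. This is a sketch, not Ceph code.
CRASHED = "Crashed"

# (state, event) -> next state.
# "RepNotRecovering", "RecoveryDone" and "Crashed" appear in the log;
# "ReserveRecovery" and "RepRecovering" are hypothetical names used here
# to stand in for the reservation handshake a newer primary would perform.
TRANSITIONS = {
    ("RepNotRecovering", "ReserveRecovery"): "RepRecovering",
    ("RepRecovering", "RecoveryDone"): "RepNotRecovering",
}


def step(state, event):
    """Return the next state; an event the current state does not handle is fatal."""
    return TRANSITIONS.get((state, event), CRASHED)


def run(events, state="RepNotRecovering"):
    """Feed events in order and stop as soon as the machine crashes."""
    for event in events:
        state = step(state, event)
        if state == CRASHED:
            break
    return state


# A primary that reserves first brings the replica back to RepNotRecovering.
new_primary = run(["ReserveRecovery", "RecoveryDone"])

# A primary that skips the reservation sends RecoveryDone to a replica that
# never left RepNotRecovering, which matches the sequence in the log.
old_primary = run(["RecoveryDone"])
```

In this model the second run ends in Crashed for the same reason the log does: RecoveryDone arrives in a state that has no transition for it.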
