Bug #3689

closed

osd: bad peering state machine event with mixed v0.52 and next cluster

Added by Sage Weil over 11 years ago. Updated over 11 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: OSD
Target version: -
% Done: 0%
Source: Community (user)

Description

Reported by mgalkiewicz in #ceph. The crash seen when running v0.55 is captured at https://gist.github.com/raw/4393494/f3ae88406350b74ac6d608b8b75960f85435e85e/gistfile1.txt; the next branch crashes the same way.

Actions #1

Updated by Maciej Galkiewicz over 11 years ago

Log from the crashing OSD at a higher debug level: https://dl.dropbox.com/u/5820195/ceph-osd.1.log.gz
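
To follow the peering state machine in a debug log like this one, it can help to filter out just the enter/exit lines. The following is a minimal sketch, not a Ceph tool: the regular expression assumes log lines shaped like the excerpts quoted in this report, and the names TRANSITION and transitions are hypothetical.

```python
import re

# Assumed line shape, based only on the excerpts quoted in this report:
#   <date> <time> <thread> <level> osd.N pg_epoch: E pg[...] enter|exit <state> ...
TRANSITION = re.compile(
    r"^(?P<ts>\S+ \S+) \S+ +\d+ (?P<osd>osd\.\d+) pg_epoch: (?P<epoch>\d+) "
    r".*\] (?P<kind>enter|exit) (?P<state>\S+)"
)


def transitions(lines):
    """Yield (timestamp, epoch, kind, state) for each state enter/exit line."""
    for line in lines:
        m = TRANSITION.match(line)
        if m:
            yield m["ts"], int(m["epoch"]), m["kind"], m["state"]
```

Any iterable of text lines can be passed in; for a gzip-compressed log, a file object opened with gzip.open(path, "rt") works. Lines that do not record a state entry or exit (for example handle_peering_event lines) are skipped.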

Actions #2

Updated by Sage Weil over 11 years ago

This looks like a compatibility issue with recovery queueing:

2012-12-28 02:20:34.799511 7f1da135c700 20 osd.1 pg_epoch: 604 pg[167.2( v 558'1628 (547'628,558'1628] lb 48d305c2/rb.0.2acc.5d4adc0a.00000000027d/head//167 local-les=589 n=2 ec=480 les/c 600/600 601/601/527) [2,1]/[2,1,0] r=1 lpr=601 pi=480-600/22 lcod 0'0 remapped NOTIFY] enter Started/ReplicaActive
2012-12-28 02:20:34.799523 7f1da135c700 20 osd.1 pg_epoch: 604 pg[167.2( v 558'1628 (547'628,558'1628] lb 48d305c2/rb.0.2acc.5d4adc0a.00000000027d/head//167 local-les=589 n=2 ec=480 les/c 600/600 601/601/527) [2,1]/[2,1,0] r=1 lpr=601 pi=480-600/22 lcod 0'0 remapped NOTIFY] enter Started/ReplicaActive/RepNotRecovering
2012-12-28 02:20:34.799536 7f1da135c700 10 osd.1 pg_epoch: 604 pg[167.2( v 558'1628 (547'628,558'1628] lb 48d305c2/rb.0.2acc.5d4adc0a.00000000027d/head//167 local-les=589 n=2 ec=480 les/c 600/600 601/601/527) [2,1]/[2,1,0] r=1 lpr=601 pi=480-600/22 lcod 0'0 remapped NOTIFY] state<Started/ReplicaActive>: In ReplicaActive, about to call activate

...

2012-12-28 02:20:50.775168 7f1da135c700 10 osd.1 pg_epoch: 606 pg[167.2( v 558'1628 (547'628,558'1628] lb f45097ea/rb.0.2acc.5d4adc0a.00000000009f/head//167 local-les=604 n=4 ec=480 les/c 604/600 601/601/527) [2,1]/[2,1,0] r=1 lpr=601 pi=480-600/22 luod=0'0 active+remapped] do_backfill pg_backfill(finish 167.2 e 606/606 lb MAX) v1
2012-12-28 02:20:50.775218 7f1da135c700 20 osd.1 pg_epoch: 606 pg[167.2( v 558'1628 (547'628,558'1628] local-les=604 n=5 ec=480 les/c 604/600 601/601/527) [2,1]/[2,1,0] r=1 lpr=601 pi=480-600/22 luod=0'0 active+remapped] write_info bigbl 1398
2012-12-28 02:20:50.775259 7f1da135c700 10 osd.1 pg_epoch: 606 pg[167.2( v 558'1628 (547'628,558'1628] local-les=604 n=5 ec=480 les/c 604/600 601/601/527) [2,1]/[2,1,0] r=1 lpr=601 pi=480-600/22 luod=0'0 active+remapped] handle_peering_event: epoch_sent: 606 epoch_requested: 606 RecoveryDone
2012-12-28 02:20:50.775273 7f1da135c700 20 osd.1 pg_epoch: 606 pg[167.2( v 558'1628 (547'628,558'1628] local-les=604 n=5 ec=480 les/c 604/600 601/601/527) [2,1]/[2,1,0] r=1 lpr=601 pi=480-600/22 luod=0'0 active+remapped] exit Started/ReplicaActive/RepNotRecovering 15.975750 8 0.000437
2012-12-28 02:20:50.775286 7f1da135c700 20 osd.1 pg_epoch: 606 pg[167.2( v 558'1628 (547'628,558'1628] local-les=604 n=5 ec=480 les/c 604/600 601/601/527) [2,1]/[2,1,0] r=1 lpr=601 pi=480-600/22 luod=0'0 active+remapped] exit Started/ReplicaActive 15.975775 0 0.000000
2012-12-28 02:20:50.775296 7f1da135c700 20 osd.1 pg_epoch: 606 pg[167.2( v 558'1628 (547'628,558'1628] local-les=604 n=5 ec=480 les/c 604/600 601/601/527) [2,1]/[2,1,0] r=1 lpr=601 pi=480-600/22 luod=0'0 active+remapped] exit Started 23.020514 0 0.000000
2012-12-28 02:20:50.775304 7f1da135c700 20 osd.1 pg_epoch: 606 pg[167.2( v 558'1628 (547'628,558'1628] local-les=604 n=5 ec=480 les/c 604/600 601/601/527) [2,1]/[2,1,0] r=1 lpr=601 pi=480-600/22 luod=0'0 active+remapped] enter Crashed

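Presumably this is because the primary (v0.52) does not perform the recovery reservation and does not drive the replica's state machine properly: the replica receives RecoveryDone while still in RepNotRecovering and enters Crashed.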
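
The failure mode in the log above can be illustrated with a toy hierarchical state machine. This is only a sketch under stated assumptions, not Ceph's implementation: the state names Started, ReplicaActive, RepNotRecovering and Crashed and the RecoveryDone event come from the log, while the RepRecovering state, the ReservationGranted event, and the transition table are assumptions made for illustration.

```python
# Toy model of a hierarchical state machine; NOT Ceph's actual code.
# Parent of each state (None for top-level states).
PARENT = {
    "Started": None,
    "Started/ReplicaActive": "Started",
    "Started/ReplicaActive/RepNotRecovering": "Started/ReplicaActive",
    "Started/ReplicaActive/RepRecovering": "Started/ReplicaActive",  # assumed state
    "Crashed": None,
}

# (state, event) -> next state. Assumed for illustration: RecoveryDone is
# only handled once a reservation step has moved the replica to RepRecovering.
# "ReservationGranted" is a hypothetical event name.
HANDLERS = {
    ("Started/ReplicaActive/RepNotRecovering", "ReservationGranted"):
        "Started/ReplicaActive/RepRecovering",
    ("Started/ReplicaActive/RepRecovering", "RecoveryDone"):
        "Started/ReplicaActive/RepNotRecovering",
}


def dispatch(state, event):
    """Return (next_state, trace); unhandled events unwind to Crashed."""
    # Look for a handler in the current state, then in each enclosing state.
    node = state
    while node is not None:
        target = HANDLERS.get((node, event))
        if target is not None:
            return target, [f"exit {state}", f"enter {target}"]
        node = PARENT[node]
    # Nothing handled the event: exit every enclosing state, then crash.
    trace = []
    node = state
    while node is not None:
        trace.append(f"exit {node}")
        node = PARENT[node]
    trace.append("enter Crashed")
    return "Crashed", trace


if __name__ == "__main__":
    # An old primary skips the reservation step, so RecoveryDone arrives early.
    _, trace = dispatch("Started/ReplicaActive/RepNotRecovering", "RecoveryDone")
    print("\n".join(trace))
```

In this model, delivering RecoveryDone to a replica that is still in RepNotRecovering produces the same exit/exit/exit/enter Crashed sequence seen in the log, whereas the same event is handled normally once the (assumed) reservation step has run.
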
Actions #3

Updated by Sage Weil over 11 years ago

  • Assignee set to Sage Weil
Actions #4

Updated by Sage Weil over 11 years ago

  • Status changed from 12 to 7

wip-3689 has a fix; please test!

Actions #5

Updated by Sage Weil over 11 years ago

  • Status changed from 7 to Resolved