Bug #2133
osd: recovery_complete
% Done:
0%
Source:
Development
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
pull raced with clones, clone_subset changed, it got confused.
2012-03-02 22:54:21.673190 7f33a9490700 osd.4 641 pg[0.a( v 633'215 lc 5'2 (0'0,633'215] n=19 ec=1 les/c 635/633 636/636/634) [4,0] r=0 lpr=636 rops=5 lcod 0'0 mlcod 0'0 active+recovering m=12 snaptrimq=[1~ad,af~19,c9~3,cd~2,d0~9,db~4,e1~4,e6~1,e8~2,eb~5,f1~1,f3~2,f8~1,fd~1]] enqueue_op 0x5d8a200 osd_sub_op(osd.2.0:1364 0.a e2c22d0a/sepia721573-24/74 [push] v 333'131 snapset=0=[]:[] snapc=0=[]) v5 2012-03-02 22:54:21.673369 7f33a6288700 osd.4 641 dequeue_op osd_sub_op(osd.2.0:1364 0.a e2c22d0a/sepia721573-24/74 [push] v 333'131 snapset=0=[]:[] snapc=0=[]) v5 pg pg[0.a( v 633'215 lc 5'2 (0'0,633'215] n=19 ec=1 les/c 635/633 636/636/634) [4,0] r=0 lpr=636 rops=5 lcod 0'0 mlcod 0'0 active+recovering m=12 snaptrimq=[1~ad,af~19,c9~3,cd~2,d0~9,db~4,e1~4,e6~1,e8~2,eb~5,f1~1,f3~2,f8~1,fd~1]] 2012-03-02 22:54:21.673425 7f33a6288700 osd.4 641 pg[0.a( v 633'215 lc 5'2 (0'0,633'215] n=19 ec=1 les/c 635/633 636/636/634) [4,0] r=0 lpr=636 rops=5 lcod 0'0 mlcod 0'0 active+recovering m=12 snaptrimq=[1~ad,af~19,c9~3,cd~2,d0~9,db~4,e1~4,e6~1,e8~2,eb~5,f1~1,f3~2,f8~1,fd~1]] do_sub_op osd_sub_op(osd.2.0:1364 0.a e2c22d0a/sepia721573-24/74 [push] v 333'131 snapset=0=[]:[] snapc=0=[]) v5 2012-03-02 22:54:21.673473 7f33a6288700 osd.4 641 pg[0.a( v 633'215 lc 5'2 (0'0,633'215] n=19 ec=1 les/c 635/633 636/636/634) [4,0] r=0 lpr=636 rops=5 lcod 0'0 mlcod 0'0 active+recovering m=12 snaptrimq=[1~ad,af~19,c9~3,cd~2,d0~9,db~4,e1~4,e6~1,e8~2,eb~5,f1~1,f3~2,f8~1,fd~1]] handle_pull_responseObjectRecoveryInfo(e2c22d0a/sepia721573-24/74@333'131, copy_subset: [0~3274887], clone_subset: {})ObjectRecoveryProgress(!first, data_recovered_to:3145728, data_complete:false, omap_recovered_to:, omap_complete:true) data.size() is 1048576 data_included: [2097152~1048576] 2012-03-02 22:54:21.673509 7f33a6288700 osd.4 641 pg[0.a( v 633'215 lc 5'2 (0'0,633'215] n=19 ec=1 les/c 635/633 636/636/634) [4,0] r=0 lpr=636 rops=5 lcod 0'0 mlcod 0'0 active+recovering m=12 snaptrimq=[1~ad,af~19,c9~3,cd~2,d0~9,db~4,e1~4,e6~1,e8~2,eb~5,f1~1,f3~2,f8~1,fd~1]] get_snapset_context sepia721573-24 2 -> 3 2012-03-02 22:54:21.673544 7f33a6288700 osd.4 641 pg[0.a( v 633'215 lc 5'2 (0'0,633'215] n=19 ec=1 les/c 635/633 636/636/634) [4,0] r=0 lpr=636 rops=5 lcod 0'0 mlcod 0'0 active+recovering m=12 snaptrimq=[1~ad,af~19,c9~3,cd~2,d0~9,db~4,e1~4,e6~1,e8~2,eb~5,f1~1,f3~2,f8~1,fd~1]] calc_clone_subsets e2c22d0a/sepia721573-24/74 clone_overlap {74=[56~515336,1176123~2098764],88=[56~431837,1041259~769636,2503672~574024,3231893~42994],db=[]} 2012-03-02 22:54:21.673581 7f33a6288700 osd.4 641 pg[0.a( v 633'215 lc 5'2 (0'0,633'215] n=19 ec=1 les/c 635/633 636/636/634) [4,0] r=0 lpr=636 rops=5 lcod 0'0 mlcod 0'0 active+recovering m=12 snaptrimq=[1~ad,af~19,c9~3,cd~2,d0~9,db~4,e1~4,e6~1,e8~2,eb~5,f1~1,f3~2,f8~1,fd~1]] calc_clone_subsets e2c22d0a/sepia721573-24/74 has next e2c22d0a/sepia721573-24/88 overlap [56~515336,1176123~2098764] 2012-03-02 22:54:21.673621 7f33a6288700 osd.4 641 pg[0.a( v 633'215 lc 5'2 (0'0,633'215] n=19 ec=1 les/c 635/633 636/636/634) [4,0] r=0 lpr=636 rops=5 lcod 0'0 mlcod 0'0 active+recovering m=12 snaptrimq=[1~ad,af~19,c9~3,cd~2,d0~9,db~4,e1~4,e6~1,e8~2,eb~5,f1~1,f3~2,f8~1,fd~1]] calc_clone_subsets e2c22d0a/sepia721573-24/74 data_subset [0~56,515392~660731] clone_subsets {e2c22d0a/sepia721573-24/88=[56~515336,1176123~2098764]} 2012-03-02 22:54:21.673650 7f33a6288700 osd.4 641 pg[0.a( v 633'215 lc 5'2 (0'0,633'215] n=19 ec=1 les/c 635/633 636/636/634) [4,0] r=0 lpr=636 rops=5 lcod 0'0 mlcod 0'0 active+recovering m=12 snaptrimq=[1~ad,af~19,c9~3,cd~2,d0~9,db~4,e1~4,e6~1,e8~2,eb~5,f1~1,f3~2,f8~1,fd~1]] put_snapset_context sepia721573-24 3 -> 2 2012-03-02 22:54:21.673697 7f33a6288700 osd.4 641 pg[0.a( v 633'215 lc 5'2 (0'0,633'215] n=19 ec=1 les/c 635/633 636/636/634) [4,0] r=0 lpr=636 rops=5 lcod 0'0 mlcod 0'0 active+recovering m=12 snaptrimq=[1~ad,af~19,c9~3,cd~2,d0~9,db~4,e1~4,e6~1,e8~2,eb~5,f1~1,f3~2,f8~1,fd~1]] new recovery_info ObjectRecoveryInfo(e2c22d0a/sepia721573-24/74@333'131, copy_subset: [0~56,515392~660731], clone_subset: {e2c22d0a/sepia721573-24/88=[56~515336,1176123~2098764]}), new progress ObjectRecoveryProgress(!first, data_recovered_to:3145728, data_complete:false, omap_recovered_to:, omap_complete:true) 2012-03-02 22:54:21.673725 7f33a6288700 osd.4 641 pg[0.a( v 633'215 lc 5'2 (0'0,633'215] n=19 ec=1 les/c 635/633 636/636/634) [4,0] r=0 lpr=636 rops=5 lcod 0'0 mlcod 0'0 active+recovering m=12 snaptrimq=[1~ad,af~19,c9~3,cd~2,d0~9,db~4,e1~4,e6~1,e8~2,eb~5,f1~1,f3~2,f8~1,fd~1]] uh oh, we reached EOF on peer before we got everything we wanted 2012-03-02 22:54:21.673756 7f33a6288700 osd.4 641 pg[0.a( v 633'215 lc 5'2 (0'0,633'215] n=19 ec=1 les/c 635/633 636/636/634) [4,0] r=0 lpr=636 rops=5 lcod 0'0 mlcod 0'0 active+recovering m=12 snaptrimq=[1~ad,af~19,c9~3,cd~2,d0~9,db~4,e1~4,e6~1,e8~2,eb~5,f1~1,f3~2,f8~1,fd~1]] _failed_push e2c22d0a/sepia721573-24/74 from osd.2, reps on 2,3 2012-03-02 22:54:21.673789 7f33a6288700 osd.4 641 pg[0.a( v 633'215 lc 5'2 (0'0,633'215] n=19 ec=1 les/c 635/633 636/636/634) [4,0] r=0 lpr=636 rops=5 lcod 0'0 mlcod 0'0 active+recovering m=12 snaptrimq=[1~ad,af~19,c9~3,cd~2,d0~9,db~4,e1~4,e6~1,e8~2,eb~5,f1~1,f3~2,f8~1,fd~1]] finish_recovery_op e2c22d0a/sepia721573-24/74 2012-03-02 22:54:21.673829 7f33a6288700 osd.4 641 finish_recovery_op pg[0.a( v 633'215 lc 5'2 (0'0,633'215] n=19 ec=1 les/c 635/633 636/636/634) [4,0] r=0 lpr=636 rops=4 lcod 0'0 mlcod 0'0 active+recovering m=12 snaptrimq=[1~ad,af~19,c9~3,cd~2,d0~9,db~4,e1~4,e6~1,e8~2,eb~5,f1~1,f3~2,f8~1,fd~1]] e2c22d0a/sepia721573-24/74 dequeue=0 (5/5 rops)
I think the pi.recovery_progress.data_complete (from the peer) is useless here, since the peer didn't know what we have. which makes this check pointless:
bool complete = pi.is_complete(); if (complete && !pi.recovery_progress.data_complete) { dout(0) << " uh oh, we reached EOF on peer before we got everything we wanted" << dendl; _failed_push(op); return; }
just removing it..
Related issues
Associated revisions
osd: don't trust pusher's data_complete
The pusher doesn't know what clone_overlap we'll see, so it has no idea
if we are data_complete from our perspective, making this check useless.
In particular, we screw up if we race with a recalculation of
clone_overlap.
Fixes: #2133
Signed-off-by: Sage Weil <sage@newdream.net>
Reviewed-by: Samuel Just <samuel.just@dreamhost.com>
History
#1 Updated by Sage Weil about 12 years ago
- Status changed from 12 to 4
#2 Updated by Sage Weil about 12 years ago
- Status changed from 4 to Resolved