Project

General

Profile

Bug #2133

osd: recovery_complete

Added by Sage Weil about 12 years ago. Updated about 12 years ago.

Status:
Resolved
Priority:
High
Assignee:
-
Category:
OSD
Target version:
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

pull raced with clones, clone_subset changed, it got confused.

2012-03-02 22:54:21.673190 7f33a9490700 osd.4 641 pg[0.a( v 633'215 lc 5'2 (0'0,633'215] n=19 ec=1 les/c 635/633 636/636/634) [4,0] r=0 lpr=636 rops=5 lcod 0'0 mlcod 0'0 active+recovering m=12 snaptrimq=[1~ad,af~19,c9~3,cd~2,d0~9,db~4,e1~4,e6~1,e8~2,eb~5,f1~1,f3~2,f8~1,fd~1]] enqueue_op 0x5d8a200 osd_sub_op(osd.2.0:1364 0.a e2c22d0a/sepia721573-24/74 [push] v 333'131 snapset=0=[]:[] snapc=0=[]) v5
2012-03-02 22:54:21.673369 7f33a6288700 osd.4 641 dequeue_op osd_sub_op(osd.2.0:1364 0.a e2c22d0a/sepia721573-24/74 [push] v 333'131 snapset=0=[]:[] snapc=0=[]) v5 pg pg[0.a( v 633'215 lc 5'2 (0'0,633'215] n=19 ec=1 les/c 635/633 636/636/634) [4,0] r=0 lpr=636 rops=5 lcod 0'0 mlcod 0'0 active+recovering m=12 snaptrimq=[1~ad,af~19,c9~3,cd~2,d0~9,db~4,e1~4,e6~1,e8~2,eb~5,f1~1,f3~2,f8~1,fd~1]]
2012-03-02 22:54:21.673425 7f33a6288700 osd.4 641 pg[0.a( v 633'215 lc 5'2 (0'0,633'215] n=19 ec=1 les/c 635/633 636/636/634) [4,0] r=0 lpr=636 rops=5 lcod 0'0 mlcod 0'0 active+recovering m=12 snaptrimq=[1~ad,af~19,c9~3,cd~2,d0~9,db~4,e1~4,e6~1,e8~2,eb~5,f1~1,f3~2,f8~1,fd~1]] do_sub_op osd_sub_op(osd.2.0:1364 0.a e2c22d0a/sepia721573-24/74 [push] v 333'131 snapset=0=[]:[] snapc=0=[]) v5
2012-03-02 22:54:21.673473 7f33a6288700 osd.4 641 pg[0.a( v 633'215 lc 5'2 (0'0,633'215] n=19 ec=1 les/c 635/633 636/636/634) [4,0] r=0 lpr=636 rops=5 lcod 0'0 mlcod 0'0 active+recovering m=12 snaptrimq=[1~ad,af~19,c9~3,cd~2,d0~9,db~4,e1~4,e6~1,e8~2,eb~5,f1~1,f3~2,f8~1,fd~1]] handle_pull_responseObjectRecoveryInfo(e2c22d0a/sepia721573-24/74@333'131, copy_subset: [0~3274887], clone_subset: {})ObjectRecoveryProgress(!first, data_recovered_to:3145728, data_complete:false, omap_recovered_to:, omap_complete:true) data.size() is 1048576 data_included: [2097152~1048576]
2012-03-02 22:54:21.673509 7f33a6288700 osd.4 641 pg[0.a( v 633'215 lc 5'2 (0'0,633'215] n=19 ec=1 les/c 635/633 636/636/634) [4,0] r=0 lpr=636 rops=5 lcod 0'0 mlcod 0'0 active+recovering m=12 snaptrimq=[1~ad,af~19,c9~3,cd~2,d0~9,db~4,e1~4,e6~1,e8~2,eb~5,f1~1,f3~2,f8~1,fd~1]] get_snapset_context sepia721573-24 2 -> 3
2012-03-02 22:54:21.673544 7f33a6288700 osd.4 641 pg[0.a( v 633'215 lc 5'2 (0'0,633'215] n=19 ec=1 les/c 635/633 636/636/634) [4,0] r=0 lpr=636 rops=5 lcod 0'0 mlcod 0'0 active+recovering m=12 snaptrimq=[1~ad,af~19,c9~3,cd~2,d0~9,db~4,e1~4,e6~1,e8~2,eb~5,f1~1,f3~2,f8~1,fd~1]] calc_clone_subsets e2c22d0a/sepia721573-24/74 clone_overlap {74=[56~515336,1176123~2098764],88=[56~431837,1041259~769636,2503672~574024,3231893~42994],db=[]}
2012-03-02 22:54:21.673581 7f33a6288700 osd.4 641 pg[0.a( v 633'215 lc 5'2 (0'0,633'215] n=19 ec=1 les/c 635/633 636/636/634) [4,0] r=0 lpr=636 rops=5 lcod 0'0 mlcod 0'0 active+recovering m=12 snaptrimq=[1~ad,af~19,c9~3,cd~2,d0~9,db~4,e1~4,e6~1,e8~2,eb~5,f1~1,f3~2,f8~1,fd~1]] calc_clone_subsets e2c22d0a/sepia721573-24/74 has next e2c22d0a/sepia721573-24/88 overlap [56~515336,1176123~2098764]
2012-03-02 22:54:21.673621 7f33a6288700 osd.4 641 pg[0.a( v 633'215 lc 5'2 (0'0,633'215] n=19 ec=1 les/c 635/633 636/636/634) [4,0] r=0 lpr=636 rops=5 lcod 0'0 mlcod 0'0 active+recovering m=12 snaptrimq=[1~ad,af~19,c9~3,cd~2,d0~9,db~4,e1~4,e6~1,e8~2,eb~5,f1~1,f3~2,f8~1,fd~1]] calc_clone_subsets e2c22d0a/sepia721573-24/74  data_subset [0~56,515392~660731]  clone_subsets {e2c22d0a/sepia721573-24/88=[56~515336,1176123~2098764]}
2012-03-02 22:54:21.673650 7f33a6288700 osd.4 641 pg[0.a( v 633'215 lc 5'2 (0'0,633'215] n=19 ec=1 les/c 635/633 636/636/634) [4,0] r=0 lpr=636 rops=5 lcod 0'0 mlcod 0'0 active+recovering m=12 snaptrimq=[1~ad,af~19,c9~3,cd~2,d0~9,db~4,e1~4,e6~1,e8~2,eb~5,f1~1,f3~2,f8~1,fd~1]] put_snapset_context sepia721573-24 3 -> 2
2012-03-02 22:54:21.673697 7f33a6288700 osd.4 641 pg[0.a( v 633'215 lc 5'2 (0'0,633'215] n=19 ec=1 les/c 635/633 636/636/634) [4,0] r=0 lpr=636 rops=5 lcod 0'0 mlcod 0'0 active+recovering m=12 snaptrimq=[1~ad,af~19,c9~3,cd~2,d0~9,db~4,e1~4,e6~1,e8~2,eb~5,f1~1,f3~2,f8~1,fd~1]] new recovery_info ObjectRecoveryInfo(e2c22d0a/sepia721573-24/74@333'131, copy_subset: [0~56,515392~660731], clone_subset: {e2c22d0a/sepia721573-24/88=[56~515336,1176123~2098764]}), new progress ObjectRecoveryProgress(!first, data_recovered_to:3145728, data_complete:false, omap_recovered_to:, omap_complete:true)
2012-03-02 22:54:21.673725 7f33a6288700 osd.4 641 pg[0.a( v 633'215 lc 5'2 (0'0,633'215] n=19 ec=1 les/c 635/633 636/636/634) [4,0] r=0 lpr=636 rops=5 lcod 0'0 mlcod 0'0 active+recovering m=12 snaptrimq=[1~ad,af~19,c9~3,cd~2,d0~9,db~4,e1~4,e6~1,e8~2,eb~5,f1~1,f3~2,f8~1,fd~1]]  uh oh, we reached EOF on peer before we got everything we wanted
2012-03-02 22:54:21.673756 7f33a6288700 osd.4 641 pg[0.a( v 633'215 lc 5'2 (0'0,633'215] n=19 ec=1 les/c 635/633 636/636/634) [4,0] r=0 lpr=636 rops=5 lcod 0'0 mlcod 0'0 active+recovering m=12 snaptrimq=[1~ad,af~19,c9~3,cd~2,d0~9,db~4,e1~4,e6~1,e8~2,eb~5,f1~1,f3~2,f8~1,fd~1]] _failed_push e2c22d0a/sepia721573-24/74 from osd.2, reps on 2,3
2012-03-02 22:54:21.673789 7f33a6288700 osd.4 641 pg[0.a( v 633'215 lc 5'2 (0'0,633'215] n=19 ec=1 les/c 635/633 636/636/634) [4,0] r=0 lpr=636 rops=5 lcod 0'0 mlcod 0'0 active+recovering m=12 snaptrimq=[1~ad,af~19,c9~3,cd~2,d0~9,db~4,e1~4,e6~1,e8~2,eb~5,f1~1,f3~2,f8~1,fd~1]] finish_recovery_op e2c22d0a/sepia721573-24/74
2012-03-02 22:54:21.673829 7f33a6288700 osd.4 641 finish_recovery_op pg[0.a( v 633'215 lc 5'2 (0'0,633'215] n=19 ec=1 les/c 635/633 636/636/634) [4,0] r=0 lpr=636 rops=4 lcod 0'0 mlcod 0'0 active+recovering m=12 snaptrimq=[1~ad,af~19,c9~3,cd~2,d0~9,db~4,e1~4,e6~1,e8~2,eb~5,f1~1,f3~2,f8~1,fd~1]] e2c22d0a/sepia721573-24/74 dequeue=0 (5/5 rops)

I think the pi.recovery_progress.data_complete (from the peer) is useless here, since the peer didn't know what we have. which makes this check pointless:
  bool complete = pi.is_complete();

  if (complete && !pi.recovery_progress.data_complete) {
    dout(0) << " uh oh, we reached EOF on peer before we got everything we wanted" 
        << dendl;
    _failed_push(op);
    return;
  }

just removing it..


Related issues

Duplicated by Ceph - Bug #2126: osd: recover_primary did nothing when num_missing==1 Duplicate 03/02/2012

Associated revisions

Revision 743da9bd (diff)
Added by Sage Weil about 12 years ago

osd: don't trust pusher's data_complete

The pusher doesn't know what clone_overlap we'll see, so it has no idea
if we are data_complete from our perspective, making this check useless.
In particular, we screw up if we race with a recalculation of
clone_overlap.

Fixes: #2133
Signed-off-by: Sage Weil <>
Reviewed-by: Samuel Just <>

History

#1 Updated by Sage Weil about 12 years ago

  • Status changed from 12 to 4

#2 Updated by Sage Weil about 12 years ago

  • Status changed from 4 to Resolved

Also available in: Atom PDF