
Bug #36411

OSD crash starting recovery/backfill with EC pool

Added by Graham Allan 4 months ago. Updated 4 months ago.

Status: Closed
Priority: Normal
Assignee: -
Category: EC Pools
Target version:
Start date: 10/10/2018
Due date:
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): OSD
Pull request ID:

Description

We have one pg on a 4+2 EC pool whose OSDs crash with the following error as soon as the pg reaches an active set at min_size.

(extract with osd debug level 20)

    -5> 2018-10-09 16:37:42.240012 7fd40984f700 20 osd.190 pg_epoch: 710815 pg[70.82ds2( v 710815'704747 (586066'698574,710815'704747] local-lis/les=710814/710815 n=102931 ec=21494/21494 lis/c 710814/588565 les/c/f 710815/588566/0 710711/710814/710814) [820,761,105,789,562,485]/[2147483647,2147483647,190,448,61,315]p190(2) r=2 lpr=710814 pi=[588565,710814)/39 luod=710799'704745 rops=1 bft=105(2),485(5),562(4),761(1),789(3),820(0) crt=710799'704745 lcod 0'0 mlcod 0'0 active+undersized+degraded+remapped+backfilling] handle_sub_read_reply Complete: ReadOp(tid=15, to_read={70:b415ca12:::default.10341087.1__shadow_Subjects%2fsub-NDARINV6BVVAY29%2fses-baselineYear1Arm1%2ffunc%2frun-04_AROMA%2fdenoised_func_data_nonaggr.nii.2~VfI6rAYnU4XtzUcFAoecJDU5TpVR8AP.16_3:head=read_request_t(to_read=[0,8388608,0], need=61(4),190(2),315(5),448(3), want_attrs=0)}, complete={70:b415ca12:::default.10341087.1__shadow_Subjects%2fsub-NDARINV6BVVAY29%2fses-baselineYear1Arm1%2ffunc%2frun-04_AROMA%2fdenoised_func_data_nonaggr.nii.2~VfI6rAYnU4XtzUcFAoecJDU5TpVR8AP.16_3:head=read_result_t(r=0, errors={}, noattrs, returned=(0, 8388608, [61(4),786432, 190(2),786432, 315(5),786432, 448(3),786432]))}, priority=1, obj_to_source={70:b415ca12:::default.10341087.1__shadow_Subjects%2fsub-NDARINV6BVVAY29%2fses-baselineYear1Arm1%2ffunc%2frun-04_AROMA%2fdenoised_func_data_nonaggr.nii.2~VfI6rAYnU4XtzUcFAoecJDU5TpVR8AP.16_3:head=61(4),190(2),315(5),448(3)}, source_to_obj={61(4)=70:b415ca12:::default.10341087.1__shadow_Subjects%2fsub-NDARINV6BVVAY29%2fses-baselineYear1Arm1%2ffunc%2frun-04_AROMA%2fdenoised_func_data_nonaggr.nii.2~VfI6rAYnU4XtzUcFAoecJDU5TpVR8AP.16_3:head,190(2)=70:b415ca12:::default.10341087.1__shadow_Subjects%2fsub-NDARINV6BVVAY29%2fses-baselineYear1Arm1%2ffunc%2frun-04_AROMA%2fdenoised_func_data_nonaggr.nii.2~VfI6rAYnU4XtzUcFAoecJDU5TpVR8AP.16_3:head,315(5)=70:b415ca12:::default.10341087.1__shadow_Subjects%2fsub-NDARINV6BVVAY29%2fses-baselineYear1Arm1%2ffunc%2frun-04_AROMA%2fdenoised_func_data_nonaggr.nii.2~VfI6rAYnU4XtzUcFAoecJDU5TpVR8AP.16_3:head,448(3)=70:b415ca12:::default.10341087.1__shadow_Subjects%2fsub-NDARINV6BVVAY29%2fses-baselineYear1Arm1%2ffunc%2frun-04_AROMA%2fdenoised_func_data_nonaggr.nii.2~VfI6rAYnU4XtzUcFAoecJDU5TpVR8AP.16_3:head}, in_progress=)
    -4> 2018-10-09 16:37:42.240048 7fd40984f700 10 osd.190 pg_epoch: 710815 pg[70.82ds2( v 710815'704747 (586066'698574,710815'704747] local-lis/les=710814/710815 n=102931 ec=21494/21494 lis/c 710814/588565 les/c/f 710815/588566/0 710711/710814/710814) [820,761,105,789,562,485]/[2147483647,2147483647,190,448,61,315]p190(2) r=2 lpr=710814 pi=[588565,710814)/39 luod=710799'704745 rops=1 bft=105(2),485(5),562(4),761(1),789(3),820(0) crt=710799'704745 lcod 0'0 mlcod 0'0 active+undersized+degraded+remapped+backfilling] handle_recovery_read_complete: returned 70:b415ca12:::default.10341087.1__shadow_Subjects%2fsub-NDARINV6BVVAY29%2fses-baselineYear1Arm1%2ffunc%2frun-04_AROMA%2fdenoised_func_data_nonaggr.nii.2~VfI6rAYnU4XtzUcFAoecJDU5TpVR8AP.16_3:head (0, 8388608, [61(4),786432, 190(2),786432, 315(5),786432, 448(3),786432])
    -3> 2018-10-09 16:37:42.240074 7fd40984f700 10 osd.190 pg_epoch: 710815 pg[70.82ds2( v 710815'704747 (586066'698574,710815'704747] local-lis/les=710814/710815 n=102931 ec=21494/21494 lis/c 710814/588565 les/c/f 710815/588566/0 710711/710814/710814) [820,761,105,789,562,485]/[2147483647,2147483647,190,448,61,315]p190(2) r=2 lpr=710814 pi=[588565,710814)/39 luod=710799'704745 rops=1 bft=105(2),485(5),562(4),761(1),789(3),820(0) crt=710799'704745 lcod 0'0 mlcod 0'0 active+undersized+degraded+remapped+backfilling] handle_recovery_read_complete: [2,786432, 3,786432, 4,786432, 5,786432]
    -2> 2018-10-09 16:37:42.245676 7fd40984f700 10 osd.190 pg_epoch: 710815 pg[70.82ds2( v 710815'704747 (586066'698574,710815'704747] local-lis/les=710814/710815 n=102931 ec=21494/21494 lis/c 710814/588565 les/c/f 710815/588566/0 710711/710814/710814) [820,761,105,789,562,485]/[2147483647,2147483647,190,448,61,315]p190(2) r=2 lpr=710814 pi=[588565,710814)/39 luod=710799'704745 rops=1 bft=105(2),485(5),562(4),761(1),789(3),820(0) crt=710799'704745 lcod 0'0 mlcod 0'0 active+undersized+degraded+remapped+backfilling] continue_recovery_op: continuing RecoveryOp(hoid=70:b415ca12:::default.10341087.1__shadow_Subjects%2fsub-NDARINV6BVVAY29%2fses-baselineYear1Arm1%2ffunc%2frun-04_AROMA%2fdenoised_func_data_nonaggr.nii.2~VfI6rAYnU4XtzUcFAoecJDU5TpVR8AP.16_3:head v=579167'695462 missing_on=105(2),485(5),562(4),761(1),789(3),820(0) missing_on_shards=0,1,2,3,4,5 recovery_info=ObjectRecoveryInfo(70:b415ca12:::default.10341087.1__shadow_Subjects%2fsub-NDARINV6BVVAY29%2fses-baselineYear1Arm1%2ffunc%2frun-04_AROMA%2fdenoised_func_data_nonaggr.nii.2~VfI6rAYnU4XtzUcFAoecJDU5TpVR8AP.16_3:head@579167'695462, size: 0, copy_subset: [], clone_subset: {}, snapset: 0=[]:[]) recovery_progress=ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:true, error:false) obc refcount=3 state=READING waiting_on_pushes= extent_requested=0,8388608)
    -1> 2018-10-09 16:37:42.245787 7fd40984f700 10 osd.190 pg_epoch: 710815 pg[70.82ds2( v 710815'704747 (586066'698574,710815'704747] local-lis/les=710814/710815 n=102931 ec=21494/21494 lis/c 710814/588565 les/c/f 710815/588566/0 710711/710814/710814) [820,761,105,789,562,485]/[2147483647,2147483647,190,448,61,315]p190(2) r=2 lpr=710814 pi=[588565,710814)/39 luod=710799'704745 rops=1 bft=105(2),485(5),562(4),761(1),789(3),820(0) crt=710799'704745 lcod 0'0 mlcod 0'0 active+undersized+degraded+remapped+backfilling] continue_recovery_op: before_progress=ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:true, error:false), after_progress=ObjectRecoveryProgress(!first, data_recovered_to:0, data_complete:true, omap_recovered_to:, omap_complete:true, error:false), pop.data.length()=786432, size=0
     0> 2018-10-09 16:37:42.250431 7fd40984f700 -1 /build/ceph-12.2.8/src/osd/ECBackend.cc: In function 'void ECBackend::continue_recovery_op(ECBackend::RecoveryOp&, RecoveryMessages*)' thread 7fd40984f700 time 2018-10-09 16:37:42.245813
/build/ceph-12.2.8/src/osd/ECBackend.cc: 619: FAILED assert(pop.data.length() == sinfo.aligned_logical_offset_to_chunk_offset( after_progress.data_recovered_to - op.recovery_progress.data_recovered_to))
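
Reading the values in the log above, the failed assert's arithmetic can be sketched as follows. This is an interpretation, not a confirmed diagnosis: the division by k=4 stands in for aligned_logical_offset_to_chunk_offset(), and treating it as a plain divide is an assumption about the stripe layout. Note that recovery_info reports size: 0, so data_recovered_to never advances past 0, while the shards actually returned 786432 bytes each:

```shell
k=4                      # 4+2 EC pool: 4 data chunks
before_recovered_to=0    # before_progress ... data_recovered_to:0
after_recovered_to=0     # after_progress ... data_recovered_to:0 (size: 0 clamps it)
pop_data_length=786432   # pop.data.length()=786432 from the log

# The assert expects the push payload length to equal the chunk-offset delta
expected=$(( (after_recovered_to - before_recovered_to) / k ))
echo "expected=$expected actual=$pop_data_length"
# expected=0 while the shards returned 786432 bytes, so the assert fires
```

If this reading is right, the crash stems from the size: 0 in recovery_info rather than from the read itself, which would be consistent with the corrupted-metadata theory in comment #3 below.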

I've looked for the referenced object on all of the filestore OSDs, and it appears to be present and at least has the correct size on all of them.

It does seem that I can get the pg to stay active without osd crashes if I set nobackfill.
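
For reference, this is the cluster-wide OSD flag (a sketch of the commands involved; run with an admin keyring):

```shell
# Pause backfill cluster-wide so the pg stays active without triggering the crash
ceph osd set nobackfill

# ...inspect/download the suspect objects while the pg is quiet...

# Re-enable backfill afterwards
ceph osd unset nobackfill
```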

Looking up the referenced "last_backfill" object for each osd in the pg, I can download these objects via s3/radosgw as long as I keep the pg alive with nobackfill.

For 2 out of 3 of these, I can't get the etags to match, so perhaps there is data corruption.
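
One way to run such a check (a sketch; BUCKET and KEY are placeholders for the last_backfill object's s3 name, and note that multipart-upload etags carry a "-N" suffix and are not a plain MD5, so a mismatch alone is not proof of corruption):

```shell
# Fetch the object and compare its MD5 against the etag stored in radosgw
s3cmd get s3://BUCKET/KEY /tmp/obj
s3cmd info s3://BUCKET/KEY | grep -i 'md5'   # stored etag/MD5
md5sum /tmp/obj                              # MD5 of the downloaded data
```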

Still, there must surely be a way to persuade backfill to continue even at the cost of losing these objects?

History

#1 Updated by Graham Allan 4 months ago

This resolved itself, though in a way that doesn't exactly make any sense...

Eventually I noticed that one of the active osds for the pg seemed very slow to respond - not via ceph's slow ops warning, but while running a 'find' to track down some particular object shards. The drive was clearly failing, though I never saw any actual i/o errors at the user or osd level (it did trigger a 24-hour deathwatch SMART warning a bit later).

I exported the pg shard from the failing osd and re-imported it to another otherwise-evacuated osd. This was just for data safety; I never imagined it could be causing the other osds in the pg to crash.
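
The export/import can be done with ceph-objectstore-tool (a sketch; the data/journal paths and the target osd number NNN are examples, both osds must be stopped while the tool runs, and this is filestore, hence the journal path):

```shell
# Stop the failing osd, then export the pg shard from its filestore
systemctl stop ceph-osd@190
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-190 \
    --journal-path /var/lib/ceph/osd/ceph-190/journal \
    --op export --pgid 70.82ds2 --file /tmp/pg70.82ds2.export

# Import into another (stopped, otherwise-evacuated) osd
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-NNN \
    --journal-path /var/lib/ceph/osd/ceph-NNN/journal \
    --op import --file /tmp/pg70.82ds2.export
```

Note the "s2" suffix in the pgid: for an EC pool the shard id is part of the pg identifier.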

I left the cluster set to nobackfill over the weekend, during which time all osds kept running; this morning I re-enabled backfill, fully expecting this pg to crash again (and then start digging into possibly moving implicated backfill objects out of the way). Instead of which it happily backfilled its way to cleanliness!

I don't think the failing drive's osd was suffering anything beyond slow response times, but it's the only thing which really changed.

#2 Updated by Graham Allan 4 months ago

I have to add to the previous update, which did not explain the resolution of the problem.

The true solution was when, towards the end of the previous week, I identified the s3 object corresponding to the problematic backfill shards, and (again via s3) deleted it.

I did then check the filestore OSDs for several hours and found the object shards still present, so assumed the deletion didn't fully succeed. Apparently though they were later removed by background garbage collection, which is why on re-enabling backfill later, the problem was gone.

#3 Updated by Greg Farnum 4 months ago

  • Status changed from New to Closed

It's my current belief that these objects were broken as a result of intentional metadata manipulation when some of the PG shards were lost with known writes, so while it would be nice to handle that somewhat more smoothly, there are no immediate issues from this case to debug.
