Bug #50842

pacific: recovery does not complete because of rw_manager lock not being released

Added by Deepika Upadhyay almost 3 years ago. Updated almost 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Recovery of the snapshot (clone) object completes, and recovery should then proceed to the head object (a simplified sketch of this transition follows the excerpt):


2021-05-17T10:38:07.689+0000 7f6f6a617700 10 osd.7 pg_epoch: 139 pg[3.ds0( v 139'278 (0'0,139'278] local-lis/les=135/136 n=5 ec=26/26 lis/c=135/26 les/c/f=136/27/0 sis=135) [7,5,2,6,11,3]/[7,5,NONE,6,11,3]p7(0) async=[2(2)] r=0 lpr=135 pi=[26,135)/1 crt=139'278 lcod 136'276 mlcod 34'17 active+recovering+undersized+degraded+remapped rops=1 mbc={0={(1+0)=4},1={(1+0)=4},2={(0+0)=4},3={(1+0)=4},4={(1+0)=4},5={(1+0)=4}} trimq=[63~1,68~1,6b~1]] _handle_message: MOSDPGPushReply(3.ds0 139/135 [PushReplyOp(3:b037db00:::smithi02839165-33:68)]) v3
2021-05-17T10:38:07.689+0000 7f6f6a617700 10 osd.7 pg_epoch: 139 pg[3.ds0( v 139'278 (0'0,139'278] local-lis/les=135/136 n=5 ec=26/26 lis/c=135/26 les/c/f=136/27/0 sis=135) [7,5,2,6,11,3]/[7,5,NONE,6,11,3]p7(0) async=[2(2)] r=0 lpr=135 pi=[26,135)/1 crt=139'278 lcod 136'276 mlcod 34'17 active+recovering+undersized+degraded+remapped rops=1 mbc={0={(1+0)=4},1={(1+0)=4},2={(0+0)=4},3={(1+0)=4},4={(1+0)=4},5={(1+0)=4}} trimq=[63~1,68~1,6b~1]] continue_recovery_op: continuing RecoveryOp(hoid=3:b037db00:::smithi02839165-33:68 v=136'275 missing_on=2(2) missing_on_shards=2 recovery_info=ObjectRecoveryInfo(3:b037db00:::smithi02839165-33:68@136'275, size: 98304, copy_subset: [], clone_subset: {}, snapset: 68=[]:{68=[68,64,60]}, object_exist: 1) recovery_progress=ObjectRecoveryProgress(!first, data_recovered_to:98304, data_complete:true, omap_recovered_to:, omap_complete:true, error:false) obc refcount=3 state=WRITING waiting_on_pushes= extent_requested=0,1048576)
2021-05-17T10:38:07.689+0000 7f6f6a617700 10 osd.7 pg_epoch: 139 pg[3.ds0( v 139'278 (0'0,139'278] local-lis/les=135/136 n=5 ec=26/26 lis/c=135/26 les/c/f=136/27/0 sis=135) [7,5,2,6,11,3]/[7,5,NONE,6,11,3]p7(0) async=[2(2)] r=0 lpr=135 pi=[26,135)/1 crt=139'278 lcod 136'276 mlcod 34'17 active+recovering+undersized+degraded+remapped rops=1 mbc={0={(1+0)=4},1={(1+0)=4},2={(0+0)=4},3={(1+0)=4},4={(1+0)=4},5={(1+0)=4}} trimq=[63~1,68~1,6b~1]] continue_recovery_op: on_peer_recover on 2(2), obj 3:b037db00:::smithi02839165-33:68
2021-05-17T10:38:07.689+0000 7f6f6a617700 10 osd.7 pg_epoch: 139 pg[3.ds0( v 139'278 (0'0,139'278] local-lis/les=135/136 n=5 ec=26/26 lis/c=135/26 les/c/f=136/27/0 sis=135) [7,5,2,6,11,3]/[7,5,NONE,6,11,3]p7(0) async=[2(2)] r=0 lpr=135 pi=[26,135)/1 crt=139'278 lcod 136'276 mlcod 34'17 active+recovering+undersized+degraded+remapped rops=1 mbc={0={(1+0)=3},1={(1+0)=3},2={(0+0)=3},3={(1+0)=3},4={(1+0)=3},5={(1+0)=3}} trimq=[63~1,68~1,6b~1]] pushed 3:b037db00:::smithi02839165-33:68 to all replicas
2021-05-17T10:38:07.689+0000 7f6f6a617700 10 osd.7 pg_epoch: 139 pg[3.ds0( v 139'278 (0'0,139'278] local-lis/les=135/136 n=5 ec=26/26 lis/c=135/26 les/c/f=136/27/0 sis=135) [7,5,2,6,11,3]/[7,5,NONE,6,11,3]p7(0) async=[2(2)] r=0 lpr=135 pi=[26,135)/1 crt=139'278 lcod 136'276 mlcod 34'17 active+recovering+undersized+degraded+remapped rops=1 mbc={0={(1+0)=3},1={(1+0)=3},2={(0+0)=3},3={(1+0)=3},4={(1+0)=3},5={(1+0)=3}} trimq=[63~1,68~1,6b~1]] finish_recovery_op 3:b037db00:::smithi02839165-33:68
2021-05-17T10:38:07.689+0000 7f6f6a617700 10 osd.7 139 finish_recovery_op pg[3.ds0( v 139'278 (0'0,139'278] local-lis/les=135/136 n=5 ec=26/26 lis/c=135/26 les/c/f=136/27/0 sis=135) [7,5,2,6,11,3]/[7,5,NONE,6,11,3]p7(0) async=[2(2)] r=0 lpr=135 pi=[26,135)/1 crt=139'278 lcod 136'276 mlcod 34'17 active+recovering+undersized+degraded+remapped mbc={0={(1+0)=3},1={(1+0)=3},2={(0+0)=3},3={(1+0)=3},4={(1+0)=3},5={(1+0)=3}} trimq=[63~1,68~1,6b~1]] 3:b037db00:::smithi02839165-33:68 dequeue=0 (1/1000 rops)
2021-05-17T10:38:07.689+0000 7f6f6a617700 10 osd.7 pg_epoch: 139 pg[3.ds0( v 139'278 (0'0,139'278] local-lis/les=135/136 n=5 ec=26/26 lis/c=135/26 les/c/f=136/27/0 sis=135) [7,5,2,6,11,3]/[7,5,NONE,6,11,3]p7(0) async=[2(2)] r=0 lpr=135 pi=[26,135)/1 crt=139'278 lcod 136'276 mlcod 34'17 active+recovering+undersized+degraded+remapped mbc={0={(1+0)=3},1={(1+0)=3},2={(0+0)=3},3={(1+0)=3},4={(1+0)=3},5={(1+0)=3}} trimq=[63~1,68~1,6b~1]] release_backoffs [3:b037db00:::smithi02839165-33:68,3:b037db00:::smithi02839165-33:68)
2021-05-17T10:38:07.689+0000 7f6f6a617700 10 osd.7 pg_epoch: 139 pg[3.ds0( v 139'278 (0'0,139'278] local-lis/les=135/136 n=5 ec=26/26 lis/c=135/26 les/c/f=136/27/0 sis=135) [7,5,2,6,11,3]/[7,5,NONE,6,11,3]p7(0) async=[2(2)] r=0 lpr=135 pi=[26,135)/1 crt=139'278 lcod 136'276 mlcod 34'17 active+recovering+undersized+degraded+remapped mbc={0={(1+0)=3},1={(1+0)=3},2={(0+0)=3},3={(1+0)=3},4={(1+0)=3},5={(1+0)=3}} trimq=[63~1,68~1,6b~1]] finish_degraded_object 3:b037db00:::smithi02839165-33:68
2021-05-17T10:38:07.689+0000 7f6f6a617700 10 osd.7 pg_epoch: 139 pg[3.ds0( v 139'278 (0'0,139'278] local-lis/les=135/136 n=5 ec=26/26 lis/c=135/26 les/c/f=136/27/0 sis=135) [7,5,2,6,11,3]/[7,5,NONE,6,11,3]p7(0) async=[2(2)] r=0 lpr=135 pi=[26,135)/1 crt=139'278 lcod 136'276 mlcod 34'17 active+recovering+undersized+degraded+remapped mbc={0={(1+0)=3},1={(1+0)=3},2={(0+0)=3},3={(1+0)=3},4={(1+0)=3},5={(1+0)=3}} trimq=[63~1,68~1,6b~1]] continue_recovery_op: WRITING return RecoveryOp(hoid=3:b037db00:::smithi02839165-33:68 v=136'275 missing_on=2(2) missing_on_shards=2 recovery_info=ObjectRecoveryInfo(3:b037db00:::smithi02839165-33:68@136'275, size: 98304, copy_subset: [], clone_subset: {}, snapset: 68=[]:{68=[68,64,60]}, object_exist: 1) recovery_progress=ObjectRecoveryProgress(!first, data_recovered_to:98304, data_complete:true, omap_recovered_to:, omap_complete:true, error:false) obc refcount=2 state=COMPLETE waiting_on_pushes= extent_requested=0,1048576)

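The excerpt shows the clone's RecoveryOp going from WRITING to COMPLETE once the MOSDPGPushReply arrives, followed by finish_recovery_op, at which point the next missing object (the head) should be started. A minimal sketch of that step, not Ceph's actual ECBackend code (the struct and function names here are illustrative only):

#include <set>

struct RecoveryOp {
  enum class State { IDLE, READING, WRITING, COMPLETE };
  State state = State::WRITING;
  std::set<int> waiting_on_pushes;   // shards that have not replied yet
};

// Called once per push reply from shard 'shard'.
// Returns true when the op is done and finish_recovery_op() should run,
// freeing the rops slot so recovery can move on to the head object.
bool handle_push_reply(RecoveryOp& op, int shard) {
  op.waiting_on_pushes.erase(shard);
  if (op.state == RecoveryOp::State::WRITING && op.waiting_on_pushes.empty()) {
    op.state = RecoveryOp::State::COMPLETE;   // matches "state=COMPLETE" in the log
    return true;
  }
  return false;
}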

but the rw lock on the head object is still held, so its recovery is delayed (a simplified model of this lock follows the excerpt):

2021-05-17T10:38:07.689+0000 7f6f6e61f700 10 osd.7 pg_epoch: 139 pg[3.ds0( v 139'278 (0'0,139'278] local-lis/les=135/136 n=5 ec=26/26 lis/c=135/26 les/c/f=136/27/0 sis=135) [7,5,2,6,11,3]/[7,5,NONE,6,11,3]p7(0) async=[2(2)] r=0 lpr=135 pi=[26,135)/1 crt=139'278 lcod 136'276 mlcod 34'17 active+recovering+undersized+degraded+remapped mbc={0={(1+0)=3},1={(1+0)=3},2={(0+0)=3},3={(1+0)=3},4={(1+0)=3},5={(1+0)=3}} trimq=[63~1,68~1,6b~1]] get_object_context: 0x5627be07c840 3:b037db00:::smithi02839165-33:head rwstate(write n=1 w=0) oi: 3:b037db00:::smithi02839165-33:head(136'276 client.4495.0:1402 dirty|data_digest s 5406720 uv 174 dd 9f069648 alloc_hint [0 0 0]) exists: 1 ssc: 0x5627c14e1b80 snapset: 68=[]:{68=[68,64,60]}
2021-05-17T10:38:07.689+0000 7f6f6e61f700 20 osd.7 pg_epoch: 139 pg[3.ds0( v 139'278 (0'0,139'278] local-lis/les=135/136 n=5 ec=26/26 lis/c=135/26 les/c/f=136/27/0 sis=135) [7,5,2,6,11,3]/[7,5,NONE,6,11,3]p7(0) async=[2(2)] r=0 lpr=135 pi=[26,135)/1 crt=139'278 lcod 136'276 mlcod 34'17 active+recovering+undersized+degraded+remapped mbc={0={(1+0)=3},1={(1+0)=3},2={(0+0)=3},3={(1+0)=3},4={(1+0)=3},5={(1+0)=3}} trimq=[63~1,68~1,6b~1]] recovery delayed on 3:b037db00:::smithi02839165-33:head; could not get rw_manager lock

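Rough reading of the "could not get rw_manager lock" line: each object context carries an rwstate ("rwstate(write n=1 w=0)" above appears to mean one write holder and no queued waiters), and recovery of an object has to acquire that lock before it can run. A minimal model of the gate, not Ceph's actual ObjectContext/rw_manager code (all names below are illustrative):

#include <deque>
#include <functional>
#include <utility>

struct RWState {
  enum class Mode { None, Read, Write };
  Mode mode = Mode::None;
  int holders = 0;                            // roughly the "n=" in the log
  std::deque<std::function<void()>> waiters;  // roughly the "w=" in the log

  // Recovery tries to take the lock; a client write excludes it, which is
  // what produces "recovery delayed on ...:head; could not get rw_manager
  // lock". The retry closure is parked until the write releases the lock.
  bool try_lock_for_recovery(std::function<void()> retry) {
    if (mode != Mode::Write) {
      mode = Mode::Read;
      ++holders;
      return true;
    }
    waiters.push_back(std::move(retry));
    return false;
  }

  // The write path must release the lock and re-kick the parked retries.
  // If that release/requeue never happens for the head object (the symptom
  // in this ticket), its recovery is never retried and the PG stays stuck
  // in active+recovering.
  void put_write(const std::function<void(std::function<void()>)>& requeue) {
    if (--holders == 0) {
      mode = Mode::None;
      while (!waiters.empty()) {
        requeue(std::move(waiters.front()));
        waiters.pop_front();
      }
    }
  }
};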
http://qa-proxy.ceph.com/teuthology/ideepika-2021-05-17_10:16:28-rados-wip-yuri2-testing-2021-05-11-1032-pacific-distro-basic-smithi/6118165/teuthology.log - no logs

rados/thrash-erasure-code-big/{ceph cluster/{12-osds openstack} mon_election/connectivity msgr-failures/few objectstore/bluestore-hybrid rados recovery-overrides/{default} supported-random-distro$/{centos_8} thrashers/mapgap thrashosds-health workloads/ec-rados-plugin=jerasure-k=4-m=2}


Related issues

Related to RADOS - Bug #51076: "wait_for_recovery: failed before timeout expired" during thrashosd test with EC backend. Resolved
Duplicates RADOS - Bug #51076: "wait_for_recovery: failed before timeout expired" during thrashosd test with EC backend. Resolved

History

#1 Updated by Neha Ojha almost 3 years ago

  • Related to Bug #51076: "wait_for_recovery: failed before timeout expired" during thrashosd test with EC backend. added

#2 Updated by Neha Ojha almost 3 years ago

@Deepika, do you happen to have the logs saved somewhere?

#3 Updated by Neha Ojha almost 3 years ago

  • Description updated (diff)

#4 Updated by Deepika Upadhyay almost 3 years ago

@Neha: I did, but I'm afraid they are lost; the test was from https://pulpito.ceph.com/ideepika-2021-05-17_10:16:28-rados-wip-yuri2-testing-2021-05-11-1032-pacific-distro-basic-smithi/

This was caught when the dispatch delay params were introduced in https://github.com/ceph/ceph/pull/41136; adjusting them might help with reproducing (a hypothetical override is sketched below).
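For completeness, one guess at what "adjusting them" could look like: the PR above adds OSD dispatch-delay injection options, and a teuthology override along these lines might widen the race window. The option names and values are assumptions, so verify them against the PR before use.

# Hypothetical override -- option names assumed from the dispatch-delay PR,
# verify against https://github.com/ceph/ceph/pull/41136 before use.
overrides:
  ceph:
    conf:
      osd:
        osd debug inject dispatch delay probability: 0.1
        osd debug inject dispatch delay duration: 0.1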

#5 Updated by Neha Ojha almost 3 years ago

  • Status changed from New to Need More Info

#6 Updated by Laura Flores about 2 years ago

I recovered logs from a scenario that looks very similar.

See the full result of `zcat /a/yuriw-2022-02-17_22:49:55-rados-wip-yuri3-testing-2022-02-17-1256-distro-default-smithi/6692376/remote/smithi018/log/ceph-osd.0.log.gz | grep "pg\[3.6s0( v"` on the teuthology node in /home/lflores/pg3-6.txt.

/a/yuriw-2022-02-17_22:49:55-rados-wip-yuri3-testing-2022-02-17-1256-distro-default-smithi/6692376

2022-02-18T09:43:35.363+0000 7fd96955b700 10 osd.0 pg_epoch: 529 pg[3.6s0( v 529'462 (0'0,529'462] local-lis/les=526/527 n=20 ec=26/26 lis/c=526/461 les/c/f=527/462/0 sis=526) [0,1,5,8,11,4]/[0,1,NONE,8,11,4]p0(0) async=[5(2)] r=0 lpr=526 pi=[461,526)/1 luod=520'459 crt=520'459 lcod 519'457 mlcod 0'0 active+recovering+undersized+remapped rops=6 mbc={0={(1+0)=8},1={(1+0)=8},2={(0+0)=2,(0+1)=6},3={(1+0)=8},4={(1+0)=8},5={(1+0)=8}} trimq=[13d~1,192~1,1a9~1] ps=[12c~1,132~1,15d~1,16a~1,17d~1]] get_object_context: 0x55a83cd50000 3:60e7a2b9:::smithi01822269-15:head rwstate(write n=2 w=1) oi: 3:60e7a2b9:::smithi01822269-15:head(529'462 client.4629.0:5439 dirty|data_digest s 2408448 uv 634 dd a180df3a alloc_hint [0 0 0]) exists: 1 ssc: 0x55a83bf5cb00 snapset: 1ad=[]:{120=[102],132=[129],13f=[13d],15d=[157],16f=[16d,169,165,161,160],18d=[18d,18c,18a,187,182],195=[195,194,192,18f,18e],197=[197,196],1ad=[1ad,1ac,1a7,1a3,1a2,1a0,19f,19e,19d,19b]}
2022-02-18T09:43:35.363+0000 7fd96955b700 20 osd.0 pg_epoch: 529 pg[3.6s0( v 529'462 (0'0,529'462] local-lis/les=526/527 n=20 ec=26/26 lis/c=526/461 les/c/f=527/462/0 sis=526) [0,1,5,8,11,4]/[0,1,NONE,8,11,4]p0(0) async=[5(2)] r=0 lpr=526 pi=[461,526)/1 luod=520'459 crt=520'459 lcod 519'457 mlcod 0'0 active+recovering+undersized+remapped rops=6 mbc={0={(1+0)=8},1={(1+0)=8},2={(0+0)=2,(0+1)=6},3={(1+0)=8},4={(1+0)=8},5={(1+0)=8}} trimq=[13d~1,192~1,1a9~1] ps=[12c~1,132~1,15d~1,16a~1,17d~1]] recovery delayed on 3:60e7a2b9:::smithi01822269-15:head; could not get rw_manager lock

#7 Updated by Radoslaw Zarzynski about 2 years ago

  • Duplicates Bug #51076: "wait_for_recovery: failed before timeout expired" during thrashosd test with EC backend. added

#8 Updated by Laura Flores almost 2 years ago

  • Status changed from Need More Info to Resolved
