Bug #23647
openthrash-eio test can prevent recovery
0%
Description
We are injecting random EIOs. However, in a recovery situation an EIO leads us to decide the object is missing in one location. THat might be the only location, or we might also get an EIO on the other copy or copies.
Example is /a/sage-2018-04-10_19:17:57-rados-wip-sage-testing-2018-04-10-1032-distro-basic-smithi/2380817
pg is backfilling and gets EIO on the local copy. There are a few thousand objects, and by the end we have missing+unfound on 77 of them.
2018-04-10 21:59:11.585 7f53d1b5d700 7 osd.1 pg_epoch: 87 pg[2.5( v 40'2383 (0'0,40'2383] local-lis/les=85/86 n=2383 ec=18/18 lis/c 85/0 les/c/f 86/0/0 84/85/18) [1,4]/[1] async=[4] r=0 lpr=85 pi=[18,85)/1 rops=1 crt=40'2383 lcod 40'2382 mlcod 0'0 active+recovering+undersized+degraded+remapped mbc={255={(1+0)=2305}}] build_push_op 2:a773b497:::benchmark_data_smithi177_12993_object1639:head v 21'79 size 1048576 recovery_info: ObjectRecoveryInfo(2:a773b497:::benchmark_data_smithi177_12993_object1639:head@21'79, size: 1048576, copy_subset: [0~1048576], clone_subset: {}, snapset: 0=[]:{}) 2018-04-10 21:59:11.585 7f53d1b5d700 10 osd.1 pg_epoch: 87 pg[2.5( v 40'2383 (0'0,40'2383] local-lis/les=85/86 n=2383 ec=18/18 lis/c 85/0 les/c/f 86/0/0 84/85/18) [1,4]/[1] async=[4] r=0 lpr=85 pi=[18,85)/1 rops=1 crt=40'2383 lcod 40'2382 mlcod 0'0 active+recovering+undersized+degraded+remapped mbc={255={(1+0)=2305}}] start_pushes clean up peer 4 2018-04-10 21:59:11.585 7f53d1b5d700 0 osd.1 pg_epoch: 87 pg[2.5( v 40'2383 (0'0,40'2383] local-lis/les=85/86 n=2383 ec=18/18 lis/c 85/0 les/c/f 86/0/0 84/85/18) [1,4]/[1] async=[4] r=0 lpr=85 pi=[18,85)/1 rops=1 crt=40'2383 lcod 40'2382 mlcod 0'0 active+recovering+undersized+degraded+remapped mbc={255={(1+0)=2305}}] prep_object_replica_pushes Error -5 on oid 2:a773b497:::benchmark_data_smithi177_12993_object1639:head 2018-04-10 21:59:11.585 7f53d1b5d700 20 osd.1 pg_epoch: 87 pg[2.5( v 40'2383 (0'0,40'2383] local-lis/les=85/86 n=2383 ec=18/18 lis/c 85/0 les/c/f 86/0/0 84/85/18) [1,4]/[1] async=[4] r=0 lpr=85 pi=[18,85)/1 rops=1 crt=40'2383 lcod 40'2382 mlcod 0'0 active+recovering+undersized+degraded+remapped mbc={255={(1+0)=2305}}] failed_push: 2:a773b497:::benchmark_data_smithi177_12993_object1639:head 2018-04-10 21:59:11.585 7f53d1b5d700 0 osd.1 pg_epoch: 87 pg[2.5( v 40'2383 (0'0,40'2383] local-lis/les=85/86 n=2383 ec=18/18 lis/c 85/0 les/c/f 86/0/0 84/85/18) [1,4]/[1] async=[4] r=0 lpr=85 pi=[18,85)/1 rops=1 crt=40'2383 lcod 40'2382 mlcod 0'0 active+recovering+undersized+degraded+remapped mbc={255={(0+0)=1,(1+0)=2304}}] failed_push 2:a773b497:::benchmark_data_smithi177_12993_object1639:head from shard 1, reps on unfound? 1 2018-04-10 21:59:11.585 7f53d1b5d700 10 osd.1 pg_epoch: 87 pg[2.5( v 40'2383 (0'0,40'2383] local-lis/les=85/86 n=2383 ec=18/18 lis/c 85/0 les/c/f 86/0/0 84/85/18) [1,4]/[1] async=[4] r=0 lpr=85 pi=[18,85)/1 rops=1 crt=40'2383 lcod 40'2382 mlcod 0'0 active+recovering+undersized+degraded+remapped mbc={255={(0+0)=1,(1+0)=2304}}] finish_recovery_op 2:a773b497:::benchmark_data_smithi177_12993_object1639:head 2018-04-10 21:59:11.585 7f53d1b5d700 10 osd.1 87 finish_recovery_op pg[2.5( v 40'2383 (0'0,40'2383] local-lis/les=85/86 n=2383 ec=18/18 lis/c 85/0 les/c/f 86/0/0 84/85/18) [1,4]/[1] async=[4] r=0 lpr=85 pi=[18,85)/1 crt=40'2383 lcod 40'2382 mlcod 0'0 active+recovering+undersized+degraded+remapped mbc={255={(0+0)=1,(1+0)=2304}}] 2:a773b497:::benchmark_data_smithi177_12993_object1639:head dequeue=0 (1/3 rops) 2018-04-10 21:59:11.585 7f53d1b5d700 10 osd.1 pg_epoch: 87 pg[2.5( v 40'2383 (0'0,40'2383] local-lis/les=85/86 n=2383 ec=18/18 lis/c 85/0 les/c/f 86/0/0 84/85/18) [1,4]/[1] async=[4] r=0 lpr=85 pi=[18,85)/1 crt=40'2383 lcod 40'2382 mlcod 0'0 active+recovering+undersized+degraded+remapped mbc={255={(0+0)=1,(1+0)=2304}}] queue_recovery -- queuing 2018-04-10 21:59:11.585 7f53d1b5d700 20 osd.1 op_wq(5) _enqueue OpQueueItem(2.5 PGRecovery(pgid=2.5epoch_queued=87reserved_pushes=1) prio 5 cost 20971520 e87 reserved_pushes 1) 2018-04-10 21:59:11.585 7f53d1b5d700 -1 log_channel(cluster) log [ERR] : 2.5 missing primary copy of 2:a773b497:::benchmark_data_smithi177_12993_object1639:head, unfound 2018-04-10 21:59:11.585 7f53d1b5d700 10 osd.1 pg_epoch: 87 pg[2.5( v 40'2383 (0'0,40'2383] local-lis/les=85/86 n=2383 ec=18/18 lis/c 85/0 les/c/f 86/0/0 84/85/18) [1,4]/[1] async=[4] r=0 lpr=85 pi=[18,85)/1 crt=40'2383 lcod 40'2382 mlcod 0'0 active+recovering+undersized+degraded+remapped m=1 u=1 mbc={255={(0+0)=1,(1+0)=2304}}] recover_replicas: recover_object_replicas(2:adc0de62:::benchmark_data_smithi177_12993_object1647:head)
Updated by Greg Farnum over 4 years ago
Several proposals that might improve things:
- from Josh, just turn down the odds
- from Greg, is it plausible to limit the error injection range so that only a single shard can EIO on a given object?
- from David, we could make tests with random EIO injection clear out the missing sets and retry.
Updated by Neha Ojha over 4 years ago
- Status changed from 12 to In Progress
http://pulpito.ceph.com/nojha-2019-09-06_14:33:54-rados:singleton-wip-41385-3-distro-basic-smithi/ - this is where I noticed such a failure in 100 runs of the same test.
rados:singleton/{all/thrash-eio.yaml msgr-failures/few.yaml msgr/async-v1only.yaml objectstore/bluestore-low-osd-mem-target.yaml rados.yaml supported-random-distro$/{ubuntu_latest.yaml}}