Bug #23647

thrash-eio test can prevent recovery

Added by Sage Weil almost 6 years ago. Updated over 3 years ago.

Status:
In Progress
Priority:
Normal
Assignee:
Neha Ojha
Category:
-
Target version:
-
% Done:
0%

Regression:
No
Severity:
3 - minor

Description

We are injecting random EIOs. In a recovery situation, however, an EIO leads us to decide the object is missing in one location. That might be the only location, or we might also get an EIO on the other copy or copies, at which point the object is declared unfound.
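
For intuition, here is a minimal sketch of the bookkeeping this implies. It is not the actual PrimaryLogPG code; the type and function names are made up for illustration. Each EIO eliminates one candidate location for an object, and once no locations remain the object is unfound.

    #include <cstdio>
    #include <set>
    #include <string>

    // Hypothetical, simplified model of recovery source tracking: each good
    // copy of an object is a candidate source, an injected EIO on a read
    // removes that source, and an object with no sources left is unfound.
    struct RecoveringObject {
        std::string oid;
        std::set<int> sources;   // OSDs believed to hold a good copy

        // Reading the copy on 'osd' returned -EIO (logged as "Error -5").
        // Returns true if the object has just become unfound.
        bool failed_push(int osd) {
            sources.erase(osd);
            return sources.empty();
        }
    };

    int main() {
        // In the log below the acting set is [1] and osd.4 is an async
        // recovery target with no copy yet, so the primary is the only source.
        RecoveringObject obj{"benchmark_data_smithi177_12993_object1639", {1}};
        if (obj.failed_push(1))
            std::printf("%s is unfound\n", obj.oid.c_str());
    }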

An example is /a/sage-2018-04-10_19:17:57-rados-wip-sage-testing-2018-04-10-1032-distro-basic-smithi/2380817

The PG is backfilling and gets an EIO on the local copy. There are a few thousand objects, and by the end we have missing+unfound on 77 of them.

2018-04-10 21:59:11.585 7f53d1b5d700  7 osd.1 pg_epoch: 87 pg[2.5( v 40'2383 (0'0,40'2383] local-lis/les=85/86 n=2383 ec=18/18 lis/c 85/0 les/c/f 86/0/0 84/85/18) [1,4]/[1] async=[4] r=0 lpr=85 pi=[18,85)/1 rops=1 crt=40'2383 lcod 40'2382 mlcod 0'0 active+recovering+undersized+degraded+remapped mbc={255={(1+0)=2305}}] build_push_op 2:a773b497:::benchmark_data_smithi177_12993_object1639:head v 21'79 size 1048576 recovery_info: ObjectRecoveryInfo(2:a773b497:::benchmark_data_smithi177_12993_object1639:head@21'79, size: 1048576, copy_subset: [0~1048576], clone_subset: {}, snapset: 0=[]:{})
2018-04-10 21:59:11.585 7f53d1b5d700 10 osd.1 pg_epoch: 87 pg[2.5( v 40'2383 (0'0,40'2383] local-lis/les=85/86 n=2383 ec=18/18 lis/c 85/0 les/c/f 86/0/0 84/85/18) [1,4]/[1] async=[4] r=0 lpr=85 pi=[18,85)/1 rops=1 crt=40'2383 lcod 40'2382 mlcod 0'0 active+recovering+undersized+degraded+remapped mbc={255={(1+0)=2305}}] start_pushes clean up peer 4
2018-04-10 21:59:11.585 7f53d1b5d700  0 osd.1 pg_epoch: 87 pg[2.5( v 40'2383 (0'0,40'2383] local-lis/les=85/86 n=2383 ec=18/18 lis/c 85/0 les/c/f 86/0/0 84/85/18) [1,4]/[1] async=[4] r=0 lpr=85 pi=[18,85)/1 rops=1 crt=40'2383 lcod 40'2382 mlcod 0'0 active+recovering+undersized+degraded+remapped mbc={255={(1+0)=2305}}] prep_object_replica_pushes Error -5 on oid 2:a773b497:::benchmark_data_smithi177_12993_object1639:head
2018-04-10 21:59:11.585 7f53d1b5d700 20 osd.1 pg_epoch: 87 pg[2.5( v 40'2383 (0'0,40'2383] local-lis/les=85/86 n=2383 ec=18/18 lis/c 85/0 les/c/f 86/0/0 84/85/18) [1,4]/[1] async=[4] r=0 lpr=85 pi=[18,85)/1 rops=1 crt=40'2383 lcod 40'2382 mlcod 0'0 active+recovering+undersized+degraded+remapped mbc={255={(1+0)=2305}}] failed_push: 2:a773b497:::benchmark_data_smithi177_12993_object1639:head
2018-04-10 21:59:11.585 7f53d1b5d700  0 osd.1 pg_epoch: 87 pg[2.5( v 40'2383 (0'0,40'2383] local-lis/les=85/86 n=2383 ec=18/18 lis/c 85/0 les/c/f 86/0/0 84/85/18) [1,4]/[1] async=[4] r=0 lpr=85 pi=[18,85)/1 rops=1 crt=40'2383 lcod 40'2382 mlcod 0'0 active+recovering+undersized+degraded+remapped mbc={255={(0+0)=1,(1+0)=2304}}] failed_push 2:a773b497:::benchmark_data_smithi177_12993_object1639:head from shard 1, reps on  unfound? 1
2018-04-10 21:59:11.585 7f53d1b5d700 10 osd.1 pg_epoch: 87 pg[2.5( v 40'2383 (0'0,40'2383] local-lis/les=85/86 n=2383 ec=18/18 lis/c 85/0 les/c/f 86/0/0 84/85/18) [1,4]/[1] async=[4] r=0 lpr=85 pi=[18,85)/1 rops=1 crt=40'2383 lcod 40'2382 mlcod 0'0 active+recovering+undersized+degraded+remapped mbc={255={(0+0)=1,(1+0)=2304}}] finish_recovery_op 2:a773b497:::benchmark_data_smithi177_12993_object1639:head
2018-04-10 21:59:11.585 7f53d1b5d700 10 osd.1 87 finish_recovery_op pg[2.5( v 40'2383 (0'0,40'2383] local-lis/les=85/86 n=2383 ec=18/18 lis/c 85/0 les/c/f 86/0/0 84/85/18) [1,4]/[1] async=[4] r=0 lpr=85 pi=[18,85)/1 crt=40'2383 lcod 40'2382 mlcod 0'0 active+recovering+undersized+degraded+remapped mbc={255={(0+0)=1,(1+0)=2304}}] 2:a773b497:::benchmark_data_smithi177_12993_object1639:head dequeue=0 (1/3 rops)
2018-04-10 21:59:11.585 7f53d1b5d700 10 osd.1 pg_epoch: 87 pg[2.5( v 40'2383 (0'0,40'2383] local-lis/les=85/86 n=2383 ec=18/18 lis/c 85/0 les/c/f 86/0/0 84/85/18) [1,4]/[1] async=[4] r=0 lpr=85 pi=[18,85)/1 crt=40'2383 lcod 40'2382 mlcod 0'0 active+recovering+undersized+degraded+remapped mbc={255={(0+0)=1,(1+0)=2304}}] queue_recovery -- queuing
2018-04-10 21:59:11.585 7f53d1b5d700 20 osd.1 op_wq(5) _enqueue OpQueueItem(2.5 PGRecovery(pgid=2.5epoch_queued=87reserved_pushes=1) prio 5 cost 20971520 e87 reserved_pushes 1)
2018-04-10 21:59:11.585 7f53d1b5d700 -1 log_channel(cluster) log [ERR] : 2.5 missing primary copy of 2:a773b497:::benchmark_data_smithi177_12993_object1639:head, unfound
2018-04-10 21:59:11.585 7f53d1b5d700 10 osd.1 pg_epoch: 87 pg[2.5( v 40'2383 (0'0,40'2383] local-lis/les=85/86 n=2383 ec=18/18 lis/c 85/0 les/c/f 86/0/0 84/85/18) [1,4]/[1] async=[4] r=0 lpr=85 pi=[18,85)/1 crt=40'2383 lcod 40'2382 mlcod 0'0 active+recovering+undersized+degraded+remapped m=1 u=1 mbc={255={(0+0)=1,(1+0)=2304}}] recover_replicas: recover_object_replicas(2:adc0de62:::benchmark_data_smithi177_12993_object1647:head)
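
In the log, "Error -5" is -EIO from the injected read error; failed_push then drops the primary (shard 1) as a source, and because the async target osd.4 has no copy of the object yet, the source set is empty and the object is immediately unfound.

The rough arithmetic behind "77 of a few thousand" can be checked with a short simulation. This is a sketch under stated assumptions: each object has a single remaining good copy that is read once during recovery, and the read EIOs independently with probability p. The value p = 0.032 is a hypothetical stand-in picked to match 77/2383; the real injection probability comes from the thrash-eio configuration and is not shown in this excerpt.

    #include <cstdio>
    #include <random>

    int main() {
        const int n = 2383;      // objects in the PG (n=2383 in the log)
        const double p = 0.032;  // assumed per-read EIO probability
        std::mt19937 rng(42);
        std::bernoulli_distribution eio(p);

        // One recovery read per object; any EIO leaves the object unfound.
        int unfound = 0;
        for (int i = 0; i < n; ++i)
            if (eio(rng))
                ++unfound;

        std::printf("expected ~%.0f unfound, simulated %d of %d\n",
                    p * n, unfound, n);
    }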

History

#1 Updated by Greg Farnum over 4 years ago

Several proposals that might improve things:

  • from Josh, just turn down the odds of injecting an EIO
  • from Greg, is it plausible to limit the error injection range so that only a single shard can EIO on a given object? (a sketch of this follows the list)
  • from David, we could make tests with random EIO injection clear out the missing sets and retry.
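
A possible shape for Greg's idea, as a sketch only: derive, deterministically from the object name, the single shard that is allowed to inject an EIO for it. The helper eio_allowed_on_shard and everything around it is hypothetical, not existing Ceph injection code.

    #include <cstdio>
    #include <functional>
    #include <string>

    // Hash the object name to pick the one shard permitted to inject EIO
    // for that object. All other shards always serve the read, so at most
    // one copy of any object can fail and no object can become unfound.
    static bool eio_allowed_on_shard(const std::string &oid, int shard,
                                     int num_shards) {
        return std::hash<std::string>{}(oid) % static_cast<size_t>(num_shards)
               == static_cast<size_t>(shard);
    }

    int main() {
        const std::string oid = "benchmark_data_smithi177_12993_object1639";
        const int num_shards = 2;  // e.g. a 2-replica pool
        for (int shard = 0; shard < num_shards; ++shard)
            std::printf("shard %d may inject EIO for %s: %s\n", shard,
                        oid.c_str(),
                        eio_allowed_on_shard(oid, shard, num_shards) ? "yes" : "no");
    }

Injection on the permitted shard would stay probabilistic; the point is only that the permitted set has size one per object. Josh's and David's proposals are test-side knobs (a lower injection probability; clearing missing sets and retrying) and would need no such change.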

#2 Updated by Neha Ojha over 4 years ago

  • Assignee set to Neha Ojha

#3 Updated by Neha Ojha over 4 years ago

  • Status changed from 12 to In Progress

http://pulpito.ceph.com/nojha-2019-09-06_14:33:54-rados:singleton-wip-41385-3-distro-basic-smithi/ - this is where I hit this failure while running the same test 100 times.

rados:singleton/{all/thrash-eio.yaml msgr-failures/few.yaml msgr/async-v1only.yaml objectstore/bluestore-low-osd-mem-target.yaml rados.yaml supported-random-distro$/{ubuntu_latest.yaml}}

#4 Updated by Neha Ojha over 3 years ago

  • Priority changed from High to Normal
