Bug #7121
nightlies: failed to become clean before timeout expired due to degraded pg
Description
This issue has been seen before on the master branch as well as during testing of wip-5858-rebase.
teuthology:/a/teuthology-2014-01-05_23:00:02-rados-master-testing-basic-plana/27394
Failed
=================================================================
[27394] rados/thrash/{clusters/fixed-2.yaml fs/btrfs.yaml msgr-failures/osd-delay.yaml thrashers/mapgap.yaml workloads/snaps-few-objects.yaml}
-----------------------------------------------------------------
time: 3479s
log: http://qa-proxy.ceph.com/teuthology/teuthology-2014-01-05_23:00:02-rados-master-testing-basic-plana/27394/
failed to become clean before timeout expired
2014-01-06 12:35:43.941375 7fc3d8df6700 5 osd.0 pg_epoch: 951 pg[1.37( empty local-les=951 n=0 ec=1 les/c 873/873 950/950/950) [0] r=0 lpr=950 pi=870-949/2 mlcod 0'0 active+degraded] enter Started/Primary/Active/Activating
2014-01-06 12:35:43.954608 7fc3d8df6700 5 osd.0 pg_epoch: 951 pg[1.37( empty local-les=951 n=0 ec=1 les/c 873/873 950/950/950) [0] r=0 lpr=950 pi=870-949/2 mlcod 0'0 active+degraded] exit Started/Primary/Active/Activating 0.013233 2 0.000168
2014-01-06 12:35:43.954631 7fc3d8df6700 5 osd.0 pg_epoch: 951 pg[1.37( empty local-les=951 n=0 ec=1 les/c 873/873 950/950/950) [0] r=0 lpr=950 pi=870-949/2 mlcod 0'0 active+degraded] enter Started/Primary/Active/Recovered
2014-01-06 12:35:43.984567 7fc3d8df6700 5 osd.0 pg_epoch: 951 pg[1.37( empty local-les=951 n=0 ec=1 les/c 951/873 950/950/950) [0] r=0 lpr=950 pi=870-949/2 mlcod 0'0 active+degraded] exit Started/Primary/Active/Recovered 0.029937 2 0.000055
2014-01-06 12:35:43.984610 7fc3d8df6700 5 osd.0 pg_epoch: 951 pg[1.37( empty local-les=951 n=0 ec=1 les/c 951/873 950/950/950) [0] r=0 lpr=950 pi=870-949/2 mlcod 0'0 active+degraded] enter Started/Primary/Active/Clean
The up set should have been [0,3], but up and acting are both [0] for some reason, so we aren't backfilling to osd.3.
History
#1 Updated by David Zafman over 9 years ago
- File osd.map added
- Assignee changed from David Zafman to Greg Farnum
- rules
rule replicated_ruleset {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step choose firstn 0 type osd
step emit
}
$ ./osdmaptool osd.map --test-map-pg 1.37
./osdmaptool: osdmap file 'osd.map'
parsed '1.37' -> 1.37
1.37 raw [0] up [0] acting [0]
osd.0 up in weight 1 up_from 835 up_thru 951 down_at 834 last_clean_interval [4,817) 10.214.133.25:6804/20507 10.214.133.25:6805/20507 10.214.133.25:6806/20507 10.214.133.25:6807/20507 exists,up 5e9ce54b-dab2-43ff-ba45-68efa3091382
osd.1 up out weight 0 up_from 870 up_thru 870 down_at 869 last_clean_interval [714,868) 10.214.133.25:6800/26507 10.214.133.25:6801/26507 10.214.133.25:6802/26507 10.214.133.25:6803/26507 exists,up 6e5a5df6-45ad-4b71-962a-aa883d36f7dd
osd.2 up out weight 0 up_from 930 up_thru 847 down_at 904 last_clean_interval [847,902) 10.214.133.25:6810/27615 10.214.133.25:6811/27615 10.214.133.25:6812/27615 10.214.133.25:6813/27615 exists,up 9577e15b-f0ab-4c65-8070-16b60d9fac25
osd.3 up in weight 1 up_from 772 up_thru 951 down_at 767 last_clean_interval [4,761) 10.214.132.22:6801/7982 10.214.132.22:6813/7982 10.214.132.22:6815/7982 10.214.132.22:6816/7982 exists,up 2a204e6e-e2e0-4a68-b6d2-9addcbb12bb6
osd.4 down out weight 0 up_from 290 up_thru 887 down_at 950 last_clean_interval [4,255) 10.214.132.22:6800/1244 10.214.132.22:6804/1244 10.214.132.22:6807/1244 10.214.132.22:6810/1244 exists d651871a-cea5-46e9-97d9-a289afe5a2d2
osd.5 up out weight 0 up_from 737 up_thru 835 down_at 736 last_clean_interval [664,727) 10.214.132.22:6802/5141 10.214.132.22:6803/5141 10.214.132.22:6814/5141 10.214.132.22:6818/5141 exists,up e0971dec-9a92-41bd-9126-5189f467f9d7
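The bad mapping can also be checked against the CRUSH map alone, not just the full osdmap. A sketch (filenames illustrative; assumes osdmaptool's --export-crush and crushtool's --test mode with --weight overrides behave as in this era of Ceph):

```sh
# Extract the CRUSH map from the attached osdmap and test rule 0
# (replicated_ruleset) at the pool's replica count of 2.
./osdmaptool osd.map --export-crush crush.map

# crushtool --test defaults devices to weight 1, so zero out the
# OSDs that are out/down in the osd dump above (1, 2, 4, 5).
# Only osd.0 and osd.3 are both up and in, so a good mapping
# would be [0,3]; --show-bad-mappings flags anything short.
./crushtool -i crush.map --test --rule 0 --num-rep 2 \
    --weight 1 0 --weight 2 0 --weight 4 0 --weight 5 0 \
    --show-bad-mappings
```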
#2 Updated by Greg Farnum over 9 years ago
- Assignee changed from Greg Farnum to Samuel Just
- Source changed from other to Q/A
Sam suggests that the retry behavior may have been changed erroneously in the recent refactor, and that that's what caused this failure. (He hasn't seen this test case fail this way before.)
#3 Updated by Sage Weil over 9 years ago
the behavior is the same as on dumpling, so this isn't a recent regression.
old tunables work for this pg, but tend to suck in general. i think the local retry behavior helps for extreme situations like this, but tends to skew the distribution in general.
the new tunables (with tries 50) do not work, obviously. changing total tries to 100 does work, though. a modest increase there could help, but i'm hesitant to bump it too high as it can burn a lot of cpu in extreme situations.
not sure what we want to do here... maybe just increase the min 'in' in the thrasher to 3 instead of 2?
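For reference, the total-tries experiment described above could be reproduced offline roughly as follows; a sketch, assuming crushtool's --set-choose-total-tries option is available in this vintage (filenames illustrative):

```sh
# Pull the CRUSH map out of the osdmap, raise the total retry
# budget from the default 50 to 100, and write a modified map.
./osdmaptool osd.map --export-crush crush.map
./crushtool -i crush.map --set-choose-total-tries 100 -o crush.new

# Splice the modified CRUSH map back in (this rewrites osd.map
# in place) and re-test the problem pg; with 100 tries the
# mapping should come back [0,3] instead of [0].
./osdmaptool osd.map --import-crush crush.new
./osdmaptool osd.map --test-map-pg 1.37
```

On a live cluster the equivalent would be `ceph osd getcrushmap` / `ceph osd setcrushmap`, at the cost of remapping data.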
#4 Updated by Sage Weil over 9 years ago
- Status changed from New to 4
- Assignee changed from Samuel Just to Sage Weil
- Priority changed from Normal to Urgent
#5 Updated by Greg Farnum over 9 years ago
Yeah; our tests run a lot closer to the edges than users really do or should, so I'd just adjust the thrasher's min_in.
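The adjustment Greg describes would be a small change in the rados thrash suite's yaml; a sketch, assuming the thrashosds task accepts a min_in option (fragment illustrative):

```yaml
tasks:
- thrashosds:
    # keep at least 3 osds in at all times so CRUSH can always
    # find a second replica even with the new tunables' try limit
    min_in: 3
```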
#6 Updated by Sage Weil over 9 years ago
- Status changed from 4 to Resolved