Bug #7121
nightlies: failed to become clean before timeout expired due to degraded pg
Description
This issue has been seen before on the master branch as well as during testing of wip-5858-rebase.
teuthology:/a/teuthology-2014-01-05_23:00:02-rados-master-testing-basic-plana/27394
Failed
=================================================================
[27394] rados/thrash/{clusters/fixed-2.yaml fs/btrfs.yaml msgr-failures/osd-delay.yaml thrashers/mapgap.yaml workloads/snaps-few-objects.yaml}
-----------------------------------------------------------------
time: 3479s
log: http://qa-proxy.ceph.com/teuthology/teuthology-2014-01-05_23:00:02-rados-master-testing-basic-plana/27394/
failed to become clean before timeout expired
2014-01-06 12:35:43.941375 7fc3d8df6700 5 osd.0 pg_epoch: 951 pg[1.37( empty local-les=951 n=0 ec=1 les/c 873/873 950/950/950) [0] r=0 lpr=950 pi=870-949/2 mlcod 0'0 active+degraded] enter Started/Primary/Active/Activating
2014-01-06 12:35:43.954608 7fc3d8df6700 5 osd.0 pg_epoch: 951 pg[1.37( empty local-les=951 n=0 ec=1 les/c 873/873 950/950/950) [0] r=0 lpr=950 pi=870-949/2 mlcod 0'0 active+degraded] exit Started/Primary/Active/Activating 0.013233 2 0.000168
2014-01-06 12:35:43.954631 7fc3d8df6700 5 osd.0 pg_epoch: 951 pg[1.37( empty local-les=951 n=0 ec=1 les/c 873/873 950/950/950) [0] r=0 lpr=950 pi=870-949/2 mlcod 0'0 active+degraded] enter Started/Primary/Active/Recovered
2014-01-06 12:35:43.984567 7fc3d8df6700 5 osd.0 pg_epoch: 951 pg[1.37( empty local-les=951 n=0 ec=1 les/c 951/873 950/950/950) [0] r=0 lpr=950 pi=870-949/2 mlcod 0'0 active+degraded] exit Started/Primary/Active/Recovered 0.029937 2 0.000055
2014-01-06 12:35:43.984610 7fc3d8df6700 5 osd.0 pg_epoch: 951 pg[1.37( empty local-les=951 n=0 ec=1 les/c 951/873 950/950/950) [0] r=0 lpr=950 pi=870-949/2 mlcod 0'0 active+degraded] enter Started/Primary/Active/Clean
The up set should have been [0,3], but up and acting are both [0] for some reason, so we aren't backfilling to osd.3.
History
#1 Updated by David Zafman over 9 years ago
- File osd.map added
- Assignee changed from David Zafman to Greg Farnum
- rules
rule replicated_ruleset {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step choose firstn 0 type osd
step emit
}
$ ./osdmaptool osd.map --test-map-pg 1.37
./osdmaptool: osdmap file 'osd.map'
parsed '1.37' -> 1.37
1.37 raw [0] up [0] acting [0]
osd.0 up in weight 1 up_from 835 up_thru 951 down_at 834 last_clean_interval [4,817) 10.214.133.25:6804/20507 10.214.133.25:6805/20507 10.214.133.25:6806/20507 10.214.133.25:6807/20507 exists,up 5e9ce54b-dab2-43ff-ba45-68efa3091382
osd.1 up out weight 0 up_from 870 up_thru 870 down_at 869 last_clean_interval [714,868) 10.214.133.25:6800/26507 10.214.133.25:6801/26507 10.214.133.25:6802/26507 10.214.133.25:6803/26507 exists,up 6e5a5df6-45ad-4b71-962a-aa883d36f7dd
osd.2 up out weight 0 up_from 930 up_thru 847 down_at 904 last_clean_interval [847,902) 10.214.133.25:6810/27615 10.214.133.25:6811/27615 10.214.133.25:6812/27615 10.214.133.25:6813/27615 exists,up 9577e15b-f0ab-4c65-8070-16b60d9fac25
osd.3 up in weight 1 up_from 772 up_thru 951 down_at 767 last_clean_interval [4,761) 10.214.132.22:6801/7982 10.214.132.22:6813/7982 10.214.132.22:6815/7982 10.214.132.22:6816/7982 exists,up 2a204e6e-e2e0-4a68-b6d2-9addcbb12bb6
osd.4 down out weight 0 up_from 290 up_thru 887 down_at 950 last_clean_interval [4,255) 10.214.132.22:6800/1244 10.214.132.22:6804/1244 10.214.132.22:6807/1244 10.214.132.22:6810/1244 exists d651871a-cea5-46e9-97d9-a289afe5a2d2
osd.5 up out weight 0 up_from 737 up_thru 835 down_at 736 last_clean_interval [664,727) 10.214.132.22:6802/5141 10.214.132.22:6803/5141 10.214.132.22:6814/5141 10.214.132.22:6818/5141 exists,up e0971dec-9a92-41bd-9126-5189f467f9d7
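The bad mapping can also be checked against the CRUSH map alone, not just the full osdmap. A sketch (filenames illustrative; assumes osdmaptool's --export-crush and crushtool's --test mode with --weight overrides behave as in this era of Ceph):

```sh
# Extract the CRUSH map from the attached osdmap and test rule 0
# (replicated_ruleset) at the pool's replica count of 2.
./osdmaptool osd.map --export-crush crush.map

# crushtool --test defaults devices to weight 1, so zero out the
# OSDs that are out/down in the osd dump above (1, 2, 4, 5).
# Only osd.0 and osd.3 are both up and in, so a good mapping
# would be [0,3]; --show-bad-mappings flags anything short.
./crushtool -i crush.map --test --rule 0 --num-rep 2 \
    --weight 1 0 --weight 2 0 --weight 4 0 --weight 5 0 \
    --show-bad-mappings
```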
#2 Updated by Greg Farnum over 9 years ago
- Assignee changed from Greg Farnum to Samuel Just
- Source changed from other to Q/A
Sam suggests that the retry behavior may have been changed erroneously in the recent refactor, and that that's what caused this failure. (He hasn't seen this test case fail this way before.)
#3 Updated by Sage Weil over 9 years ago
the behavior is the same as on dumpling, so this isn't a recent regression.
old tunables work for this pg, but tend to suck in general. i think the local retry behavior helps for extreme situations like this, but tends to skew the distribution in general.
the new tunables (with tries 50) do not work, obviously. changing total tries to 100 does work, though. a modest increase there could help, but i'm hesitant to bump it too high as it can burn a lot of cpu in extreme situations.
not sure what we want to do here... maybe just increase the min 'in' in the thrasher to 3 instead of 2?
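For reference, the total-tries experiment described above could be reproduced offline roughly as follows; a sketch, assuming crushtool's --set-choose-total-tries option is available in this vintage (filenames illustrative):

```sh
# Pull the CRUSH map out of the osdmap, raise the total retry
# budget from the default 50 to 100, and write a modified map.
./osdmaptool osd.map --export-crush crush.map
./crushtool -i crush.map --set-choose-total-tries 100 -o crush.new

# Splice the modified CRUSH map back in (this rewrites osd.map
# in place) and re-test the problem pg; with 100 tries the
# mapping should come back [0,3] instead of [0].
./osdmaptool osd.map --import-crush crush.new
./osdmaptool osd.map --test-map-pg 1.37
```

On a live cluster the equivalent would be `ceph osd getcrushmap` / `ceph osd setcrushmap`, at the cost of remapping data.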
#4 Updated by Sage Weil over 9 years ago
- Status changed from New to 4
- Assignee changed from Samuel Just to Sage Weil
- Priority changed from Normal to Urgent
#5 Updated by Greg Farnum over 9 years ago
Yeah; our tests run a lot closer to the edges than users really do or should, so I'd just adjust the thrasher's min_in.
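The adjustment Greg describes would be a small change in the rados thrash suite's yaml; a sketch, assuming the thrashosds task accepts a min_in option (fragment illustrative):

```yaml
tasks:
- thrashosds:
    # keep at least 3 osds in at all times so CRUSH can always
    # find a second replica even with the new tunables' try limit
    min_in: 3
```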
#6 Updated by Sage Weil over 9 years ago
- Status changed from 4 to Resolved