Bug #7121 (closed): nightlies: failed to become clean before timeout expired due to degraded pg

Added by David Zafman over 10 years ago. Updated over 10 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This issue has been seen before on the master branch, as well as in testing of wip-5858-rebase.

teuthology:/a/teuthology-2014-01-05_23:00:02-rados-master-testing-basic-plana/27394

Failed =================================================================
[27394] rados/thrash/{clusters/fixed-2.yaml fs/btrfs.yaml msgr-failures/osd-delay.yaml thrashers/mapgap.yaml workloads/snaps-few-objects.yaml}
-----------------------------------------------------------------
time: 3479s
log: http://qa-proxy.ceph.com/teuthology/teuthology-2014-01-05_23:00:02-rados-master-testing-basic-plana/27394/

failed to become clean before timeout expired

2014-01-06 12:35:43.941375 7fc3d8df6700 5 osd.0 pg_epoch: 951 pg[1.37( empty local-les=951 n=0 ec=1 les/c 873/873 950/950/950) [0] r=0 lpr=950 pi=870-949/2 mlcod 0'0 active+degraded] enter Started/Primary/Active/Activating
2014-01-06 12:35:43.954608 7fc3d8df6700 5 osd.0 pg_epoch: 951 pg[1.37( empty local-les=951 n=0 ec=1 les/c 873/873 950/950/950) [0] r=0 lpr=950 pi=870-949/2 mlcod 0'0 active+degraded] exit Started/Primary/Active/Activating 0.013233 2 0.000168
2014-01-06 12:35:43.954631 7fc3d8df6700 5 osd.0 pg_epoch: 951 pg[1.37( empty local-les=951 n=0 ec=1 les/c 873/873 950/950/950) [0] r=0 lpr=950 pi=870-949/2 mlcod 0'0 active+degraded] enter Started/Primary/Active/Recovered
2014-01-06 12:35:43.984567 7fc3d8df6700 5 osd.0 pg_epoch: 951 pg[1.37( empty local-les=951 n=0 ec=1 les/c 951/873 950/950/950) [0] r=0 lpr=950 pi=870-949/2 mlcod 0'0 active+degraded] exit Started/Primary/Active/Recovered 0.029937 2 0.000055
2014-01-06 12:35:43.984610 7fc3d8df6700 5 osd.0 pg_epoch: 951 pg[1.37( empty local-les=951 n=0 ec=1 les/c 951/873 950/950/950) [0] r=0 lpr=950 pi=870-949/2 mlcod 0'0 active+degraded] enter Started/Primary/Active/Clean

The up set should have been [0,3], but up and acting are [0] for some reason, and we aren't backfilling to 3.
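
For reference, a minimal sketch (not part of the original report) of commands that could be used to confirm a pg's mapping and the OSDs' state on a live cluster; the pg id 1.37 comes from the log above:

$ ceph pg map 1.37                        # osdmap epoch plus the pg's up and acting sets
$ ceph osd tree                           # osd up/down status and crush weights
$ ceph pg 1.37 query | grep -A 2 '"up"'   # the same mapping from the pg's own view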


Files

osd.map (4.84 KB) - Binary osd map - David Zafman, 01/09/2014 10:59 AM
#1

Updated by David Zafman over 10 years ago

  • File osd.map added
  • Assignee changed from David Zafman to Greg Farnum
crush map rule:

# rules
rule replicated_ruleset {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step choose firstn 0 type osd
    step emit
}

$ ./osdmaptool osd.map --test-map-pg 1.37
./osdmaptool: osdmap file 'osd.map'
parsed '1.37' -> 1.37
1.37 raw [0] up [0] acting [0]

osd.0 up in weight 1 up_from 835 up_thru 951 down_at 834 last_clean_interval [4,817) 10.214.133.25:6804/20507 10.214.133.25:6805/20507 10.214.133.25:6806/20507 10.214.133.25:6807/20507 exists,up 5e9ce54b-dab2-43ff-ba45-68efa3091382
osd.1 up out weight 0 up_from 870 up_thru 870 down_at 869 last_clean_interval [714,868) 10.214.133.25:6800/26507 10.214.133.25:6801/26507 10.214.133.25:6802/26507 10.214.133.25:6803/26507 exists,up 6e5a5df6-45ad-4b71-962a-aa883d36f7dd
osd.2 up out weight 0 up_from 930 up_thru 847 down_at 904 last_clean_interval [847,902) 10.214.133.25:6810/27615 10.214.133.25:6811/27615 10.214.133.25:6812/27615 10.214.133.25:6813/27615 exists,up 9577e15b-f0ab-4c65-8070-16b60d9fac25
osd.3 up in weight 1 up_from 772 up_thru 951 down_at 767 last_clean_interval [4,761) 10.214.132.22:6801/7982 10.214.132.22:6813/7982 10.214.132.22:6815/7982 10.214.132.22:6816/7982 exists,up 2a204e6e-e2e0-4a68-b6d2-9addcbb12bb6
osd.4 down out weight 0 up_from 290 up_thru 887 down_at 950 last_clean_interval [4,255) 10.214.132.22:6800/1244 10.214.132.22:6804/1244 10.214.132.22:6807/1244 10.214.132.22:6810/1244 exists d651871a-cea5-46e9-97d9-a289afe5a2d2
osd.5 up out weight 0 up_from 737 up_thru 835 down_at 736 last_clean_interval [664,727) 10.214.132.22:6802/5141 10.214.132.22:6803/5141 10.214.132.22:6814/5141 10.214.132.22:6818/5141 exists,up e0971dec-9a92-41bd-9126-5189f467f9d7
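
A hedged follow-up that could be run against the attached osd.map to see whether other pgs are also mapped to a single OSD (the summary format of --test-map-pgs varies between versions):

$ ./osdmaptool osd.map --test-map-pgs        # summarize how every pg maps with this osdmap
$ ./osdmaptool osd.map --test-map-pg 1.37    # re-check the single pg from the failure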

#2

Updated by Greg Farnum over 10 years ago

  • Assignee changed from Greg Farnum to Samuel Just
  • Source changed from other to Q/A

Sam suggests that maybe the retry behavior got changed erroneously with the recent refactor, and that that's what caused this. (He hasn't seen this test case fail this way before.)

#3

Updated by Sage Weil over 10 years ago

the behavior is the same as on dumpling, so this isn't a recent regression.

old tunables work for this pg, but tend to suck in general. i think the local retry behavior helps for extreme situations like this, but tends to skew the distribution in general.

the new tunables (with tries 50) do not work, obviously. changing total tries to 100 does work, though. a modest increase there could help, but i'm hesitant to bump it too high as it can burn a lot of cpu in extreme situations.
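
A rough sketch of how the total-tries experiment above could be reproduced offline against the attached osd.map; the value 100 is the one mentioned in this comment, and the expected two-OSD result is inferred from the description rather than taken from the ticket:

$ ./osdmaptool osd.map --export-crush crush.bin    # extract the crush map from the osdmap
$ ./crushtool -d crush.bin -o crush.txt            # decompile it to text
$ # edit crush.txt and set: tunable choose_total_tries 100
$ ./crushtool -c crush.txt -o crush.new            # recompile
$ ./osdmaptool osd.map --import-crush crush.new    # write the modified crush map back into the osdmap copy
$ ./osdmaptool osd.map --test-map-pg 1.37          # should now map to two OSDs, e.g. [0,3]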

not sure what we want to do here.. maybe just increase the min 'in' in the thrasher to 3 instead of 2?

#4

Updated by Sage Weil over 10 years ago

  • Status changed from New to 4
  • Assignee changed from Samuel Just to Sage Weil
  • Priority changed from Normal to Urgent
#5

Updated by Greg Farnum over 10 years ago

Yeah; our tests run a lot closer to the edges than users really do or should, so I'd just adjust the thrasher's min_in.
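
A hypothetical sketch of the kind of ceph-qa-suite tweak being suggested, written in the shape of the thrashers/mapgap.yaml fragment named in the failed run; the file name, yaml layout, and the min_in option are assumptions based on this discussion, not the actual change:

$ cat > thrashers/min-in.yaml <<'EOF'
tasks:
- thrashosds:
    # assumption: keep at least 3 OSDs "in" so crush can always find two for size-2 pgs
    min_in: 3
EOF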

#6

Updated by Sage Weil over 10 years ago

  • Status changed from 4 to Resolved