Bug #5059

PGs can get stuck degraded if OSD removed before being out

Added by David Zafman almost 11 years ago. Updated almost 11 years ago.

Status: Won't Fix
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

If an OSD goes down and the user marks it lost and/or removes it before it is marked out, the affected PGs are left degraded and new replica locations are never selected. This is the state I got into.
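
The rough sequence that leads here looks something like this (a sketch only; the OSD id is illustrative, not necessarily the exact commands run):

$ ./ceph osd down 1                          # or the osd daemon simply dies
$ ./ceph osd lost 1 --yes-i-really-mean-it   # optionally mark it lost
$ ./ceph osd rm 1                            # removed while still "in"; "ceph osd out 1" was never run

The pg dump below shows the resulting state.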

$ ./ceph pg dump
dumped all in format plain
version 300
stamp 2013-05-13 17:58:02.994100
last_osdmap_epoch 121
last_pg_scan 17
full_ratio 0.95
nearfull_ratio 0.85
pg_stat objects mip degr unf bytes log disklog state state_stamp v reported up acting last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
3.4 11 0 11 0 154 0 0 active+degraded 2013-05-13 17:53:32.545272 117'242 17'357 [1] [1] 0'0 0.000000 0'0 0.000000
3.5 20 0 20 0 279 0 0 active+degraded 2013-05-13 17:53:32.564806 102'440 17'551 [2] [2] 0'0 0.000000 0'0 0.000000
3.6 12 0 12 0 168 0 0 active+degraded 2013-05-13 17:53:32.549988 115'264 119'384 [1] [1] 0'0 0.000000 0'0 0.000000
3.7 10 0 10 0 138 0 0 active+degraded 2013-05-13 17:53:32.540431 111'220 119'340 [0] [0] 0'0 0.000000 0'0 0.000000
3.0 18 0 18 0 8236 0 0 active+degraded 2013-05-13 17:53:32.558160 118'376 17'508 [0] [0] 0'0 0.000000 0'0 0.000000
3.1 8 0 0 0 110 0 0 active+clean 2013-05-13 17:41:44.205230 116'176 17'282 [1,0] [1,0] 0'0 0.000000 0'0 0.000000
3.2 10 0 0 0 140 0 0 active+clean 2013-05-13 17:41:44.238774 103'220 17'312 [3,0] [3,0] 0'0 0.000000 0'0 0.000000
3.3 12 0 12 0 167 0 0 active+degraded 2013-05-13 17:53:32.596090 112'264 17'368 [2] [2] 0'0 0.000000 0'0 0.000000
pool 3 101 0 83 0 9392 0 0
sum 101 0 83 0 9392 0 0
osdstat kbused kbavail kb hb in hb out
0 17174964 2682232 20905820 [] []
1 17174964 2682232 20905820 [0] []
2 17174964 2682232 20905820 [] []
3 17174964 2682232 20905820 [0] []
sum 68699856 10728928 83623280

History

#1 Updated by Sage Weil almost 11 years ago

What does 'ceph osd tree' say? Usually stuck degraded happens because there aren't enough up/in OSDs.

#2 Updated by David Zafman almost 11 years ago

  • Subject changed from PGs can get stuck degraded if OSD marked lost or removed before being out to PGs can get stuck degraded if OSD removed before being out

CORRECTION: A lost OSD can be marked out, and CRUSH will then recalculate the replica locations. But if the administrator accidentally removes the OSD before marking it out, I am not sure how to fix it.
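
For contrast, the intended cleanup order (a sketch, assuming the failed OSD is osd.3) would be:

$ ./ceph osd out 3   # mark it out first so CRUSH picks new replica locations
$ ./ceph osd rm 3    # then remove it from the osdmap

The problem arises when the rm happens before the out, as reproduced below.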

I killed osd.3. I think we can ignore osd.4 and osd.5, which are out and were part of a previous test.

dzafman@ubuntu:~/ceph/src$ ./ceph -s
health HEALTH_WARN 1 pgs stale; 1/4 in osds are down
monmap e1: 3 mons at {a=127.0.0.1:6789/0,b=127.0.0.1:6790/0,c=127.0.0.1:6791/0}, election epoch 4, quorum 0,1,2 a,b,c
osdmap e133: 6 osds: 3 up, 4 in
pgmap v252: 8 pgs: 7 active+clean, 1 stale+active+clean; 9392 bytes data, 67510 MB used, 10056 MB / 81663 MB avail
mdsmap e10: 3/3/3 up {0=a=up:active,1=c=up:active,2=b=up:active}

dzafman@ubuntu:~/ceph/src$ ./ceph osd rm 3
removed osd.3
dzafman@ubuntu:~/ceph/src$ ./ceph osd tree

# id weight type name up/down reweight
-1 6 root default
-3 6 rack localrack
-2 6 host localhost
0 1 osd.0 up 1
1 1 osd.1 up 1
2 1 osd.2 up 1
3 1 osd.3 DNE
4 1 osd.4 down 0
5 1 osd.5 down 0

dzafman@ubuntu:~/ceph/src$ ./ceph pg dump
dumped all in format plain
version 258
stamp 2013-05-15 23:21:22.071919
last_osdmap_epoch 134
last_pg_scan 19
full_ratio 0.95
nearfull_ratio 0.85
pg_stat objects mip degr unf bytes log disklog state state_stamp v reported up acting last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
3.4 11 0 0 0 154 0 0 active+clean 2013-05-15 23:17:52.460277 119'242 19'378 [1,2] [1,2] 0'0 2013-05-15 23:10:55.794592 0'0 2013-05-15 23:10:55.794592
3.5 20 0 0 0 279 0 0 active+clean 2013-05-15 23:15:21.787426 104'440 19'611 [2,0] [2,0] 0'0 2013-05-15 23:10:55.806434 0'0 2013-05-15 23:10:55.806434
3.6 12 0 12 0 168 0 0 active+degraded 2013-05-15 23:21:04.169123 117'264 126'418 [1] [1] 0'0 2013-05-15 23:10:55.797912 0'0 2013-05-15 23:10:55.797912
3.7 10 0 0 0 138 0 0 active+clean 2013-05-15 23:17:52.254671 113'220 126'372 [0,2] [0,2] 0'0 2013-05-15 23:10:55.798423 0'0 2013-05-15 23:10:55.798423
3.0 18 0 18 0 8236 0 0 active+degraded 2013-05-15 23:21:04.169924 120'376 19'544 [0] [0] 0'0 2013-05-15 23:10:55.794384 0'0 2013-05-15 23:10:55.794384
3.1 8 0 0 0 110 0 0 active+clean 2013-05-15 23:17:52.419279 118'176 121'315 [0,1] [0,1] 0'0 2013-05-15 23:10:55.808257 0'0 2013-05-15 23:10:55.808257
3.2 10 0 10 0 140 0 0 active+degraded 2013-05-15 23:21:04.173774 105'220 132'295 [0] [0] 0'0 2013-05-15 23:10:55.808851 0'0 2013-05-15 23:10:55.808851
3.3 12 0 0 0 167 0 0 active+clean 2013-05-15 23:17:52.282817 114'264 19'411 [2,1] [2,1] 0'0 2013-05-15 23:10:55.805762 0'0 2013-05-15 23:10:55.805762
pool 3 101 0 40 0 9392 0 0
sum 101 0 40 0 9392 0 0
osdstat kbused kbavail kb hb in hb out
0 17283740 2573456 20905820 [1,2] []
1 17283796 2573400 20905820 [2] []
2 17283952 2573244 20905820 [0,1] []
sum 51851488 7720100 62717460

dzafman@ubuntu:~/ceph/src$ ./ceph osd out 3
osd.3 does not exist.

It looks like I can't clean up after osd.3 now that it was removed.
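
One thing that might still work, though I have not verified it here, is dropping the stale entry from the CRUSH map so placements get recalculated (this is a guess at the remaining cleanup, not a confirmed fix):

$ ./ceph osd crush remove osd.3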

#3 Updated by Sage Weil almost 11 years ago

  • Status changed from New to Won't Fix
