Bug #9911
ceph not placing replicas to OSDs on same host as down/out OSD
Status: Closed
Description
On a 3-node firefly cluster with 6 OSDs per host and 3x replication, when noup is set and one OSD is marked down/out, a number of PGs become active+degraded and never recover, with only 2 OSDs in the acting set.
The crush rule governing replication is:
[pre]
rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}
[/pre]
Changing the rule to "type osd" fixes the problem, as might be expected, though presumably "type host" should also work in this scenario. Attached are a pg dump, an osd map, and a crush map.
An example reproduced pg using osdmaptool also shows only 2 OSDs in the acting set:
[pre]
regression@plana15:/tmp/cbt/ceph/log$ osdmaptool /tmp/osd.map --test-map-pg 1.ffe
osdmaptool: osdmap file '/tmp/osd.map'
parsed '1.ffe' -> 1.ffe
1.ffe raw ([9,3], p9) up ([9,3], p9) acting ([9,3], p9)
[/pre]
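As a rough illustration of the symptom (a toy sketch only, not CRUSH's actual placement algorithm; the host names, OSD ids, and the `draw`/`place` helpers below are hypothetical), a chooseleaf-like walk that hashes into each host independently of which OSDs are up will simply drop a replica whenever its in-host pick lands on the out OSD, leaving a 2-OSD acting set:

```python
# Toy model of the symptom, NOT CRUSH's real algorithm: three hosts of
# six OSDs each, pick one OSD per distinct host ("chooseleaf"-like).
# The host-internal pick is keyed only on (pg, host), so when the
# picked OSD is out the walk does not retry within that host, and the
# PG is left with only two replicas.
import hashlib

def draw(*args):
    # deterministic stand-in for CRUSH's hash (hypothetical helper)
    data = ",".join(map(str, args)).encode()
    return int(hashlib.sha256(data).hexdigest(), 16)

def place(pg, hosts, out_osds, num_rep=3):
    names = sorted(hosts)
    chosen_hosts, acting = [], []
    for r in range(num_rep):
        # firstn over hosts: retry on collision with an already-chosen host
        for attempt in range(50):
            host = names[draw(pg, r, attempt) % len(names)]
            if host not in chosen_hosts:
                break
        chosen_hosts.append(host)
        # descend to one OSD; an out OSD means this host contributes nothing
        osd = hosts[host][draw(pg, host) % len(hosts[host])]
        if osd not in out_osds:
            acting.append(osd)
    return acting

hosts = {
    "host-a": [0, 1, 2, 3, 4, 5],
    "host-b": [6, 7, 8, 9, 10, 11],
    "host-c": [12, 13, 14, 15, 16, 17],
}
degraded = [pg for pg in range(1024) if len(place(pg, hosts, {9})) == 2]
print(f"{len(degraded)} of 1024 toy PGs left with 2 replicas after osd.9 out")
```

In this simplified model, roughly a sixth of the PGs stay degraded after one OSD goes out, which matches the shape of the problem: the rule can reach the right host but never re-descends to a different OSD inside it.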
While this cluster is running firefly, I believe this can be reproduced on giant as well.