Bug #9911: ceph not placing replicas to OSDs on same host as down/out OSD
Status: Closed
Description
On a 3-node firefly cluster with 6 OSDs per host and 3x replication, when noup is set and 1 OSD is marked down/out, a number of PGs are set active+degraded and never recover, leaving only 2 OSDs in the acting set.
The crush rule governing replication is:
[pre]
rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}
[/pre]
Changing the rule to "type osd" fixes the problem, as might be expected, though presumably "type host" should also work in this scenario. Attached are a pg dump, an osd map, and a crush map.
Reproducing an example pg mapping with osdmaptool also shows only 2 OSDs in the acting set:
[pre]
regression@plana15:/tmp/cbt/ceph/log$ osdmaptool /tmp/osd.map --test-map-pg 1.ffe
osdmaptool: osdmap file '/tmp/osd.map'
parsed '1.ffe' -> 1.ffe
1.ffe raw ([9,3], p9) up ([9,3], p9) acting ([9,3], p9)
[/pre]
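The degraded state here is simply the acting set being shorter than the pool's replica count. A minimal sketch of that check (a hypothetical helper for illustration, not part of any Ceph tooling; pg 1.001 and its mapping are made up):

```python
# Hypothetical helper, not part of Ceph: flag PGs whose acting set is
# shorter than the pool's replica count (i.e. degraded under 3x replication).
def degraded_pgs(pg_acting, pool_size):
    """Return pgids with fewer acting OSDs than pool_size."""
    return sorted(pgid for pgid, acting in pg_acting.items()
                  if len(acting) < pool_size)

# Mirrors the report: a 3x pool, but pg 1.ffe only maps to [9, 3].
pgs = {"1.ffe": [9, 3], "1.001": [4, 10, 16]}
print(degraded_pgs(pgs, 3))  # -> ['1.ffe']
```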
While this cluster is running firefly, I believe this can be reproduced on giant as well.
Updated by Mark Nelson over 9 years ago
ceph -s output with an OSD down and type host:
[pre]
regression@plana15:/tmp/cbt/ceph/log$ ceph -s
    cluster 57fd3030-b7f8-4662-b115-bbbdd586c82d
     health HEALTH_WARN 919 pgs degraded; 40 pgs stuck unclean; recovery 168/7686 objects degraded (2.186%); noup flag(s) set
     monmap e1: 1 mons at {a=10.214.144.25:6789/0}, election epoch 2, quorum 0 a
     osdmap e258: 18 osds: 17 up, 17 in
            flags noup
      pgmap v7539: 16384 pgs, 4 pools, 10240 MB data, 2562 objects
            33337 MB used, 15770 GB / 15803 GB avail
            168/7686 objects degraded (2.186%)
               15312 active+clean
                 919 active+degraded
                 153 active+remapped
[/pre]
And with type osd:
[pre]
regression@plana15:/tmp/cbt/ceph/log$ ceph -s
    cluster 57fd3030-b7f8-4662-b115-bbbdd586c82d
     health HEALTH_WARN noup flag(s) set
     monmap e1: 1 mons at {a=10.214.144.25:6789/0}, election epoch 2, quorum 0 a
     osdmap e244: 18 osds: 17 up, 17 in
            flags noup
      pgmap v7461: 16384 pgs, 4 pools, 10240 MB data, 2562 objects
            32200 MB used, 15771 GB / 15803 GB avail
                16384 active+clean
[/pre]
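As a quick sanity check, the degraded percentage in the "type host" output is just degraded object instances over total instances (2562 objects at 3x replication):

```python
# 168 degraded object instances out of 7686 total (2562 objects x 3 replicas).
degraded, total = 168, 2562 * 3
print(f"{degraded}/{total} objects degraded ({100 * degraded / total:.3f}%)")  # -> 2.186%
```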
Updated by Andrey Korolyov over 9 years ago
Can confirm the placement mess on giant: I am backfilling one node from another within a two-node cluster. After today's blackout, one of the OSDs on the receiver node went dead (leveldb/rocksdb issue) and some PGs are reported as down+peering afterwards. With replication factor 2, this should never happen.
Updated by Sage Weil over 9 years ago
Andrey Korolyov wrote:
Can confirm the placement mess on giant: I am backfilling one node from another within a two-node cluster. After today's blackout, one of the OSDs on the receiver node went dead (leveldb/rocksdb issue) and some PGs are reported as down+peering afterwards. With replication factor 2, this should never happen.
Haven't looked closely at this issue, but with 2x replication, down+peering definitely can happen. Consider a pg that maps to [0,1] normally, then shrinks to [0], takes some writes, then osd.0 fails and osd.1 comes up, so we get [1]. In that case the pg cannot go active because osd.1 is not up to date. I suspect if you run 'ceph pg <pgid> query' on the down+peering pg, it will tell you it needs the other OSD up to complete peering.
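Sage's scenario can be sketched as a toy peering check (a hypothetical model for illustration, not Ceph's actual peering code): the acting set can only go active if it contains an OSD that holds the newest version of the PG.

```python
# Toy model of the scenario above, not Ceph's real peering logic.
def can_go_active(acting, osd_versions):
    """A PG may go active only if some acting OSD holds the newest version."""
    newest = max(osd_versions.values())
    return any(osd_versions[osd] == newest for osd in acting)

# pg maps to [0, 1], then [0] alone takes writes (v10 -> v25), then osd.0 fails:
versions = {0: 25, 1: 10}
print(can_go_active([0, 1], versions))  # True: osd.0 has the latest data
print(can_go_active([1], versions))     # False: stuck down+peering until osd.0 returns
```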
Updated by Andrey Korolyov over 9 years ago
Sorry, forgot that the majority agreement does not work with two replicas. Everything is ok now.
Updated by Sage Weil over 9 years ago
- Status changed from New to Rejected
Ah, it's because the vary_r tunable is false. We fixed this bug in firefly; switching to firefly tunables will resolve it.
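The effect of vary_r can be sketched with a toy chooseleaf retry loop (a simplified model with a stand-in hash, not the real CRUSH code): without vary_r, every retry descends into the host bucket with the same r, deterministically re-picking the same down/out OSD, so the host is rejected even though it has other usable OSDs.

```python
def h(a, b):
    # Deterministic stand-in for CRUSH's rjenkins1 hash, not the real thing.
    return (a * 2654435761 + b * 2246822519) & 0xFFFFFFFF

def choose_in_host(osds, pgid, r):
    # Deterministic pick of one OSD inside a host bucket.
    return osds[h(pgid, r) % len(osds)]

def chooseleaf_one(osds, pgid, rep, out, vary_r, tries=10):
    """Try to find a usable OSD on this host for replica number `rep`."""
    for attempt in range(tries):
        # With vary_r, the retry number perturbs the descent; without it,
        # every attempt repeats exactly the same choice.
        r = rep + attempt if vary_r else rep
        osd = choose_in_host(osds, pgid, r)
        if osd not in out:
            return osd
    return None  # host rejected -> PG left with a short acting set

host = [4, 5, 6]                          # 3 OSDs on one host
bad = choose_in_host(host, 0xffe, 1)      # mark the hashed-to OSD as out
print(chooseleaf_one(host, 0xffe, 1, {bad}, vary_r=False))  # None: stuck on the out OSD
print(chooseleaf_one(host, 0xffe, 1, {bad}, vary_r=True))   # another OSD on the same host
```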
Updated by Greg Farnum almost 7 years ago
- Project changed from Ceph to RADOS
- Category deleted (10)