Bug #9911: ceph not placing replicas to OSDs on same host as down/out OSD
Status: Closed
Description
On a 3-node firefly cluster with 6 OSDs per host and 3x replication, when noup is set and 1 OSD is marked down/out, a number of PGs are set active+degraded and never recover, leaving only 2 OSDs in the acting set.
The crush rule governing replication is:
[pre]
rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}
[/pre]
Changing the rule to "type osd" fixes the problem, as might be expected, though presumably "type host" should also work in this scenario. Attached are a pg dump, an osd map, and a crush map.
Reproducing an example pg mapping with osdmaptool also shows only 2 OSDs in the acting set:
[pre]
regression@plana15:/tmp/cbt/ceph/log$ osdmaptool /tmp/osd.map --test-map-pg 1.ffe
osdmaptool: osdmap file '/tmp/osd.map'
parsed '1.ffe' -> 1.ffe
1.ffe raw ([9,3], p9) up ([9,3], p9) acting ([9,3], p9)
[/pre]
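The degraded state here is simply the acting set being shorter than the pool's replica count. A minimal sketch of that check (a hypothetical helper for illustration, not part of any Ceph tooling; pg 1.001 and its mapping are made up):

```python
# Hypothetical helper, not part of Ceph: flag PGs whose acting set is
# shorter than the pool's replica count (i.e. degraded under 3x replication).
def degraded_pgs(pg_acting, pool_size):
    """Return pgids with fewer acting OSDs than pool_size."""
    return sorted(pgid for pgid, acting in pg_acting.items()
                  if len(acting) < pool_size)

# Mirrors the report: a 3x pool, but pg 1.ffe only maps to [9, 3].
pgs = {"1.ffe": [9, 3], "1.001": [4, 10, 16]}
print(degraded_pgs(pgs, 3))  # -> ['1.ffe']
```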
While this cluster is running firefly, I believe this can be reproduced on giant as well.
Updated by Mark Nelson over 9 years ago
ceph -s output with an OSD down and type host:
[pre]
regression@plana15:/tmp/cbt/ceph/log$ ceph -s
    cluster 57fd3030-b7f8-4662-b115-bbbdd586c82d
     health HEALTH_WARN 919 pgs degraded; 40 pgs stuck unclean; recovery 168/7686 objects degraded (2.186%); noup flag(s) set
     monmap e1: 1 mons at {a=10.214.144.25:6789/0}, election epoch 2, quorum 0 a
     osdmap e258: 18 osds: 17 up, 17 in
            flags noup
      pgmap v7539: 16384 pgs, 4 pools, 10240 MB data, 2562 objects
            33337 MB used, 15770 GB / 15803 GB avail
            168/7686 objects degraded (2.186%)
               15312 active+clean
                 919 active+degraded
                 153 active+remapped
[/pre]
And with type osd:
[pre]
regression@plana15:/tmp/cbt/ceph/log$ ceph -s
    cluster 57fd3030-b7f8-4662-b115-bbbdd586c82d
     health HEALTH_WARN noup flag(s) set
     monmap e1: 1 mons at {a=10.214.144.25:6789/0}, election epoch 2, quorum 0 a
     osdmap e244: 18 osds: 17 up, 17 in
            flags noup
      pgmap v7461: 16384 pgs, 4 pools, 10240 MB data, 2562 objects
            32200 MB used, 15771 GB / 15803 GB avail
                16384 active+clean
[/pre]
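As a quick sanity check, the degraded percentage in the "type host" output is just degraded object instances over total instances (2562 objects at 3x replication):

```python
# 168 degraded object instances out of 7686 total (2562 objects x 3 replicas).
degraded, total = 168, 2562 * 3
print(f"{degraded}/{total} objects degraded ({100 * degraded / total:.3f}%)")  # -> 2.186%
```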
Updated by Andrey Korolyov over 9 years ago
Can confirm the placement mess on giant: I am backfilling one node from another within a two-node cluster. After today's blackout, one of the OSDs on the receiver node went dead (leveldb/rocksdb issue) and some PGs are reported as down+peering afterwards. With replication factor 2, this should never happen.
Updated by Sage Weil over 9 years ago
Andrey Korolyov wrote:
Can confirm the placement mess on giant: I am backfilling one node from another within a two-node cluster. After today's blackout, one of the OSDs on the receiver node went dead (leveldb/rocksdb issue) and some PGs are reported as down+peering afterwards. With replication factor 2, this should never happen.
Haven't looked closely at this issue, but with 2x replication, down+peering definitely can happen. Consider a pg that maps to [0,1] normally, then shrinks to [0], takes some writes, then osd.0 fails and osd.1 comes up, so we get [1]. In that case the pg cannot go active because osd.1 is not up to date. I suspect if you run 'ceph pg <pgid> query' on the down+peering pg, it will tell you it needs the other OSD up to complete peering.
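Sage's scenario can be sketched as a toy peering check (a hypothetical model for illustration, not Ceph's actual peering code): the acting set can only go active if it contains an OSD that holds the newest version of the PG.

```python
# Toy model of the scenario above, not Ceph's real peering logic.
def can_go_active(acting, osd_versions):
    """A PG may go active only if some acting OSD holds the newest version."""
    newest = max(osd_versions.values())
    return any(osd_versions[osd] == newest for osd in acting)

# pg maps to [0, 1], then [0] alone takes writes (v10 -> v25), then osd.0 fails:
versions = {0: 25, 1: 10}
print(can_go_active([0, 1], versions))  # True: osd.0 has the latest data
print(can_go_active([1], versions))     # False: stuck down+peering until osd.0 returns
```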
Updated by Andrey Korolyov over 9 years ago
Sorry, forgot that the majority agreement does not work with two replicas. Everything is ok now.
Updated by Sage Weil over 9 years ago
- Status changed from New to Rejected
Ah, it's because the vary_r tunable is false. We fixed this bug in firefly; switching to firefly tunables will resolve it.
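The effect of vary_r can be sketched with a toy chooseleaf retry loop (a simplified model with a stand-in hash, not the real CRUSH code): without vary_r, every retry descends into the host bucket with the same r, deterministically re-picking the same down/out OSD, so the host is rejected even though it has other usable OSDs.

```python
def h(a, b):
    # Deterministic stand-in for CRUSH's rjenkins1 hash, not the real thing.
    return (a * 2654435761 + b * 2246822519) & 0xFFFFFFFF

def choose_in_host(osds, pgid, r):
    # Deterministic pick of one OSD inside a host bucket.
    return osds[h(pgid, r) % len(osds)]

def chooseleaf_one(osds, pgid, rep, out, vary_r, tries=10):
    """Try to find a usable OSD on this host for replica number `rep`."""
    for attempt in range(tries):
        # With vary_r, the retry number perturbs the descent; without it,
        # every attempt repeats exactly the same choice.
        r = rep + attempt if vary_r else rep
        osd = choose_in_host(osds, pgid, r)
        if osd not in out:
            return osd
    return None  # host rejected -> PG left with a short acting set

host = [4, 5, 6]                          # 3 OSDs on one host
bad = choose_in_host(host, 0xffe, 1)      # mark the hashed-to OSD as out
print(chooseleaf_one(host, 0xffe, 1, {bad}, vary_r=False))  # None: stuck on the out OSD
print(chooseleaf_one(host, 0xffe, 1, {bad}, vary_r=True))   # another OSD on the same host
```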
Updated by Greg Farnum almost 7 years ago
- Project changed from Ceph to RADOS
- Category deleted (10)