Bug #9911 (closed): ceph not placing replicas on OSDs on same host as down/out OSD

Added by Mark Nelson over 9 years ago. Updated almost 7 years ago.

Status: Rejected
Priority: Urgent
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

On a 3-node firefly cluster with 6 OSDs per host and 3x replication, when noup is set and 1 OSD is marked down/out, a number of PGs become active+degraded and never recover, with only 2 OSDs in the acting set.
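
For context, the down/out state described above can be produced with the standard CLI, roughly as follows; the OSD id is arbitrary and purely illustrative:

[pre]
# keep the OSD from being marked back up automatically
ceph osd set noup

# mark one OSD down and out (osd.3 is an arbitrary example)
ceph osd down 3
ceph osd out 3

# watch for PGs stuck active+degraded with only 2 OSDs in the acting set
ceph -s
ceph pg dump | grep degraded
[/pre]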

The crush rule governing replication is:

[pre]
rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}
[/pre]

Changing the rule to "type osd" fixes the problem, as might be expected, though presumably "type host" should also work in this scenario. Attached are a pg dump, osd map, and crush map.
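
For reference, this is the adjusted rule with the chooseleaf step selecting individual OSDs instead of hosts; everything else is unchanged from the rule above:

[pre]
rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type osd
        step emit
}
[/pre]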

Remapping an example PG with osdmaptool also shows only 2 OSDs in the acting set:

[pre]
regression@plana15:/tmp/cbt/ceph/log$ osdmaptool /tmp/osd.map --test-map-pg 1.ffe
osdmaptool: osdmap file '/tmp/osd.map'
parsed '1.ffe' -> 1.ffe
1.ffe raw ([9,3], p9) up ([9,3], p9) acting ([9,3], p9)
[/pre]
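
The same check can be repeated offline against the attached maps after editing the rule, without touching a live cluster. A rough sketch of that workflow with osdmaptool/crushtool, assuming the osd map path from above and the "type host" -> "type osd" edit described earlier:

[pre]
# extract and decompile the crush map from the saved osdmap
osdmaptool /tmp/osd.map --export-crush /tmp/crush.bin
crushtool -d /tmp/crush.bin -o /tmp/crush.txt

# edit /tmp/crush.txt (chooseleaf ... type host -> type osd), then recompile
crushtool -c /tmp/crush.txt -o /tmp/crush.new

# import the modified crush map and re-test the problem pg
osdmaptool /tmp/osd.map --import-crush /tmp/crush.new
osdmaptool /tmp/osd.map --test-map-pg 1.ffe
[/pre]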

While this cluster is firefly, I believe this can be reproduced in giant as well.


Files

crush_bug.tgz (424 KB), added by Mark Nelson, 10/27/2014 02:11 PM