Bug #9911 (closed): ceph not placing replicas on OSDs on same host as down/out OSD

Added by Mark Nelson over 9 years ago. Updated almost 7 years ago.

Status: Rejected
Priority: Urgent
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

On a 3-node firefly cluster with 6 OSDs per host and 3x replication, when noup is set and 1 OSD is marked down/out, a number of PGs become active+degraded and never recover, with only 2 OSDs in the acting set.
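
For context, the down/out state described above can be produced with the standard CLI, roughly as follows; the OSD id is arbitrary and purely illustrative:

[pre]
# keep the OSD from being marked back up automatically
ceph osd set noup

# mark one OSD down and out (osd.3 is an arbitrary example)
ceph osd down 3
ceph osd out 3

# watch for PGs stuck active+degraded with only 2 OSDs in the acting set
ceph -s
ceph pg dump | grep degraded
[/pre]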

The crush rule governing replication is:

[pre]
rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}
[/pre]

Changing the rule to "type osd" fixes the problem, as might be expected, though presumably "type host" should also work in this scenario. Attached are a pg dump, osd map, and crush map.
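
For reference, this is the adjusted rule with the chooseleaf step selecting individual OSDs instead of hosts; everything else is unchanged from the rule above:

[pre]
rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type osd
        step emit
}
[/pre]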

Remapping an example PG with osdmaptool also shows only 2 OSDs in the acting set:

[pre]
regression@plana15:/tmp/cbt/ceph/log$ osdmaptool /tmp/osd.map --test-map-pg 1.ffe
osdmaptool: osdmap file '/tmp/osd.map'
parsed '1.ffe' -> 1.ffe
1.ffe raw ([9,3], p9) up ([9,3], p9) acting ([9,3], p9)
[/pre]
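
The same check can be repeated offline against the attached maps after editing the rule, without touching a live cluster. A rough sketch of that workflow with osdmaptool/crushtool, assuming the osd map path from above and the "type host" -> "type osd" edit described earlier:

[pre]
# extract and decompile the crush map from the saved osdmap
osdmaptool /tmp/osd.map --export-crush /tmp/crush.bin
crushtool -d /tmp/crush.bin -o /tmp/crush.txt

# edit /tmp/crush.txt (chooseleaf ... type host -> type osd), then recompile
crushtool -c /tmp/crush.txt -o /tmp/crush.new

# import the modified crush map and re-test the problem pg
osdmaptool /tmp/osd.map --import-crush /tmp/crush.new
osdmaptool /tmp/osd.map --test-map-pg 1.ffe
[/pre]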

While this cluster is firefly, I believe this can be reproduced in giant as well.


Files

crush_bug.tgz (424 KB), added by Mark Nelson, 10/27/2014 02:11 PM