Bug #62214

crush: a crush rule with multiple choose steps will not retry an earlier step if it chose a bucket with no in OSDs

Added by Samuel Just 9 months ago. Updated 9 months ago.

Status: New
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Debatably, this isn't a bug, but it is at least counterintuitive behavior.

In the case of a crush rule like

rule replicated_rule_1 {
    ...
    step take default class hdd
    step chooseleaf firstn 3 type host
    step emit
}

We expect that if all of the OSDs on a particular host are marked out, mappings that included those OSDs will end up on another host (provided there are enough hosts). Indeed, that's how it works.
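
For illustration, one way to check this offline is crushtool's test mode, where --weight <id> 0 simulates marking an OSD out; the file name, rule id, and OSD ids below are placeholders:

    ceph osd getcrushmap -o crushmap.bin
    # baseline mappings for the chooseleaf rule (rule id 1 as an example)
    crushtool -i crushmap.bin --test --rule 1 --num-rep 3 --min-x 0 --max-x 9 --show-mappings
    # simulate marking every OSD on one host out (say osd.0 and osd.1)
    crushtool -i crushmap.bin --test --rule 1 --num-rep 3 --min-x 0 --max-x 9 --show-mappings --weight 0 0 --weight 1 0
    # the affected positions remap to OSDs on other hosts, as expected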

Consider instead a rule with two choose steps like

rule replicated_rule_1 {
    ...
    step take default class hdd
    step choose firstn 3 type host
    step choose firstn 1 type osd
    step emit
}

If we mark a single OSD out, PGs including that OSD remap to another OSD on the same host. However, if all of the OSDs on a host are marked out, PGs mapped to that host will not be remapped and will be stuck degraded until the host is actually removed from the hierarchy or reweighted to 0.
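
Here is a hedged reproduction sketch using the same crushtool setup as above (rule id 2 standing in for the choose/choose rule, osd.0 and osd.1 standing in for one host's OSDs):

    # simulate marking every OSD on one host out
    crushtool -i crushmap.bin --test --rule 2 --num-rep 3 --min-x 0 --max-x 9 --show-mappings --weight 0 0 --weight 1 0
    # the outer choose still selects that host (its bucket weight is
    # unchanged), the inner choose then rejects all of its OSDs, and the
    # slot is simply dropped from the result instead of remapping to a
    # third host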

The motivating example is actually wide EC codes on small clusters:

rule ecpool-86 {
    id 86
    type erasure
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default class hdd
    step choose indep 4 type host
    step chooseleaf indep 4 type osd
    step emit
}

In the above case, once a PG has shards on a host, those positions won't remap to another host unless the host is removed or reweighted to 0. See https://github.com/athanatos/ceph/tree/sjust/wip-ec-86-test-62213-62214 for some unit test examples.
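
For contrast, zeroing the host's weight in the hierarchy does force the remap, matching the "reweighted to 0" escape hatch above; a sketch, with host1 as a placeholder name:

    # set the crush weight of every OSD under the host to 0; the host
    # bucket's weight drops to 0, so the outer choose stops selecting it
    ceph osd crush reweight-subtree host1 0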


Related issues 1 (1 open, 0 closed)

Related to RADOS - Bug #62213: crush: choose leaf with type = 0 may incorrectly map out osds (New, assignee: Laura Flores)

#1 Updated by Samuel Just 9 months ago

  • Assignee set to Samuel Just

#2 Updated by Samuel Just 9 months ago

  • Description updated (diff)

#3 Updated by Radoslaw Zarzynski 9 months ago

This tracker was added to the agenda of the 8/8/2023 RADOS Team Meeting.

#4 Updated by Greg Farnum 9 months ago

Yes, this assessment looks correct to me. It's a big part of why chooseleaf exists — the CRUSH state machine just doesn't have a way to back out of bad choices that aren't detected as bad until a later step. (Or that's my understanding — I've really never played around with internal crush code at all.)

I believe we do have mechanisms that update host weights, but they take some time to get triggered? Or is there some issue where reducing weights can change the mapping of items in other buckets? I think the whole point of straw is to prevent that, so reducing a weight should never cause data to get migrated in (or between two other buckets). But maybe I'm missing something.
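
One way to sanity-check that property offline is to edit a bucket weight and diff crushtool's test output; a sketch with placeholder file names and rule id:

    crushtool -d crushmap.bin -o crushmap.txt
    # hand-edit crushmap.txt to lower one host's item weight under the root
    crushtool -c crushmap.txt -o crushmap-low.bin
    crushtool -i crushmap.bin --test --rule 1 --num-rep 3 --show-mappings > before.txt
    crushtool -i crushmap-low.bin --test --rule 1 --num-rep 3 --show-mappings > after.txt
    # with straw2 buckets, the diff should only touch inputs that had
    # mapped to the reweighted host
    diff before.txt after.txt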

#5 Updated by Samuel Just 9 months ago

I haven't been able to find anything in OSDMonitor that would reweight the bucket as the OSDs are marked out -- it wouldn't be necessary for single chooseleaf rules anyway. I might well be missing something though.

#6 Updated by Laura Flores 8 months ago

  • Related to Bug #62213: crush: choose leaf with type = 0 may incorrectly map out osds added