Bug #62213

open

crush: choose leaf with type = 0 may incorrectly map out osds

Added by Samuel Just 9 months ago. Updated 8 months ago.

Status: New
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The motivating example was:

rule ecpool-86 {
id 86
type erasure
step set_chooseleaf_tries 5
step set_choose_tries 100
step take default class hdd
step choose indep 4 type host
step chooseleaf indep 4 type osd
step emit
}

If all of the OSDs in a host are marked out, crush_choose_indep on a leaf bucket with recurse_to_leaf=1 (chooseleaf rather than choose) will populate out2 before the is_out check:

                if (recurse_to_leaf) {
                    if (item < 0) {
                        crush_choose_indep(
                            map,
                            work,
                            map->buckets[-1-item],
                            weight, weight_max,
                            x, 1, numrep, 0,
                            out2, rep,
                            recurse_tries, 0,
                            0, NULL, r, choose_args);
                        if (out2 && out2[rep] == CRUSH_ITEM_NONE) {
                            /* placed nothing; no leaf */
                            break;
                        }
                    } else if (out2) {
                        /* we already have a leaf! */
                        out2[rep] = item;
                    }
                }

                /* out? */
                if (itemtype == 0 &&
                    is_out(map, weight, weight_max, item, x))
                    break;

                /* yay! */
                out[rep] = item;
                left--;
                break;

If the loop exhausts its retries (ftotal >= tries), any out OSDs already written to out2 are still there when the function returns to the caller, so the mapped set ends up containing out OSDs.
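The failure mode can be illustrated with a small toy model. This is a minimal sketch, not the real mapper: the names toy_fill_slot and toy_leaf_for_all_out_host, the TOY_UNDEF/TOY_NONE sentinels, and the round-robin item choice are all hypothetical stand-ins, and the model assumes that after retries are exhausted only slots still undefined are reset to "none". The record_before_check flag exists only to contrast the two orderings; it is not a proposed patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for CRUSH_ITEM_UNDEF / CRUSH_ITEM_NONE; values are arbitrary. */
enum { TOY_UNDEF = -1000, TOY_NONE = -1001 };

/*
 * Toy model (hypothetical): try to fill slot `rep` from a leaf bucket
 * whose items are OSD ids. osd_out[id] is true when that OSD is marked
 * out. With record_before_check set, out2[rep] is written before the
 * out check, as in the excerpt; otherwise only after the check passes.
 */
void toy_fill_slot(const int *items, size_t n_items,
                   const bool *osd_out, int tries,
                   bool record_before_check,
                   int *out, int *out2, int rep)
{
    assert(n_items > 0 && tries > 0);
    out[rep] = TOY_UNDEF;
    out2[rep] = TOY_UNDEF;

    for (int ftotal = 0; ftotal < tries; ftotal++) {
        /* Deterministic stand-in for the bucket's pseudo-random choice. */
        int item = items[(size_t)(rep + ftotal) % n_items];

        if (record_before_check)
            out2[rep] = item;   /* "we already have a leaf!" */

        if (osd_out[item])
            continue;           /* "out?" check fails, so retry */

        if (!record_before_check)
            out2[rep] = item;
        out[rep] = item;        /* "yay!" */
        return;
    }

    /* Assumed cleanup: only slots still undefined become "none". */
    if (out[rep] == TOY_UNDEF)
        out[rep] = TOY_NONE;
    if (out2[rep] == TOY_UNDEF)
        out2[rep] = TOY_NONE;
}

/* Host with OSDs 0..3, all marked out, 5 tries; returns out2[0]. */
int toy_leaf_for_all_out_host(bool record_before_check)
{
    const int items[] = {0, 1, 2, 3};
    const bool osd_out[] = {true, true, true, true};
    int out[1], out2[1];

    toy_fill_slot(items, 4, osd_out, 5, record_before_check, out, out2, 0);
    return out2[0];
}
```

In this model, toy_leaf_for_all_out_host(true) returns an OSD id even though every OSD is out, while toy_leaf_for_all_out_host(false) returns TOY_NONE.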

chooseleaf with any type other than osd won't trigger this bug.

This issue can be worked around by using choose rather than chooseleaf (the behavior should otherwise be the same).
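Applied to the motivating rule, the workaround would look roughly like this. Only the last choose step differs; everything else is copied from the example in this description.

```
rule ecpool-86 {
id 86
type erasure
step set_chooseleaf_tries 5
step set_choose_tries 100
step take default class hdd
step choose indep 4 type host
step choose indep 4 type osd
step emit
}
```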

It's hard to predict the impact of this bug in a real cluster. There are relatively few clusters with multiple choose steps like this. If the out OSDs are also down, they'll be filtered out later anyway. Nevertheless, it's probably worth avoiding as CRUSH isn't supposed to be able to map out OSDs.

This is a bug in CRUSH itself, so we can't simply patch the behavior. It'll need a tunable gated on client capability, etc.

A short-term mitigation might be to reject the creation of CRUSH rules like this, with an error message saying to use the choose variant instead, along with a warning for any pre-existing rules like this.
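A check of that kind could be sketched as follows. The step representation here (enum toy_op, struct toy_step) and the function toy_find_chooseleaf_osd are hypothetical and deliberately simplified; they are not Ceph's rule structures. The sketch assumes type 0 is the osd type, as in the title of this ticket, and it only flags the indep variant because that is the code path described above. Whether other variants should be flagged is left open.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified rule-step model (not Ceph's own structs). */
enum toy_op {
    TOY_TAKE,
    TOY_CHOOSE_INDEP,
    TOY_CHOOSELEAF_INDEP,
    TOY_EMIT
};

struct toy_step {
    enum toy_op op;
    int type;   /* bucket type id; 0 is assumed to be the osd type */
};

/*
 * Return the index of the first "chooseleaf indep ... type osd" step,
 * or -1 if the rule contains none. A caller could reject a new rule,
 * or warn about an existing one, when the result is >= 0.
 */
int toy_find_chooseleaf_osd(const struct toy_step *steps, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (steps[i].op == TOY_CHOOSELEAF_INDEP && steps[i].type == 0)
            return (int)i;
    }
    return -1;
}
```

For the motivating rule (take, choose indep type host, chooseleaf indep type osd, emit) this returns the index of the chooseleaf step; for the same rule written with choose in the last step it returns -1.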

See https://github.com/athanatos/ceph/tree/sjust/wip-ec-86-test-62213-62214 for a unit test reproducer.


Related issues 1 (1 open, 0 closed)

Related to RADOS - Bug #62214: crush: a crush rule with multiple choose steps will not retry an earlier step if it chose a bucket with no in OSDs (New, Samuel Just)

Actions #1

Updated by Samuel Just 9 months ago

  • Description updated (diff)
Actions #2

Updated by Samuel Just 9 months ago

  • Description updated (diff)
Actions #3

Updated by Samuel Just 9 months ago

  • Description updated (diff)
Actions #4

Updated by Radoslaw Zarzynski 9 months ago

This tracker got added to the agenda of 8/8/2023 RADOS Team Meeting.

Actions #5

Updated by Laura Flores 8 months ago

  • Assignee changed from Samuel Just to Laura Flores
Actions #6

Updated by Radoslaw Zarzynski 8 months ago

Bump up.

Actions #7

Updated by Laura Flores 8 months ago

  • Related to Bug #62214: crush: a crush rule with multiple choose steps will not retry an earlier step if it chose a bucket with no in OSDs added