Feature #55169

open

crush: should validate rule outputs osds

Added by Dan van der Ster about 2 years ago. Updated over 1 year ago.

Status: In Progress
Priority: Normal
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Reviewed:
Affected Versions:
Component(RADOS): CRUSH
Pull request ID:

Description

In this thread https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/2ZUJN75RLL4YYD4EHAUS5I4IL37A7UUL/ a user suffered a multi-day outage, with down PGs and OSDs crashing with the error "start interval does not contain the required bound".

After a long story, the root cause was found to be that the user had injected a crush rule that had "choose" instead of "chooseleaf".

rule csd-data-pool {
         id 5
         type erasure
         min_size 3
         max_size 5
         step set_chooseleaf_tries 5
         step set_choose_tries 100
         step take default class big
         step choose indep 0 type host    <--- HERE!
         step emit
}

Can we add better validation to prevent such mistakes?
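
To illustrate why the marked step matters, here is a toy sketch of the difference between the two steps. It is not the CRUSH algorithm: real CRUSH selects buckets pseudo-randomly by weight, and the host and OSD names below are invented. It only shows that "choose ... type host" leaves host buckets in the result, while "chooseleaf ... type host" descends to one OSD under each chosen host.

```python
# Toy model only: real CRUSH uses pseudo-random, weighted bucket selection.
# The hierarchy below is hypothetical and exists purely for illustration.
TOY_HOSTS = {
    "host-a": ["osd.0", "osd.1"],
    "host-b": ["osd.2", "osd.3"],
    "host-c": ["osd.4", "osd.5"],
}


def choose_hosts(n):
    """Mimic 'step choose indep N type host': the result is host buckets."""
    return sorted(TOY_HOSTS)[:n]


def chooseleaf_hosts(n):
    """Mimic 'step chooseleaf indep N type host': pick hosts, then descend
    to one OSD under each host, so the result is devices."""
    return [TOY_HOSTS[host][0] for host in choose_hosts(n)]


print(choose_hosts(3))      # host names, which are not valid PG members
print(chooseleaf_hosts(3))  # OSD names, which is what "step emit" should see
```

With "choose", the following "step emit" receives hosts rather than OSDs, which is the kind of output the requested validation would reject.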

Actions #1

Updated by Radoslaw Zarzynski about 2 years ago

  • Tracker changed from Bug to Feature
  • Tags set to low-hanging-fruit

Adding the extra check makes sense, I think. Implementing the patch would be low-hanging fruit, but reviewing it will not be.

Actions #2

Updated by Laura Flores almost 2 years ago

  • Tag list set to low-hanging-fruit
  • Tags deleted (low-hanging-fruit)
Actions #3

Updated by Shreyansh Sancheti over 1 year ago

  • Status changed from New to Need More Info
  • Assignee set to Shreyansh Sancheti

Need more info on this!

Actions #4

Updated by Dan van der Ster over 1 year ago

Sure, what do you need to know?

Actions #5

Updated by Shreyansh Sancheti over 1 year ago

So, if I understand correctly, "start interval does not contain the required bound" means it doesn't have a range? Also, do you have that thread handy? It is no longer accessible, at least for me.

Actions #6

Updated by Dan van der Ster over 1 year ago

The original thread is here: https://www.mail-archive.com/ceph-users@ceph.io/msg15021.html
The root cause was found to be using "choose" in the crush rule, which is wrong -- it should be "chooseleaf".

So the proposed fix would be to validate crush rules (e.g. during ceph osd setcrushmap). There is already quite a lot of validation done -- parsing check, smoke test. So IMHO the work here would be to debug why the existing smoke test (mon_osd_crush_smoke_test) doesn't catch this issue.
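
As a rough illustration of the kind of check being asked for, here is a minimal sketch of a static validator over decompiled rule text. It is not how Ceph or mon_osd_crush_smoke_test works; the function name find_non_osd_emits is hypothetical, and it assumes the device type is named "osd" and that the rule uses only take, choose, chooseleaf and emit steps.

```python
import re

# Hypothetical helper, not part of Ceph: statically checks that every
# "step emit" in a decompiled CRUSH rule would emit devices (OSDs).
LEAF_TYPE = "osd"  # assumption: the map's device type is named "osd"

# Matches the steps that change the working set; "set_*" tunables are ignored.
STEP_RE = re.compile(
    r"^\s*step\s+(take|chooseleaf|choose|emit)\b(?:.*?\btype\s+(\S+))?"
)


def find_non_osd_emits(rule_text):
    """Return a list of problems found in the rule's steps."""
    problems = []
    current = None  # type of the items in the working set
    for lineno, line in enumerate(rule_text.splitlines(), start=1):
        match = STEP_RE.match(line)
        if not match:
            continue
        op, bucket_type = match.group(1), match.group(2)
        if op == "take":
            current = "bucket"      # a root or other bucket, not a device
        elif op == "choose":
            current = bucket_type   # stays at the named type
        elif op == "chooseleaf":
            current = LEAF_TYPE     # descends to a device under each bucket
        elif op == "emit":
            if current != LEAF_TYPE:
                problems.append(
                    f"line {lineno}: emit outputs '{current}' items, "
                    f"not {LEAF_TYPE}s"
                )
            current = None
    return problems


BAD_RULE = """\
rule csd-data-pool {
         step take default class big
         step choose indep 0 type host
         step emit
}
"""

for problem in find_non_osd_emits(BAD_RULE):
    print(problem)
```

A check like this flags the rule from the description and accepts the same rule written with "chooseleaf". A real fix would more likely live in the existing validation path and work on the compiled map rather than on text.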

Actions #7

Updated by Shreyansh Sancheti over 1 year ago

  • Status changed from Need More Info to In Progress