Feature #55169
open
crush: should validate rule outputs osds
Added by Dan van der Ster about 2 years ago.
Updated about 1 year ago.
Description
In this thread https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/2ZUJN75RLL4YYD4EHAUS5I4IL37A7UUL/ a user suffered a multi-day outage, with PGs down and OSDs crashing due to "start interval does not contain the required bound".
After a long story, the root cause was found to be that the user had injected a crush rule that had "choose" instead of "chooseleaf".
rule csd-data-pool {
    id 5
    type erasure
    min_size 3
    max_size 5
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default class big
    step choose indep 0 type host    <--- HERE!
    step emit
}
Can we add better validation to prevent such mistakes?
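As a rough illustration of the kind of check being asked for, here is a minimal Python sketch that scans the decompiled text of a rule and flags any "emit" whose preceding selection step would return buckets rather than OSDs. It is only a heuristic under stated assumptions: the helper name rule_emits_osds is hypothetical and not part of Ceph, the device type is assumed to be named "osd", and a rule that does "take" on a single OSD and emits it directly would be wrongly rejected.

```python
import re

# Hypothetical helper, not Ceph code: a static heuristic over decompiled rule text.
# Only the steps that affect what "emit" outputs are matched; set_* steps are ignored.
STEP_RE = re.compile(r"^step\s+(chooseleaf|choose|take|emit)\b(.*)$")


def rule_emits_osds(rule_text, device_type="osd"):
    """Return True if every 'emit' follows a step that selects devices."""
    selects_devices = False
    saw_emit = False
    for raw in rule_text.splitlines():
        # Drop trailing annotations such as "<--- HERE!" before parsing.
        line = raw.split("<---")[0].strip()
        match = STEP_RE.match(line)
        if not match:
            continue
        op, args = match.group(1), match.group(2).split()
        if op == "take":
            # A new "take" starts from a bucket again.
            selects_devices = False
        elif op == "chooseleaf":
            # chooseleaf descends to a leaf (device) under each chosen bucket.
            selects_devices = True
        elif op == "choose":
            # Plain choose returns items of the named type, so only
            # "type osd" yields devices.
            selects_devices = args[-2:] == ["type", device_type]
        elif op == "emit":
            saw_emit = True
            if not selects_devices:
                return False
    return saw_emit


BAD_RULE = """
rule csd-data-pool {
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default class big
    step choose indep 0 type host
    step emit
}
"""

print(rule_emits_osds(BAD_RULE))
print(rule_emits_osds(BAD_RULE.replace("step choose ", "step chooseleaf ")))
```

With the rule from this report the first call reports a problem, and swapping "choose" for "chooseleaf" makes the check pass.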
- Tracker changed from Bug to Feature
- Tags set to low-hanging-fruit
Adding the extra check makes sense, I think. Implementing the patch would be low-hanging fruit, but reviewing it would not.
- Tags set to low-hanging-fruit
- Tags deleted (low-hanging-fruit)
- Status changed from New to Need More Info
- Assignee set to Shreyansh Sancheti
Shreyansh Sancheti wrote:
Need more info on this!
Sure, what do you need to know?
So, if I understand correctly, "start interval does not contain the required bound" means it doesn't have a range? Also, do you have that thread handy? It is no longer accessible, at least for me.
The original thread is here: https://www.mail-archive.com/ceph-users@ceph.io/msg15021.html
The root cause was found to be using "choose" in the crush rule, which is wrong -- it should be "chooseleaf".
So the proposed fix would be to validate crush rules (e.g. during ceph osd setcrushmap). There is already quite a lot of validation done -- parsing check, smoke test. So IMHO the work here would be to debug why the existing smoke test (mon_osd_crush_smoke_test) doesn't catch this issue.
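One way to frame what a smoke test would need to assert: CRUSH bucket IDs are negative and OSD IDs are non-negative, so any negative ID in a rule's output means the rule emitted a bucket. The Python sketch below applies that check to a set of simulated mappings. The function names and the sample data are hypothetical, made up for illustration; this is not how mon_osd_crush_smoke_test is implemented.

```python
def non_osd_items(mapping):
    """Return the items in one mapping that are bucket IDs (negative)."""
    return [item for item in mapping if item < 0]


def find_bad_mappings(mappings):
    """Map each input x to the bucket IDs its mapping contains, if any.

    `mappings` is a dict of input value -> list of IDs the rule produced.
    """
    bad = {}
    for x, mapping in mappings.items():
        buckets = non_osd_items(mapping)
        if buckets:
            bad[x] = buckets
    return bad


# Made-up sample data: a rule that ends in "choose ... type host" would
# return host bucket IDs, while "chooseleaf" would return OSD IDs.
choose_output = {0: [-2, -5, -3], 1: [-3, -2, -4]}
chooseleaf_output = {0: [4, 17, 9], 1: [11, 2, 20]}

print(find_bad_mappings(choose_output))
print(find_bad_mappings(chooseleaf_output))
```

A validator built on this idea would reject the map whenever find_bad_mappings returns a non-empty result for any rule.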
- Status changed from Need More Info to In Progress