Feature #55169
opencrush: should validate rule outputs osds
Description
In this thread https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/2ZUJN75RLL4YYD4EHAUS5I4IL37A7UUL/ a user suffered a multi-day outage, with down PGs and OSDs crashing with the error "start interval does not contain the required bound".
After a long story, the root cause was found to be that the user had injected a crush rule that had "choose" instead of "chooseleaf".
rule csd-data-pool {
    id 5
    type erasure
    min_size 3
    max_size 5
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default class big
    step choose indep 0 type host    <--- HERE!
    step emit
}
Can we add better validation to prevent such mistakes?
Updated by Radoslaw Zarzynski about 2 years ago
- Tracker changed from Bug to Feature
- Tags set to low-hanging-fruit
Adding the extra check makes sense, I think. Implementing the patch would be low-hanging fruit, but reviewing it will not be.
Updated by Laura Flores almost 2 years ago
- Tag list set to low-hanging-fruit
- Tags deleted (low-hanging-fruit)
Updated by Shreyansh Sancheti about 1 year ago
- Status changed from New to Need More Info
- Assignee set to Shreyansh Sancheti
Need more info on this!
Updated by Dan van der Ster about 1 year ago
Shreyansh Sancheti wrote:
Need more info on this!
Sure, what do you need to know?
Updated by Shreyansh Sancheti about 1 year ago
Dan van der Ster wrote:
Sure, what do you need to know?
So, if I understand this correctly, "start interval does not contain the required bound" means it doesn't have a range? Also, do you have that thread handy? It is no longer accessible, at least for me.
Updated by Dan van der Ster about 1 year ago
Shreyansh Sancheti wrote:
So, if I understand this correctly, "start interval does not contain the required bound" means it doesn't have a range? Also, do you have that thread handy? It is no longer accessible, at least for me.
The original thread is here: https://www.mail-archive.com/ceph-users@ceph.io/msg15021.html
The root cause was found to be using "choose" in the crush rule, which is wrong -- it should be "chooseleaf".
So the proposed fix would be to validate crush rules (e.g. during ceph osd setcrushmap). There is already quite a lot of validation done -- parsing check, smoke test. So IMHO the work here would be to debug why the existing smoke test (mon_osd_crush_smoke_test) doesn't catch this issue.
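To make the intended check concrete, here is a rough Python sketch of the property the ticket title asks for: every "step emit" in a rule should be preceded by a step that selects OSDs. This is not Ceph code. The helper name rule_emits_osds, the regex, and the assumption that the leaf type is named "osd" are all hypothetical, and it only inspects decompiled rule text, ignoring cases such as a "take" that points directly at a device.

```python
import re

# Hypothetical helper for illustration only; not part of Ceph or crushtool.
# Matches the take/choose/chooseleaf/emit steps of a decompiled CRUSH rule.
# Steps such as set_choose_tries are deliberately not matched.
STEP_RE = re.compile(
    r"step\s+(take|chooseleaf|choose|emit)\b"
    r"(?:\s+(?:firstn|indep)\s+-?\d+\s+type\s+(\S+))?"
)

# The rule from the description, with "choose" where "chooseleaf" was needed.
BAD_RULE = """
rule csd-data-pool {
    id 5
    type erasure
    min_size 3
    max_size 5
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default class big
    step choose indep 0 type host
    step emit
}
"""


def rule_emits_osds(rule_text, leaf_type="osd"):
    """Return True if every 'step emit' follows a step that selects OSDs.

    Assumption: the leaf (device) type is called "osd", as in a default map.
    """
    selects_leaf = False
    saw_emit = False
    for op, bucket_type in STEP_RE.findall(rule_text):
        if op == "take":
            # A new take starts from a bucket again (simplification).
            selects_leaf = False
        elif op == "chooseleaf":
            # chooseleaf descends from the chosen buckets to OSDs.
            selects_leaf = True
        elif op == "choose":
            # A plain choose yields OSDs only when it targets the leaf type.
            selects_leaf = bucket_type == leaf_type
        elif op == "emit":
            saw_emit = True
            if not selects_leaf:
                return False
            selects_leaf = False
    return saw_emit


if __name__ == "__main__":
    fixed_rule = BAD_RULE.replace("step choose indep", "step chooseleaf indep")
    print("rule from the ticket emits OSDs:", rule_emits_osds(BAD_RULE))
    print("rule with chooseleaf emits OSDs:", rule_emits_osds(fixed_rule))
```

A real fix would presumably work on the compiled map inside the monitor or crushtool rather than on text, but the condition to enforce is the same.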
Updated by Shreyansh Sancheti about 1 year ago
- Status changed from Need More Info to In Progress