Support #8600
closed
MON crashes on new crushmap injection
Added by Jean-Charles Lopez almost 10 years ago.
Updated almost 4 years ago.
Category:
Correctness/Safety
Tags:
monitor crush segfault
Description
The crush map contains the following rule
rule ssd {
    ruleset 1
    type replicated
    min_size 1
    max_size 10
    step take ssd
    step chooseleaf firstn 0 type rack
    step chooseleaf firstn 0 type host
    step emit
}
crushtool compiles the map with no warnings or errors.
When the new map is injected into the cluster, it causes the MON to segfault.
Restarting the faulted MON brings the cluster back to normal operation.
The issue can be reproduced at will.
Files
ceph-mon.log (89.3 KB): ceph-mon log while injecting the map. Jean-Charles Lopez, 06/14/2014 01:44 PM
cmbad.txt (2.67 KB): map that compiles and contains the above directives. Jean-Charles Lopez, 06/14/2014 01:44 PM
- Assignee set to Joao Eduardo Luis
- Priority changed from Normal to High
JC, although we don't have a fix for the crash yet (we shouldn't crash if a crushmap is incorrectly structured), there's an easy way to avoid the crash.
Basically there are two things to note:
1. The chooseleaf steps in rule 'ssd' and rule 'hdd' aren't doing what you think they're doing: they first grab leaves from 'rack', and then they try to grab leaves from 'host'.
2. What you probably want is a 'choose ... rack' followed by a 'chooseleaf ... host'.
Removing the 'chooseleaf ... rack' before the host, or the 'chooseleaf ... host' after the rack will avoid the crash. Changing 'chooseleaf ... rack' to 'choose ... rack' will also avoid the crash.
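As a rough sketch of the second suggestion above, the 'ssd' rule might be rewritten as follows. The firstn counts are simply carried over from the original rule and are an assumption here; they may need adjusting for the replica placement actually intended.

```text
rule ssd {
    ruleset 1
    type replicated
    min_size 1
    max_size 10
    step take ssd
    # select rack buckets first (choose, not chooseleaf)
    step choose firstn 0 type rack
    # then pick hosts within each selected rack and descend to one OSD per host
    step chooseleaf firstn 0 type host
    step emit
}
```

It is probably worth recompiling the map and running it through crushtool --test before injecting it into a live cluster.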
In addition to the choose vs. chooseleaf issue that Joao is mentioning here, we have also seen problems when min_size is lower than what a rule actually requires.
rule crashtest {
    ...
    min_size 1
    step chooseleaf firstn 2 type rack
    step emit
}
This at least causes crushtool --test to segfault; we are not 100% sure whether the MON crashes on this too.
- Project changed from Ceph to RADOS
- Category deleted (Monitor)
- Component(RADOS) Monitor added
- Category set to Correctness/Safety
- Status changed from New to Closed
- Assignee deleted (Joao Eduardo Luis)
Closing because no one has complained for 6 years.