Bug #11814
Updated by Loïc Dachary almost 9 years ago
h3. Context
* RHEL6
* Hammer 0.94.1
* 3 Mons
* 315 OSDs
h3. Steps to reproduce (crushmap is attached to the ticket)
* $profile = k8m4isa
* ceph osd erasure-code-profile set k8m4isa plugin=isa k=8 m=4 technique=reed_sol_van ruleset-root=bigbang ruleset-failure-domain=host
* $pool = castor-ec-isa
* ceph osd pool create $pool 4096 4096 erasure k8m4isa castor-ec-isa
The mon should crash immediately. We got this backtrace:
<pre>
#0 crush_choose_indep (map=0x363fbc0, bucket=0x0, weight=0x36fa300, weight_max=315, x=-733087052, left=12, numrep=12, type=1, out=0x7fffffffbcb0, outpos=0, tries=100, recurse_tries=5, recurse_to_leaf=1, out2=0x7fffffffbce0, parent_r=0) at crush/mapper.c:664
#1 0x000000000079ec61 in crush_do_rule (map=0x363fbc0, ruleno=<value optimized out>, x=-733087052, result=0x7fffffffbd20, result_max=12, weight=0x36fa300, weight_max=315, scratch=0x7fffffffbc80) at crush/mapper.c:930
#2 0x000000000080cdc5 in CrushWrapper::do_rule (this=<value optimized out>, rule=10, x=-733087052, out=std::vector of length 0, capacity 0, maxout=12, weight=std::vector of length 315, capacity 315 = {...}) at crush/CrushWrapper.h:1025
#3 0x0000000000836c06 in OSDMap::_pg_to_osds (this=0x3888988, pool=..., pg=..., osds=0x7fffffffbea0, primary=0x7fffffffbecc, ppps=0x7fffffffbec4) at osd/OSDMap.cc:1521
#4 0x0000000000837044 in OSDMap::_pg_to_up_acting_osds (this=0x3888988, pg=..., up=0x7fffffffc330, up_primary=0x7fffffffc36c, acting=0x7fffffffc0d0, acting_primary=0x7fffffffc368) at osd/OSDMap.cc:1702
#5 0x000000000065c154 in pg_to_up_acting_osds (this=0x3740e00) at osd/OSDMap.h:677
#6 PGMonitor::map_pg_creates (this=0x3740e00) at mon/PGMonitor.cc:1127
#7 0x000000000065cd7d in PGMonitor::post_paxos_update (this=0x3740e00) at mon/PGMonitor.cc:311
#8 0x0000000000583431 in Monitor::refresh_from_paxos (this=0x3878000, need_bootstrap=0x0) at mon/Monitor.cc:791
#9 0x00000000005836d5 in Monitor::init_paxos (this=0x3878000) at mon/Monitor.cc:766
#10 0x000000000059a411 in Monitor::preinit (this=0x3878000) at mon/Monitor.cc:651
#11 0x000000000055519a in main (argc=<value optimized out>, argv=0x36b00b0) at ceph_mon.cc:731
</pre>
In fact, I just noticed that the (probable) cause of the crash is that we created the erasure-code-profile with ruleset-root=bigbang, but this root has been decommissioned. That would be consistent with bucket=0x0 in frame #0 of the backtrace. Had I noticed that earlier, I would have fixed this parameter and the MONs wouldn't have crashed (as far as I can tell).
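A pre-flight check along these lines might have caught the stale root before the pool was created. This is only a minimal sketch, not part of Ceph: the helper names @root_exists@ and @check_profile_root@ are hypothetical, the sample data is made up, and the JSON layout (a top-level "buckets" list whose entries carry "name" and "type_name") is an assumption about what @ceph osd crush dump@ prints on this release.

```python
import json

# Hypothetical, trimmed-down sample in the shape that `ceph osd crush dump`
# is assumed to produce; only the fields used below are included.
SAMPLE_CRUSH_DUMP = json.loads("""
{
  "buckets": [
    {"id": -1, "name": "default", "type_name": "root"},
    {"id": -2, "name": "host-a", "type_name": "host"}
  ]
}
""")


def root_exists(crush_dump, root_name):
    """Return True if a bucket of type "root" named root_name is present."""
    return any(
        bucket.get("name") == root_name and bucket.get("type_name") == "root"
        for bucket in crush_dump.get("buckets", [])
    )


def check_profile_root(crush_dump, profile):
    """Raise ValueError if the profile's ruleset-root is not in the CRUSH map.

    `profile` is a plain dict of erasure-code-profile settings, for example
    {"plugin": "isa", "k": "8", "m": "4", "ruleset-root": "bigbang"}.
    Falling back to "default" when the key is absent is an assumption.
    """
    root = profile.get("ruleset-root", "default")
    if not root_exists(crush_dump, root):
        raise ValueError("ruleset-root %r not found in CRUSH map" % root)
    return root
```

With the sample map above, @check_profile_root(SAMPLE_CRUSH_DUMP, {"ruleset-root": "bigbang"})@ raises @ValueError@, which is the situation described in this ticket. In practice the dict would be loaded from the real CRUSH dump rather than from an inline sample.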