Bug #5203
Updated by Joao Eduardo Luis almost 11 years ago
Came across this one while debugging one of saaby's mon crashes.
Apparently, saaby (@ #ceph) recreated a monitor using the monmap obtained from his cluster, which had a formed quorum. That monitor then started to sync and, as planned, backed up a monmap.
All hell then broke loose when the monitor was restarted, as the backed-up monmap appears to contain the wrong monitor names:
<pre>
// Obtained from broken mon store
$ ceph-monstore-tool --mon-store-path . --key mon_sync:latest_monmap get-val --out /tmp/mon_sync.monmap
2013-05-30 15:44:52.178388 7f768d504780 -1 did not load config file, using default settings.
obtaining (mon_sync,latest_monmap)
$ monmaptool --print /tmp/mon_sync.monmap
monmaptool: monmap file /tmp/mon_sync.monmap
epoch 3
fsid ab804c03-24c1-4532-9fad-f7c1a2606aa5
last_changed 2013-05-21 16:15:04.234470
created 0.000000
0: 10.81.16.11:6789/0 mon.0
1: 10.81.30.11:6789/0 mon.1
2: 10.83.27.11:6789/0 mon.2
</pre>
Note how the backup monmap's monitor names are mon.0, mon.1 and mon.2, which appear to have been derived from each monitor's rank. Instead, they should have been as follows:
<pre>
// Obtained from a healthier, earlier version of the store
$ ceph-monstore-tool --mon-store-path . getmonmap --out /tmp/mon_sync.monmap.02
2013-05-30 15:47:20.007503 7f2d41a0f780 -1 did not load config file, using default settings.
$ monmaptool --print /tmp/mon_sync.monmap.02
monmaptool: monmap file /tmp/mon_sync.monmap.02
epoch 3
fsid ab804c03-24c1-4532-9fad-f7c1a2606aa5
last_changed 2013-05-21 16:15:04.234470
created 0.000000
0: 10.81.16.11:6789/0 mon.ceph1-cph1c16-mon1
1: 10.81.30.11:6789/0 mon.ceph1-cph1f11-mon1
2: 10.83.27.11:6789/0 mon.ceph1-cph2i11-mon1
</pre>
These were the names that were supposed to be on the monmap.
Note, though, that the last_changed timestamps match.
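To spot this kind of mismatch quickly, the two `monmaptool --print` outputs above can be compared with a small script. The following is only a sketch: the helper names (`parse_monmap`, `rank_named`, `name_mismatches`) are hypothetical and not part of any Ceph tool, and the parser assumes the monitor lines always look like `<rank>: <addr> mon.<name>`, as they do in the output shown here.

```python
import re

# Matches monitor lines such as "0: 10.81.16.11:6789/0 mon.ceph1-cph1c16-mon1".
# Assumption: this layout is taken from the output above, not from a format spec.
MON_LINE = re.compile(r"^(\d+):\s+(\S+)\s+mon\.(\S+)$")


def parse_monmap(text):
    """Return {rank: (addr, name)} parsed from `monmaptool --print` text."""
    mons = {}
    for line in text.splitlines():
        m = MON_LINE.match(line.strip())
        if m:
            mons[int(m.group(1))] = (m.group(2), m.group(3))
    return mons


def rank_named(mons):
    """Return the ranks whose monitor name is just the rank number."""
    return sorted(r for r, (_, name) in mons.items() if name == str(r))


def name_mismatches(backup, reference):
    """Return (rank, addr, backup_name, reference_name) tuples for entries
    whose address matches at the same rank but whose name differs."""
    out = []
    for rank, (addr, name) in sorted(backup.items()):
        ref = reference.get(rank)
        if ref and ref[0] == addr and ref[1] != name:
            out.append((rank, addr, name, ref[1]))
    return out


# Monitor listings copied from the two outputs above.
BACKUP = """\
epoch 3
fsid ab804c03-24c1-4532-9fad-f7c1a2606aa5
last_changed 2013-05-21 16:15:04.234470
created 0.000000
0: 10.81.16.11:6789/0 mon.0
1: 10.81.30.11:6789/0 mon.1
2: 10.83.27.11:6789/0 mon.2
"""

REFERENCE = """\
epoch 3
fsid ab804c03-24c1-4532-9fad-f7c1a2606aa5
last_changed 2013-05-21 16:15:04.234470
created 0.000000
0: 10.81.16.11:6789/0 mon.ceph1-cph1c16-mon1
1: 10.81.30.11:6789/0 mon.ceph1-cph1f11-mon1
2: 10.83.27.11:6789/0 mon.ceph1-cph2i11-mon1
"""

if __name__ == "__main__":
    backup = parse_monmap(BACKUP)
    reference = parse_monmap(REFERENCE)
    print("rank-named entries in backup:", rank_named(backup))
    for rank, addr, bad, good in name_mismatches(backup, reference):
        print(f"rank {rank} ({addr}): backup has mon.{bad}, expected mon.{good}")
```

For the two monmaps above, every entry in the backup is flagged: the addresses agree rank by rank, but each name in the backup is simply the rank number.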
This is the crash's log:
<pre>
root@ceph1-cph1c16-mon1:/var/lib/ceph/mon# ceph-mon -i ceph1-cph1c16-mon1 --debug-osd 20 -d
2013-05-29 13:39:23.760439 7f58c4b07780 0 ceph version 0.61.2-26-g1071736 (1071736d3b6611b6c5edeb9b225f32b4e9afdc6d), process ceph-mon, pid 22748
2013-05-29 13:39:23.933491 7f58c4b07780 0 mon.ceph1-cph1c16-mon1 does not exist in monmap, will attempt to join an existing cluster
common/config.cc: In function 'void md_config_t::set_val_or_die(const char*, const char*)' thread 7f58c4b07780 time 2013-05-29 13:39:23.933914
common/config.cc: 621: FAILED assert(ret == 0)
ceph version 0.61.2-26-g1071736 (1071736d3b6611b6c5edeb9b225f32b4e9afdc6d)
1: ceph-mon() [0x668046]
2: ceph-mon() [0x69e889]
3: (pick_addresses(CephContext*)+0x8d) [0x69e9ed]
4: (main()+0x1a6b) [0x4a146b]
5: (__libc_start_main()+0xfd) [0x7f58c2d86ead]
6: ceph-mon() [0x4a3609]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
</pre>