Bug #5203

Updated by Joao Eduardo Luis almost 11 years ago

Came across this one while debugging one of saaby's mon crashes. 

 Apparently, saaby (@ #ceph) recreated a monitor using the monmap obtained from his cluster (which had a formed quorum). That monitor then proceeded to sync and backed up a monmap, as designed.

 Things broke when the monitor was restarted, because the backed-up monmap appears to hold the wrong monitor names:

 <pre> 

 // Obtained from broken mon store 

 $ ceph-monstore-tool --mon-store-path . --key mon_sync:latest_monmap get-val --out /tmp/mon_sync.monmap 
 2013-05-30 15:44:52.178388 7f768d504780 -1 did not load config file, using default settings. 
 obtaining (mon_sync,latest_monmap) 

 $ monmaptool --print /tmp/mon_sync.monmap  
 monmaptool: monmap file /tmp/mon_sync.monmap 
 epoch 3 
 fsid ab804c03-24c1-4532-9fad-f7c1a2606aa5 
 last_changed 2013-05-21 16:15:04.234470 
 created 0.000000 
 0: 10.81.16.11:6789/0 mon.0 
 1: 10.81.30.11:6789/0 mon.1 
 2: 10.83.27.11:6789/0 mon.2 
 </pre> 

 Note that the monitor names in the backup monmap are mon.0, mon.1 and mon.2, which appear to be derived from each monitor's rank. They should have been as follows:

 <pre> 

 // Obtained from a healthier, earlier version of the store 

 $ ceph-monstore-tool --mon-store-path . getmonmap --out /tmp/mon_sync.monmap.02 
 2013-05-30 15:47:20.007503 7f2d41a0f780 -1 did not load config file, using default settings. 

 $ monmaptool --print /tmp/mon_sync.monmap.02 
 monmaptool: monmap file /tmp/mon_sync.monmap.02 
 epoch 3 
 fsid ab804c03-24c1-4532-9fad-f7c1a2606aa5 
 last_changed 2013-05-21 16:15:04.234470 
 created 0.000000 
 0: 10.81.16.11:6789/0 mon.ceph1-cph1c16-mon1 
 1: 10.81.30.11:6789/0 mon.ceph1-cph1f11-mon1 
 2: 10.83.27.11:6789/0 mon.ceph1-cph2i11-mon1 
 </pre> 

 These are the names the monmap was supposed to contain.

 Note, though, that the epoch, fsid and last_changed timestamp are identical in both monmaps.
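 The comparison above can also be done mechanically. What follows is only a rough sketch: the helper names (parse_monitors, rank_named, renamed) are made up for illustration and are not part of any Ceph tool, and the regular expression assumes that monitor lines look exactly like the `monmaptool --print` output pasted in this ticket.

```python
import re

# Assumption: monitor lines look like "0: 10.81.16.11:6789/0 mon.0",
# as in the monmaptool --print output quoted in this ticket.
MON_LINE = re.compile(r"^\s*(\d+):\s+(\S+)\s+mon\.(\S+)\s*$")

# Monitor lines copied from the two monmaps shown in this ticket.
BROKEN = """\
epoch 3
0: 10.81.16.11:6789/0 mon.0
1: 10.81.30.11:6789/0 mon.1
2: 10.83.27.11:6789/0 mon.2
"""

HEALTHY = """\
epoch 3
0: 10.81.16.11:6789/0 mon.ceph1-cph1c16-mon1
1: 10.81.30.11:6789/0 mon.ceph1-cph1f11-mon1
2: 10.83.27.11:6789/0 mon.ceph1-cph2i11-mon1
"""


def parse_monitors(text):
    """Return (rank, addr, name) tuples for every monitor line in the text."""
    mons = []
    for line in text.splitlines():
        m = MON_LINE.match(line)
        if m:
            mons.append((int(m.group(1)), m.group(2), m.group(3)))
    return mons


def rank_named(mons):
    """Return the monitors whose name is just their rank (e.g. mon.0 at rank 0)."""
    return [mon for mon in mons if mon[2] == str(mon[0])]


def renamed(good, bad):
    """Return (addr, good_name, bad_name) for addresses whose names differ."""
    good_by_addr = {addr: name for _, addr, name in good}
    return [
        (addr, good_by_addr[addr], name)
        for _, addr, name in bad
        if addr in good_by_addr and good_by_addr[addr] != name
    ]


if __name__ == "__main__":
    broken = parse_monitors(BROKEN)
    healthy = parse_monitors(HEALTHY)
    print("rank-named monitors in backup monmap:", rank_named(broken))
    for addr, good_name, bad_name in renamed(healthy, broken):
        print(f"{addr}: expected mon.{good_name}, found mon.{bad_name}")
```

 Matching on the address rather than the rank is a deliberate choice here, since the address is the one field that is clearly the same in both monmaps.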

 This is the log of the crash on restart, where the monitor no longer finds its own name in the monmap:

 <pre> 
 root@ceph1-cph1c16-mon1:/var/lib/ceph/mon# ceph-mon -i ceph1-cph1c16-mon1 --debug-osd 20 -d 
 2013-05-29 13:39:23.760439 7f58c4b07780    0 ceph version 0.61.2-26-g1071736 (1071736d3b6611b6c5edeb9b225f32b4e9afdc6d), process ceph-mon, pid 22748 
 2013-05-29 13:39:23.933491 7f58c4b07780    0 mon.ceph1-cph1c16-mon1 does not exist in monmap, will attempt to join an existing cluster 
 common/config.cc: In function 'void md_config_t::set_val_or_die(const char*, const char*)' thread 7f58c4b07780 time 2013-05-29 13:39:23.933914 
 common/config.cc: 621: FAILED assert(ret == 0) 
  ceph version 0.61.2-26-g1071736 (1071736d3b6611b6c5edeb9b225f32b4e9afdc6d) 
  1: ceph-mon() [0x668046] 
  2: ceph-mon() [0x69e889] 
  3: (pick_addresses(CephContext*)+0x8d) [0x69e9ed] 
  4: (main()+0x1a6b) [0x4a146b] 
  5: (__libc_start_main()+0xfd) [0x7f58c2d86ead] 
  6: ceph-mon() [0x4a3609] 
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. 
 </pre>
