Bug #5203
mon: backup monmap for sync appears to drop correct monitor names?
Description
Came across this one while debugging one of saaby's mon crashes.
Apparently, saaby (@ #ceph) recreated a monitor using the monmap obtained from his cluster (with a formed quorum). That monitor then started to sync and backed up a monmap, as planned.
All hell then broke loose when the monitor was restarted, as the backed-up monmap appears to have the wrong monitor names:
// Obtained from broken mon store
$ ceph-monstore-tool --mon-store-path . --key mon_sync:latest_monmap get-val --out /tmp/mon_sync.monmap
2013-05-30 15:44:52.178388 7f768d504780 -1 did not load config file, using default settings.
obtaining (mon_sync,latest_monmap)
$ monmaptool --print /tmp/mon_sync.monmap
monmaptool: monmap file /tmp/mon_sync.monmap
epoch 3
fsid ab804c03-24c1-4532-9fad-f7c1a2606aa5
last_changed 2013-05-21 16:15:04.234470
created 0.000000
0: 10.81.16.11:6789/0 mon.0
1: 10.81.30.11:6789/0 mon.1
2: 10.83.27.11:6789/0 mon.2
Note how the backup monmap's monitor names are mon.0, mon.1 and mon.2, which appear to be derived from each monitor's rank. Instead, they should have been as follows:
// Obtained from a healthier, earlier version of the store
$ ceph-monstore-tool --mon-store-path . getmonmap --out /tmp/mon_sync.monmap.02
2013-05-30 15:47:20.007503 7f2d41a0f780 -1 did not load config file, using default settings.
$ monmaptool --print /tmp/mon_sync.monmap.02
monmaptool: monmap file /tmp/mon_sync.monmap.02
epoch 3
fsid ab804c03-24c1-4532-9fad-f7c1a2606aa5
last_changed 2013-05-21 16:15:04.234470
created 0.000000
0: 10.81.16.11:6789/0 mon.ceph1-cph1c16-mon1
1: 10.81.30.11:6789/0 mon.ceph1-cph1f11-mon1
2: 10.83.27.11:6789/0 mon.ceph1-cph2i11-mon1
These were the names that were supposed to be on the monmap.
Note how the last_changed timestamps match though.
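To make the mismatch easier to spot, the two dumps can be compared mechanically. The following is only a minimal sketch: it assumes the entry layout seen in the `monmaptool --print` output quoted in this report (`<rank>: <address> mon.<name>`), and the helper names `parse_monmap_print` and `renamed_monitors` are hypothetical, not part of any Ceph tool.

```python
import re

# Entry layout as seen in the `monmaptool --print` output quoted in this report:
#   <rank>: <ip>:<port>/<nonce> mon.<name>
# This layout is an assumption based on that output, not a documented format.
ENTRY = re.compile(r"\b(\d+):\s+(\S+)\s+mon\.(\S+)")


def parse_monmap_print(text):
    """Map each monitor address to its name (hypothetical helper)."""
    return {addr: name for _rank, addr, name in ENTRY.findall(text)}


def renamed_monitors(expected_text, actual_text):
    """List (address, expected_name, actual_name) where the names differ."""
    expected = parse_monmap_print(expected_text)
    actual = parse_monmap_print(actual_text)
    shared = sorted(expected.keys() & actual.keys())
    return [(a, expected[a], actual[a]) for a in shared if expected[a] != actual[a]]


# Monitor entries copied from the two dumps in the description.
HEALTHY = """\
0: 10.81.16.11:6789/0 mon.ceph1-cph1c16-mon1
1: 10.81.30.11:6789/0 mon.ceph1-cph1f11-mon1
2: 10.83.27.11:6789/0 mon.ceph1-cph2i11-mon1
"""
BROKEN = """\
0: 10.81.16.11:6789/0 mon.0
1: 10.81.30.11:6789/0 mon.1
2: 10.83.27.11:6789/0 mon.2
"""

for addr, want, got in renamed_monitors(HEALTHY, BROKEN):
    print(f"{addr}: expected mon.{want}, found mon.{got}")
```

Matching on address rather than rank keeps the comparison meaningful even if the two maps list monitors in a different order.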
Updated by Joao Eduardo Luis almost 11 years ago
- Status changed from New to 12
Verified by forcing a monitor to sync and to assert out before actually synchronizing (using --mon-sync-requester-kill-at 1).
The backup monmap should contain this:
ubuntu@mira050:~/joao/ceph/src$ ./ceph mon getmap -o /tmp/monmap.02
got latest monmap
ubuntu@mira050:~/joao/ceph/src$ ./monmaptool --print /tmp/monmap.02
./monmaptool: monmap file /tmp/monmap.02
epoch 1
fsid 210a6622-c1c9-4614-a79e-8aa90b13a06a
last_changed 2013-05-22 06:08:24.591552
created 2013-05-22 06:08:24.591552
0: 127.0.0.1:6789/0 mon.a
1: 127.0.0.1:6790/0 mon.b
2: 127.0.0.1:6791/0 mon.c
but instead contains:
ubuntu@mira050:~/joao/ceph/src$ ./monmaptool --print /tmp/monmap.01
./monmaptool: monmap file /tmp/monmap.01
epoch 1
fsid 210a6622-c1c9-4614-a79e-8aa90b13a06a
last_changed 2013-05-22 06:08:24.591552
created 2013-05-22 06:08:24.591552
0: 127.0.0.1:6789/0 mon.0
1: 127.0.0.1:6790/0 mon.1
2: 127.0.0.1:6791/0 mon.2
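When no known-good monmap is at hand for comparison, the symptom itself (every name equal to its rank) can be used as a rough check. This is a heuristic sketch only: the entry layout is assumed from the `monmaptool --print` output in this report, `looks_rank_named` is a hypothetical helper, and a cluster whose monitors are deliberately named 0, 1, 2 would be flagged as well.

```python
import re

# Assumed entry layout, taken from the output in this report:
#   <rank>: <ip>:<port>/<nonce> mon.<name>
ENTRY = re.compile(r"\b(\d+):\s+(\S+)\s+mon\.(\S+)")


def looks_rank_named(text):
    """True if at least one entry exists and every name equals its rank."""
    entries = ENTRY.findall(text)
    return bool(entries) and all(name == rank for rank, _addr, name in entries)


# Monitor entries copied from the two test-cluster dumps in this comment.
EXPECTED = """\
0: 127.0.0.1:6789/0 mon.a
1: 127.0.0.1:6790/0 mon.b
2: 127.0.0.1:6791/0 mon.c
"""
BACKUP = """\
0: 127.0.0.1:6789/0 mon.0
1: 127.0.0.1:6790/0 mon.1
2: 127.0.0.1:6791/0 mon.2
"""

print(looks_rank_named(EXPECTED))  # False
print(looks_rank_named(BACKUP))    # True
```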
Updated by Joao Eduardo Luis almost 11 years ago
- Description updated
Edit: crash log had nothing to do with this bug. It's an entirely different issue regarding pick_addresses().
Updated by Joao Eduardo Luis almost 11 years ago
- Status changed from 4 to Fix Under Review
Updated by Denis kaganovich almost 11 years ago
Good. This looks like the solution for #5171 too (I am unsure about all cases, and I am still too disturbed to remember precisely whether it also happened on an old, long-running system or only on a re-created monitor).
PS: One last issue before my train: currently (cuttlefish git) there is still a problem with a long timeout on "rbd snap purge". I have 2 VMs with Linux+OCFS2 that reboot when the timeouts occur (I am unsure, but IMHO this is OCFS2's reaction to timeouts; if it still persists in July, I will change the "reset" fencing mode to "panic" and check, though OCFS2 is using heartbeat=local). All you need to know is that "rbd snap purge" is too laggy. Thanks.
Updated by Sage Weil almost 11 years ago
- Status changed from Fix Under Review to Resolved
fix is merged, 626de387e617db457d6d431c16327c275b0e8a34, and backported to cuttlefish.
Denis, can you open a separate bug for the rbd snap purge issue? Thanks!