Bug #5203

mon: backup monmap for sync appears to drop correct monitor names?

Added by Joao Eduardo Luis almost 11 years ago. Updated almost 11 years ago.

Status: Resolved
Priority: High
Category: Monitor
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Came across this one while debugging one of saaby's mon crashes.

Apparently, saaby (on #ceph) recreated a monitor using the monmap obtained from his cluster (which had a formed quorum). That monitor then proceeded to sync, and backed up a monmap as expected.

All hell then broke loose when the monitor was restarted, as the backed-up monmap appears to have messed up the monitor names:


// Obtained from broken mon store

$ ceph-monstore-tool --mon-store-path . --key mon_sync:latest_monmap get-val --out /tmp/mon_sync.monmap
2013-05-30 15:44:52.178388 7f768d504780 -1 did not load config file, using default settings.
obtaining (mon_sync,latest_monmap)

$ monmaptool --print /tmp/mon_sync.monmap 
monmaptool: monmap file /tmp/mon_sync.monmap
epoch 3
fsid ab804c03-24c1-4532-9fad-f7c1a2606aa5
last_changed 2013-05-21 16:15:04.234470
created 0.000000
0: 10.81.16.11:6789/0 mon.0
1: 10.81.30.11:6789/0 mon.1
2: 10.83.27.11:6789/0 mon.2

Note how the backup monmap's monitor names are mon.0, mon.1 and mon.2, which seem to follow rank. Instead, they should have been as follows:


// Obtained from a healthier, earlier version of the store

$ ceph-monstore-tool --mon-store-path . getmonmap --out /tmp/mon_sync.monmap.02
2013-05-30 15:47:20.007503 7f2d41a0f780 -1 did not load config file, using default settings.

$ monmaptool --print /tmp/mon_sync.monmap.02
monmaptool: monmap file /tmp/mon_sync.monmap.02
epoch 3
fsid ab804c03-24c1-4532-9fad-f7c1a2606aa5
last_changed 2013-05-21 16:15:04.234470
created 0.000000
0: 10.81.16.11:6789/0 mon.ceph1-cph1c16-mon1
1: 10.81.30.11:6789/0 mon.ceph1-cph1f11-mon1
2: 10.83.27.11:6789/0 mon.ceph1-cph2i11-mon1

These were the names that were supposed to be on the monmap.

Note how the last_changed timestamps match though.
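
For anyone wanting to check a suspect store for the same symptom, the two dumps above can be combined into a quick comparison. This is only a sketch, assuming it is run from inside the monitor's store directory and that the mon_sync:latest_monmap key is still present; the /tmp paths are purely illustrative:

// Quick check: compare the sync backup against the store's own monmap
$ ceph-monstore-tool --mon-store-path . --key mon_sync:latest_monmap get-val --out /tmp/sync.monmap
$ ceph-monstore-tool --mon-store-path . getmonmap --out /tmp/store.monmap
$ diff <(monmaptool --print /tmp/sync.monmap) <(monmaptool --print /tmp/store.monmap)

Apart from the file names in the headers, a healthy store's two prints should only differ if the map epoch has moved on; on an affected store the names themselves differ, with rank-style names (mon.0, mon.1, ...) on the sync side.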

#1

Updated by Joao Eduardo Luis almost 11 years ago

  • Status changed from New to 12

Verified by forcing a monitor to sync and to assert out before actually synchronizing (using --mon-sync-requester-kill-at 1).

The backed-up monmap should contain this:

ubuntu@mira050:~/joao/ceph/src$ ./ceph mon getmap -o /tmp/monmap.02
got latest monmap
ubuntu@mira050:~/joao/ceph/src$ ./monmaptool --print /tmp/monmap.02
./monmaptool: monmap file /tmp/monmap.02
epoch 1
fsid 210a6622-c1c9-4614-a79e-8aa90b13a06a
last_changed 2013-05-22 06:08:24.591552
created 2013-05-22 06:08:24.591552
0: 127.0.0.1:6789/0 mon.a
1: 127.0.0.1:6790/0 mon.b
2: 127.0.0.1:6791/0 mon.c

but instead contains

ubuntu@mira050:~/joao/ceph/src$ ./monmaptool --print /tmp/monmap.01 
./monmaptool: monmap file /tmp/monmap.01
epoch 1
fsid 210a6622-c1c9-4614-a79e-8aa90b13a06a
last_changed 2013-05-22 06:08:24.591552
created 2013-05-22 06:08:24.591552
0: 127.0.0.1:6789/0 mon.0
1: 127.0.0.1:6790/0 mon.1
2: 127.0.0.1:6791/0 mon.2
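
For completeness, here is roughly how a forced sync like the one above can be reproduced on a vstart-style dev cluster. This is only a sketch; the mon id "c", the ./ceph.conf and ./keyring paths and the dev/mon.c store path are illustrative assumptions, not commands taken verbatim from this run:

// Rough reproduction sketch (mon id and paths are illustrative)
// 1. grab the current monmap, which carries the real names
$ ./ceph mon getmap -o /tmp/monmap
// 2. stop mon.c, wipe its old store, and recreate it from that monmap
$ ./ceph-mon -c ./ceph.conf --mkfs -i c --monmap /tmp/monmap --keyring ./keyring
// 3. start it with the kill point: it syncs, backs up the monmap, then asserts out
$ ./ceph-mon -c ./ceph.conf -i c --mon-sync-requester-kill-at 1
// 4. inspect the backup left behind in the freshly synced store
$ ./ceph-monstore-tool --mon-store-path dev/mon.c --key mon_sync:latest_monmap get-val --out /tmp/monmap.01
$ ./monmaptool --print /tmp/monmap.01

With the bug present, that last print shows mon.0/mon.1/mon.2 instead of mon.a/mon.b/mon.c.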

#2

Updated by Joao Eduardo Luis almost 11 years ago

  • Description updated (diff)

Edit: crash log had nothing to do with this bug. It's an entirely different issue regarding pick_addresses().

#3

Updated by Sage Weil almost 11 years ago

  • Priority changed from Normal to High

#4

Updated by Joao Eduardo Luis almost 11 years ago

Proposed fix is in wip-5203.

#5

Updated by Joao Eduardo Luis almost 11 years ago

  • Status changed from 12 to 4

#6

Updated by Joao Eduardo Luis almost 11 years ago

  • Status changed from 4 to Fix Under Review

#7

Updated by Denis kaganovich almost 11 years ago

Good. Looks like a solution for #5171 too (unsure about all cases, and I'm still too disturbed to remember precisely whether it happened on a long-running system too or only on a re-created monitor).

PS: Last issue before my train: with current cuttlefish git there is still a problem with a big timeout on "rbd snap purge". I have 2 VMs with Linux+OCFS2 which reboot when timeouts occur (I'm not sure, but IMHO it is an OCFS2 timeout reaction; if it still persists, maybe in July I will change the "reset" fencing mode to "panic" and check, though OCFS2's heartbeat=local). But the main thing to know is that "rbd snap purge" is too laggy. Thanks.

#8

Updated by Sage Weil almost 11 years ago

  • Status changed from Fix Under Review to Resolved

Fix is merged (626de387e617db457d6d431c16327c275b0e8a34) and backported to cuttlefish.
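
As an aside, one way to check whether a local master checkout already contains that commit (just a sketch, assuming a git clone of ceph; the cuttlefish backport is a separate cherry-pick, so this particular check only applies to master):

$ git merge-base --is-ancestor 626de387e617db457d6d431c16327c275b0e8a34 HEAD && echo "fix present" || echo "fix missing"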

Denis, can you open a separate bug for the rbd snap purge issue? Thanks!
