Bug #5203

mon: backup monmap for sync appears to drop correct monitor names?

Added by Joao Eduardo Luis almost 11 years ago. Updated almost 11 years ago.

Status: Resolved
Priority: High
Category: Monitor
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Came across this one while debugging one of saaby's mon crashes.

Apparently, saaby (on #ceph) recreated a monitor using the monmap obtained from his cluster (which had a formed quorum). That monitor then proceeded to sync, and backed up a monmap as expected.

All hell then broke loose when the monitor was restarted, as the backed-up monmap appears to have messed up the monitor names:


// Obtained from broken mon store

$ ceph-monstore-tool --mon-store-path . --key mon_sync:latest_monmap get-val --out /tmp/mon_sync.monmap
2013-05-30 15:44:52.178388 7f768d504780 -1 did not load config file, using default settings.
obtaining (mon_sync,latest_monmap)

$ monmaptool --print /tmp/mon_sync.monmap 
monmaptool: monmap file /tmp/mon_sync.monmap
epoch 3
fsid ab804c03-24c1-4532-9fad-f7c1a2606aa5
last_changed 2013-05-21 16:15:04.234470
created 0.000000
0: 10.81.16.11:6789/0 mon.0
1: 10.81.30.11:6789/0 mon.1
2: 10.83.27.11:6789/0 mon.2

Note how the backup monmap's monitor names are mon.0, mon.1 and mon.2, which seem to follow rank. Instead, they should have been as follows:


// Obtained from a healthier, earlier version of the store

$ ceph-monstore-tool --mon-store-path . getmonmap --out /tmp/mon_sync.monmap.02
2013-05-30 15:47:20.007503 7f2d41a0f780 -1 did not load config file, using default settings.

$ monmaptool --print /tmp/mon_sync.monmap.02
monmaptool: monmap file /tmp/mon_sync.monmap.02
epoch 3
fsid ab804c03-24c1-4532-9fad-f7c1a2606aa5
last_changed 2013-05-21 16:15:04.234470
created 0.000000
0: 10.81.16.11:6789/0 mon.ceph1-cph1c16-mon1
1: 10.81.30.11:6789/0 mon.ceph1-cph1f11-mon1
2: 10.83.27.11:6789/0 mon.ceph1-cph2i11-mon1

These were the names that were supposed to be on the monmap.

Note how the last_changed timestamps match though.
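
For anyone wanting to check a suspect store for the same symptom, the two dumps above can be combined into a quick comparison. This is only a sketch, assuming it is run from inside the monitor's store directory and that the mon_sync:latest_monmap key is still present; the /tmp paths are purely illustrative:

// Quick check: compare the sync backup against the store's own monmap
$ ceph-monstore-tool --mon-store-path . --key mon_sync:latest_monmap get-val --out /tmp/sync.monmap
$ ceph-monstore-tool --mon-store-path . getmonmap --out /tmp/store.monmap
$ diff <(monmaptool --print /tmp/sync.monmap) <(monmaptool --print /tmp/store.monmap)

Apart from the file names in the headers, a healthy store's two prints should only differ if the map epoch has moved on; on an affected store the names themselves differ, with rank-style names (mon.0, mon.1, ...) on the sync side.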

#1

Updated by Joao Eduardo Luis almost 11 years ago

  • Status changed from New to 12

Verified by forcing a monitor to sync and to assert out before actually synchronizing (using --mon-sync-requester-kill-at 1).

The backed-up monmap should contain this:

ubuntu@mira050:~/joao/ceph/src$ ./ceph mon getmap -o /tmp/monmap.02
got latest monmap
ubuntu@mira050:~/joao/ceph/src$ ./monmaptool --print /tmp/monmap.02
./monmaptool: monmap file /tmp/monmap.02
epoch 1
fsid 210a6622-c1c9-4614-a79e-8aa90b13a06a
last_changed 2013-05-22 06:08:24.591552
created 2013-05-22 06:08:24.591552
0: 127.0.0.1:6789/0 mon.a
1: 127.0.0.1:6790/0 mon.b
2: 127.0.0.1:6791/0 mon.c

but instead contains

ubuntu@mira050:~/joao/ceph/src$ ./monmaptool --print /tmp/monmap.01 
./monmaptool: monmap file /tmp/monmap.01
epoch 1
fsid 210a6622-c1c9-4614-a79e-8aa90b13a06a
last_changed 2013-05-22 06:08:24.591552
created 2013-05-22 06:08:24.591552
0: 127.0.0.1:6789/0 mon.0
1: 127.0.0.1:6790/0 mon.1
2: 127.0.0.1:6791/0 mon.2
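
For completeness, here is roughly how a forced sync like the one above can be reproduced on a vstart-style dev cluster. This is only a sketch; the mon id "c", the ./ceph.conf and ./keyring paths and the dev/mon.c store path are illustrative assumptions, not commands taken verbatim from this run:

// Rough reproduction sketch (mon id and paths are illustrative)
// 1. grab the current monmap, which carries the real names
$ ./ceph mon getmap -o /tmp/monmap
// 2. stop mon.c, wipe its old store, and recreate it from that monmap
$ ./ceph-mon -c ./ceph.conf --mkfs -i c --monmap /tmp/monmap --keyring ./keyring
// 3. start it with the kill point: it syncs, backs up the monmap, then asserts out
$ ./ceph-mon -c ./ceph.conf -i c --mon-sync-requester-kill-at 1
// 4. inspect the backup left behind in the freshly synced store
$ ./ceph-monstore-tool --mon-store-path dev/mon.c --key mon_sync:latest_monmap get-val --out /tmp/monmap.01
$ ./monmaptool --print /tmp/monmap.01

With the bug present, that last print shows mon.0/mon.1/mon.2 instead of mon.a/mon.b/mon.c.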

#2

Updated by Joao Eduardo Luis almost 11 years ago

  • Description updated (diff)

Edit: crash log had nothing to do with this bug. It's an entirely different issue regarding pick_addresses().

#3

Updated by Sage Weil almost 11 years ago

  • Priority changed from Normal to High

#4

Updated by Joao Eduardo Luis almost 11 years ago

Proposed fix is in wip-5203.

#5

Updated by Joao Eduardo Luis almost 11 years ago

  • Status changed from 12 to 4

#6

Updated by Joao Eduardo Luis almost 11 years ago

  • Status changed from 4 to Fix Under Review

#7

Updated by Denis kaganovich almost 11 years ago

Good. Looks like a solution for #5171 too (unsure about all cases, and I'm still too disturbed to remember precisely whether it happened on a long-running system too or only on a re-created monitor).

PS: Last issue before my train: with current cuttlefish git there is still a problem with a big timeout on "rbd snap purge". I have 2 VMs with Linux+OCFS2 which reboot when timeouts occur (I'm not sure, but IMHO it is an OCFS2 timeout reaction; if it still persists, maybe in July I will change the "reset" fencing mode to "panic" and check, though OCFS2's heartbeat=local). But the main thing to know is that "rbd snap purge" is too laggy. Thanks.

#8

Updated by Sage Weil almost 11 years ago

  • Status changed from Fix Under Review to Resolved

Fix is merged (626de387e617db457d6d431c16327c275b0e8a34) and backported to cuttlefish.
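
As an aside, one way to check whether a local master checkout already contains that commit (just a sketch, assuming a git clone of ceph; the cuttlefish backport is a separate cherry-pick, so this particular check only applies to master):

$ git merge-base --is-ancestor 626de387e617db457d6d431c16327c275b0e8a34 HEAD && echo "fix present" || echo "fix missing"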

Denis, can you open a separate bug for the rbd snap purge issue? Thanks!
