Bug #5343


mon: infinite OSDMonitor::update_from_paxos() on single-monitor setup

Added by Joao Eduardo Luis almost 11 years ago. Updated almost 11 years ago.

Status: Resolved
Priority: High
Assignee: Joao Eduardo Luis
Category: Monitor
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression:
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

A user on ceph-users shared a log showing a most interesting behavior in OSDMonitor::update_from_paxos(): the same incremental is applied over and over again. From the log, this is happening on a single-monitor setup.

The monitor starts nicely, creating the OSDMonitor's pending value for epoch 2810.

Eventually, MDSMonitor::fail_mds_gid() triggers a cross-proposal to the OSDMonitor. A full map for epoch 2809 is encoded, along with the 2810 incremental:

2013-06-13 11:36:42.114193 7f5894e25700 10 mon.a@0(leader).mds e136 preprocess_query mdsbeacon(45703/a up:boot seq 1 v0) v2 from mds.? 10.255.0.25:6800/10106
2013-06-13 11:36:42.114205 7f5894e25700 12 mon.a@0(leader).mds e136 preprocess_beacon mdsbeacon(45703/a up:boot seq 1 v0) v2 from mds.? 10.255.0.25:6800/10106 compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding}
2013-06-13 11:36:42.114219 7f5894e25700  7 mon.a@0(leader).mds e136 prepare_update mdsbeacon(45703/a up:boot seq 1 v0) v2
2013-06-13 11:36:42.114222 7f5894e25700 12 mon.a@0(leader).mds e136 prepare_beacon mdsbeacon(45703/a up:boot seq 1 v0) v2 from mds.? 10.255.0.25:6800/10106
2013-06-13 11:36:42.114229 7f5894e25700 10 mon.a@0(leader).mds e136 fail_mds_gid 45404 mds.a rank 0
2013-06-13 11:36:42.114234 7f5894e25700 10 mon.a@0(leader).osd e2809 blacklist 10.255.0.25:6800/11482 until 2013-06-13 12:00:42.114233
2013-06-13 11:36:42.114467 7f5894e25700 10 mon.a@0(leader).osd e2809  min_last_epoch_clean 2796
2013-06-13 11:36:42.114590 7f5894e25700 10 mon.a@0(leader).osd e2809 encode_full osdmap e 2809
2013-06-13 11:36:42.114734 7f5894e25700 10 mon.a@0(leader).osd e2809  min_last_epoch_clean 2796
2013-06-13 11:36:42.114822 7f5894e25700 10 mon.a@0(leader).osd e2809 encode_pending e 2810

Since this is a single-monitor setup, the proposal is swiftly committed and OSDMonitor::update_from_paxos() is called. We are then caught in a loop that goes on forever:

2013-06-13 11:36:42.452660 7f5894e25700 15 mon.a@0(leader).osd e2809 update_from_paxos paxos e 2810, my e 2809
2013-06-13 11:36:42.452723 7f5894e25700  7 mon.a@0(leader).osd e2809 update_from_paxos  applying incremental 2810
2013-06-13 11:36:42.452783 7f5894e25700  1 mon.a@0(leader).osd e2809 e2809: 9 osds: 9 up, 9 in
2013-06-13 11:36:42.452808 7f5894e25700  7 mon.a@0(leader).osd e2809 update_from_paxos  applying incremental 2810
2013-06-13 11:36:42.452849 7f5894e25700  1 mon.a@0(leader).osd e2809 e2809: 9 osds: 9 up, 9 in
2013-06-13 11:36:42.452872 7f5894e25700  7 mon.a@0(leader).osd e2809 update_from_paxos  applying incremental 2810
2013-06-13 11:36:42.452913 7f5894e25700  1 mon.a@0(leader).osd e2809 e2809: 9 osds: 9 up, 9 in

From the code we can deduce that this will only happen if the in-memory osdmap version is not increased when we apply the incremental. This also prevents the monitor from making progress: despite the on-disk osdmap:last_committed being set to 2810, and despite the incremental being applied to the in-memory map over and over, the OSDMonitor's epoch never reaches 2810, as the log makes clear. Restarting the monitor doesn't help either.

After being restarted, this is what happens:

2013-06-13 16:30:49.291580 7ffe4b2e3780  1 mon.a@-1(probing).osd e2809 e2809: 9 osds: 9 up, 9 in
2013-06-13 16:30:49.291602 7ffe4b2e3780  7 mon.a@-1(probing).osd e2809 update_from_paxos  applying incremental 2810

[restart happens here]

2013-06-13 16:43:30.656501 7f544eae5780  0 ceph version 0.61.3 (92b1e398576d55df8e5888dd1a9545ed3fd99532), process ceph-mon, pid 19554
2013-06-13 16:43:30.848864 7f544eae5780  1 mon.a@-1(probing) e12 preinit fsid d459cc6e-61b4-41e5-8987-242e12fd760e
2013-06-13 16:43:30.875786 7f544eae5780  1 mon.a@-1(probing).osd e2809 e2809: 9 osds: 9 up, 9 in
2013-06-13 16:43:30.875858 7f544eae5780  1 mon.a@-1(probing).osd e2809 e2809: 9 osds: 9 up, 9 in

[continues as such till the end of the log]
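To make the failure mode concrete, here is a minimal self-contained toy model (not the actual Ceph code) of the loop shape deduced above: if the apply step fails and the in-memory epoch never advances, the loop condition stays true forever, which also explains why a restart changes nothing.

#include <cstdio>

// Toy stand-ins for the real OSDMap/Incremental types; the apply step
// fails (for whatever reason) and therefore leaves the epoch untouched.
struct Incremental { int epoch; };

struct OSDMap {
  int epoch = 2809;
  int apply_incremental(const Incremental&) { return -22; /* -EINVAL */ }
};

int main() {
  OSDMap osdmap;
  int last_committed = 2810;                    // on-disk osdmap:last_committed
  Incremental inc{2810};

  int iterations = 0;
  while (last_committed > osdmap.epoch) {       // 2810 > 2809, forever
    printf("update_from_paxos  applying incremental %d\n", osdmap.epoch + 1);
    osdmap.apply_incremental(inc);              // return value ignored, as in the bug
    if (++iterations >= 3) { printf("...and so on, forever\n"); break; }
  }
  return 0;
}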
Actions #1

Updated by Joao Eduardo Luis almost 11 years ago

  • Status changed from New to In Progress
Actions #2

Updated by Joao Eduardo Luis almost 11 years ago

Running gdb, it looks like the 2810 incremental's fsid is different from the OSDMap's fsid:

(gdb) p inc.fsid
$2 = {uuid = "\365.\323\032\312dH\266\275a\342\031)\230\315/"}
(gdb) p fsid
$3 = {uuid = "\324Y\314na\264A$.\022\375v\016"}

Now to figure out what may be behind this. Also, we should have been checking osdmap.apply_incremental()'s return value and we're not -- this is why we're stuck in the infinite loop: apply_incremental() fails every single time, the osdmap's epoch is not updated, and the loop condition (2810 > 2809) is forever true.
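A rough sketch (again a self-contained toy, not the real monitor code) of the kind of check being described: inspect apply_incremental()'s return value and fail loudly instead of silently retrying the same incremental forever. The actual fix may report the error differently.

#include <cassert>
#include <cstdio>

struct Incremental { int epoch; bool fsid_matches; };

struct OSDMap {
  int epoch = 2809;
  int apply_incremental(const Incremental& inc) {
    if (!inc.fsid_matches)
      return -22;                 // -EINVAL: mismatched fsid, epoch untouched
    epoch = inc.epoch;
    return 0;
  }
};

int main() {
  OSDMap osdmap;
  Incremental inc{2810, false};   // the bad incremental from the log
  int last_committed = 2810;

  while (last_committed > osdmap.epoch) {
    int err = osdmap.apply_incremental(inc);
    if (err < 0) {
      // The missing piece: check the return value and bail out instead
      // of retrying forever (this toy deliberately aborts here).
      fprintf(stderr, "applying incremental %d failed (%d)\n", osdmap.epoch + 1, err);
      assert(err == 0);
    }
  }
  return 0;
}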

Actions #3

Updated by Joao Eduardo Luis almost 11 years ago

Greg pointed out that it's likely the fsid issue results from messing around with the monmap's fsid. Setting up a test for that.

Actions #4

Updated by Joao Eduardo Luis almost 11 years ago

  • Status changed from In Progress to 4

I ran the following test for an already existing single-monitor setup:

  • generate monmap with random fsid
  • inject it
  • start the monitor
  • force it to break / meet the conditions for the infinite loop

Basically, the new incremental will have the new random fsid vs the osdmap's original fsid (which matches the cluster_uuid and the previous monmap's fsid). There is no going back from this unless we change the incremental's fsid or we get rid of the incremental altogether.
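For illustration only, a toy version (not the actual OSDMap code) of the guard this runs into: an incremental whose fsid doesn't match the map's is rejected outright, so the epoch can never advance. The injected fsid below is a placeholder; the osdmap's fsid is the cluster fsid seen in the preinit log line above.

#include <cerrno>
#include <cstdio>
#include <string>

struct ToyOSDMap      { int epoch; std::string fsid; };
struct ToyIncremental { int epoch; std::string fsid; };

int apply_incremental(ToyOSDMap& map, const ToyIncremental& inc) {
  if (inc.fsid != map.fsid)
    return -EINVAL;               // mismatched fsid: nothing gets applied
  map.epoch = inc.epoch;
  return 0;
}

int main() {
  ToyOSDMap osdmap{2809, "d459cc6e-61b4-41e5-8987-242e12fd760e"};       // original cluster fsid
  ToyIncremental inc{2810, "<random fsid from the injected monmap>"};   // placeholder
  printf("apply -> %d, epoch still %d\n", apply_incremental(osdmap, inc), osdmap.epoch);
  return 0;
}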

After talking with Sage: considering we now understand how we got here (the injected monmap had a bad fsid, and the monitor didn't exit as it should have because of the prefork bug), and that once the prefork bug's fix is backported to cuttlefish (Real Soon Now) we won't be exposed to this sort of issue, we are not attempting to circumvent it.

A fix for update_from_paxos()'s infinite loop will be pushed shortly.

Actions #5

Updated by Sage Weil almost 11 years ago

  • Status changed from 4 to Pending Backport

pushed.. will backport once we have done more testing

Actions #6

Updated by Sage Weil almost 11 years ago

Can you generate and attach a udev log after the reboot? Actually, ideally,

- reboot
- note the time
- run partprobe

and send the log along so i can tell what activity resulted from partprobe (tho i guess it'll be obvious).

Thanks! (Also if you're on irc right now that'd be quicker to debug this.. #ceph on irc.oftc.net)

Actions #7

Updated by Joao Eduardo Luis almost 11 years ago

Sage, was that reply intended for this ticket? If it was I'm surely missing something...

Actions #8

Updated by Sage Weil almost 11 years ago

nope wrong ticket; ignore

Actions #9

Updated by Sage Weil almost 11 years ago

  • Status changed from Pending Backport to Resolved