Project

General

Profile

Actions

Bug #5343

closed

mon: infinite OSDMonitor::update_from_paxos() on single-monitor setup

Added by Joao Eduardo Luis almost 11 years ago. Updated almost 11 years ago.

Status:
Resolved
Priority:
High
Category:
Monitor
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

A user on ceph-users shared a log containing a most interesting behavior happening on OSDMonitor::update_from_paxos(), during which the same incremental is applied over and over again. From his log, this is happening on a single-monitor setup.

The monitor starts nicely, creating the OSDMonitor's pending value for epoch 2810.

Eventually, MDSMonitor::fail_mds_gid() triggers a cross-proposal to the OSDMonitor. A full map for epoch 2809 is encoded, along with the 2810 incremental:

2013-06-13 11:36:42.114193 7f5894e25700 10 mon.a@0(leader).mds e136 preprocess_query mdsbeacon(45703/a up:boot seq 1 v0) v2 from mds.? 10.255.0.25:6800/10106
2013-06-13 11:36:42.114205 7f5894e25700 12 mon.a@0(leader).mds e136 preprocess_beacon mdsbeacon(45703/a up:boot seq 1 v0) v2 from mds.? 10.255.0.25:6800/10106 compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding}
2013-06-13 11:36:42.114219 7f5894e25700  7 mon.a@0(leader).mds e136 prepare_update mdsbeacon(45703/a up:boot seq 1 v0) v2
2013-06-13 11:36:42.114222 7f5894e25700 12 mon.a@0(leader).mds e136 prepare_beacon mdsbeacon(45703/a up:boot seq 1 v0) v2 from mds.? 10.255.0.25:6800/10106
2013-06-13 11:36:42.114229 7f5894e25700 10 mon.a@0(leader).mds e136 fail_mds_gid 45404 mds.a rank 0
2013-06-13 11:36:42.114234 7f5894e25700 10 mon.a@0(leader).osd e2809 blacklist 10.255.0.25:6800/11482 until 2013-06-13 12:00:42.114233
2013-06-13 11:36:42.114467 7f5894e25700 10 mon.a@0(leader).osd e2809  min_last_epoch_clean 2796
2013-06-13 11:36:42.114590 7f5894e25700 10 mon.a@0(leader).osd e2809 encode_full osdmap e 2809
2013-06-13 11:36:42.114734 7f5894e25700 10 mon.a@0(leader).osd e2809  min_last_epoch_clean 2796
2013-06-13 11:36:42.114822 7f5894e25700 10 mon.a@0(leader).osd e2809 encode_pending e 2810

Being on a single-monitor setup, this proposal is swiftly committed and OSDMonitor::update_from_paxos() is called. We are then caught in a loop that goes on forever:

2013-06-13 11:36:42.452660 7f5894e25700 15 mon.a@0(leader).osd e2809 update_from_paxos paxos e 2810, my e 2809
2013-06-13 11:36:42.452723 7f5894e25700  7 mon.a@0(leader).osd e2809 update_from_paxos  applying incremental 2810
2013-06-13 11:36:42.452783 7f5894e25700  1 mon.a@0(leader).osd e2809 e2809: 9 osds: 9 up, 9 in
2013-06-13 11:36:42.452808 7f5894e25700  7 mon.a@0(leader).osd e2809 update_from_paxos  applying incremental 2810
2013-06-13 11:36:42.452849 7f5894e25700  1 mon.a@0(leader).osd e2809 e2809: 9 osds: 9 up, 9 in
2013-06-13 11:36:42.452872 7f5894e25700  7 mon.a@0(leader).osd e2809 update_from_paxos  applying incremental 2810
2013-06-13 11:36:42.452913 7f5894e25700  1 mon.a@0(leader).osd e2809 e2809: 9 osds: 9 up, 9 in

From the code we are able to deduce that this will only happen if the in-memory osdmap version is not increased when we apply the incremental. This also impairs the monitor from progressing. Despite having the on-disk osdmap:last_committed set to 2810, and despite constantly applying it to the in-memory map, the OSDMonitor's epoch never reaches epoch 2810, as made clear in the log. Restarting the monitor doesn't help either.

After being restarted, this is what happens:

2013-06-13 16:30:49.291580 7ffe4b2e3780  1 mon.a@-1(probing).osd e2809 e2809: 9 osds: 9 up, 9 in
2013-06-13 16:30:49.291602 7ffe4b2e3780  7 mon.a@-1(probing).osd e2809 update_from_paxos  applying incremental 2810

[restart happens here]

2013-06-13 16:43:30.656501 7f544eae5780  0 ceph version 0.61.3 (92b1e398576d55df8e5888dd1a9545ed3fd99532), process ceph-mon, pid 19554
2013-06-13 16:43:30.848864 7f544eae5780  1 mon.a@-1(probing) e12 preinit fsid d459cc6e-61b4-41e5-8987-242e12fd760e
2013-06-13 16:43:30.875786 7f544eae5780  1 mon.a@-1(probing).osd e2809 e2809: 9 osds: 9 up, 9 in
2013-06-13 16:43:30.875858 7f544eae5780  1 mon.a@-1(probing).osd e2809 e2809: 9 osds: 9 up, 9 in

[continues as such till the end of the log]
Actions

Also available in: Atom PDF