Bug #4784 (closed): Two Monitors Concurrently Reporting as Leaders

Added by Mike Dawson almost 11 years ago. Updated almost 11 years ago.

Status: Closed
Priority: Urgent
Category: Monitor
% Done: 0%
Source: Community (user)
Severity: 2 - major

Description

There appears to be a bug in the new monitor Paxos code in versions 0.59 and 0.60. Over the past several days, I have seen several instances where two of my three monitors form a quorum and the third never rejoins. At different times, each of my monitors has been the one stuck out of quorum.

In the logs below, I show a snippet of the repeating pattern.

State 1: mon.a (leader), mon.b (leader), mon.c (probing)
State 2: mon.a (leader), mon.b (electing), mon.c (electing)
State 3: mon.a (leader), mon.b (leader), mon.c (peon)
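
For reference, each monitor's own view of its state can also be queried directly from its admin socket rather than reconstructed from the logs. A minimal sketch, assuming the default socket path /var/run/ceph/ceph-mon.<id>.asok on each mon host (host and mon names here are illustrative):

root@host:~# ceph --admin-daemon /var/run/ceph/ceph-mon.a.asok mon_status
root@host:~# ceph quorum_status    # cluster-wide view; may hang while no quorum is formed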

mon.a (leader), mon.b (leader), mon.c (probing)
-----------------------------------------------------
root@host:~# pdsh -w node[2,26,49] tail -10 /var/log/ceph/ceph-mon*.log

node2: 2013-04-22 18:34:54.819899 7f92135d7700 10 mon.a@0(leader) e1 ms_verify_authorizer 10.1.0.5:0/1023269 client protocol 0
node2: 2013-04-22 18:34:54.820169 7f92131d3700 10 mon.a@0(leader) e1 ms_verify_authorizer 10.1.0.5:0/2023269 client protocol 0
node2: 2013-04-22 18:34:54.852957 7f9213cde700 1 -- 10.1.0.3:6789/0 >> :/0 pipe(0x2c7b400 sd=30 :6789 s=0 pgs=0 cs=0 l=0).accept sd=30 10.1.0.130:55280/0
node2: 2013-04-22 18:34:54.853111 7f9217c1d700 1 -- 10.1.0.3:6789/0 >> :/0 pipe(0x224d400 sd=44 :6789 s=0 pgs=0 cs=0 l=0).accept sd=44 10.1.0.130:55281/0
node2: 2013-04-22 18:34:54.853231 7f9213cde700 10 mon.a@0(leader) e1 ms_verify_authorizer 10.1.0.130:6803/10416 osd protocol 0
node2: 2013-04-22 18:34:54.853363 7f9217c1d700 10 mon.a@0(leader) e1 ms_verify_authorizer 10.1.0.130:6800/10337 osd protocol 0
node2: 2013-04-22 18:34:54.938310 7f92149eb700 1 -- 10.1.0.3:6789/0 >> :/0 pipe(0x224cf00 sd=47 :6789 s=0 pgs=0 cs=0 l=0).accept sd=47 10.1.0.135:56723/0
node2: 2013-04-22 18:34:54.938563 7f92149eb700 10 mon.a@0(leader) e1 ms_verify_authorizer 10.1.0.135:0/1028470 client protocol 0
node2: 2013-04-22 18:34:55.145707 7f9216c0d700 1 -- 10.1.0.3:6789/0 >> :/0 pipe(0x224db80 sd=43 :6789 s=0 pgs=0 cs=0 l=0).accept sd=43 10.1.0.67:50218/0
node2: 2013-04-22 18:34:55.145881 7f9216c0d700 10 mon.a@0(leader) e1 ms_verify_authorizer 10.1.0.67:0/23463 client protocol 0

node26: 2013-04-22 18:34:54.859384 7fc57e937700 10 mon.b@1(leader).paxosservice(auth) dispatch auth(proto 0 27 bytes epoch 1) v1 from osd.52 10.1.0.133:6803/7061
node26: 2013-04-22 18:34:54.859400 7fc578237700 10 mon.b@1(leader) e1 ms_verify_authorizer 10.1.0.130:0/12365 client protocol 0
node26: 2013-04-22 18:34:54.919734 7fc57e937700 10 mon.b@1(leader).paxosservice(auth) waiting for paxos -> readable (v0)
node26: 2013-04-22 18:34:54.980501 7fc57e937700 1 mon.b@1(leader).paxos(paxos active c 506332..506733) is_readable now=2013-04-22 18:34:54.980508 lease_expire=2013-04-22 18:34:31.503062 has v0 lc 506733
node26: 2013-04-22 18:34:54.980540 7fc57e937700 10 mon.b@1(leader).paxosservice(auth) dispatch auth(proto 0 27 bytes epoch 1) v1 from osd.38 10.1.0.72:6806/25509
node26: 2013-04-22 18:34:55.040660 7fc57e937700 10 mon.b@1(leader).paxosservice(auth) waiting for paxos -> readable (v0)
node26: 2013-04-22 18:34:55.065444 7fc5792c6700 1 -- 10.1.0.67:6789/0 >> :/0 pipe(0x62a4280 sd=23 :6789 s=0 pgs=0 cs=0 l=0).accept sd=23 10.1.0.73:42569/0
node26: 2013-04-22 18:34:55.065687 7fc5792c6700 10 mon.b@1(leader) e1 ms_verify_authorizer 10.1.0.73:6800/13437 osd protocol 0
node26: 2013-04-22 18:34:55.100940 7fc57e937700 1 mon.b@1(leader).paxos(paxos active c 506332..506733) is_readable now=2013-04-22 18:34:55.100948 lease_expire=2013-04-22 18:34:31.503062 has v0 lc 506733
node26: 2013-04-22 18:34:55.100987 7fc57e937700 10 mon.b@1(leader).paxosservice(auth) dispatch auth(proto 0 27 bytes epoch 1) v1 from osd.14 10.1.0.7:6806/2524

node49: 2013-04-22 18:34:54.915341 7fca4c00d700 10 mon.c@2(probing) e1 ms_verify_authorizer 10.1.0.130:0/12421 client protocol 0
node49: 2013-04-22 18:34:54.915819 7fca4edfb700 1 -- 10.1.0.130:6789/0 <== client.? 10.1.0.130:0/12421 1 ==== auth(proto 0 25 bytes epoch 1) v1 ==== 55+0+0 (1901731131 0 0) 0x7e5c000 con 0xc45cc60
node49: 2013-04-22 18:34:54.915841 7fca4edfb700 20 mon.c@2(probing) e1 have connection
node49: 2013-04-22 18:34:54.915848 7fca4edfb700 5 mon.c@2(probing) e1 waitlisting message auth(proto 0 25 bytes epoch 1) v1 until we get in quorum
node49: 2013-04-22 18:34:55.031142 7fca4affd700 1 -- 10.1.0.130:6789/0 >> :/0 pipe(0x3842c80 sd=26 :6789 s=0 pgs=0 cs=0 l=0).accept sd=26 10.1.0.69:35396/0
node49: 2013-04-22 18:34:55.031398 7fca4affd700 10 mon.c@2(probing) e1 ms_verify_authorizer 10.1.0.69:6800/2342 osd protocol 0
node49: 2013-04-22 18:34:55.031923 7fca4edfb700 1 -- 10.1.0.130:6789/0 <== osd.27 10.1.0.69:6800/2342 1 ==== auth(proto 0 27 bytes epoch 1) v1 ==== 57+0+0 (2528349954 0 0) 0x7e5c200 con 0xdbc6f20
node49: 2013-04-22 18:34:55.031952 7fca4edfb700 20 mon.c@2(probing) e1 have connection
node49: 2013-04-22 18:34:55.031961 7fca4edfb700 5 mon.c@2(probing) e1 waitlisting message auth(proto 0 27 bytes epoch 1) v1 until we get in quorum
node49: 2013-04-22 18:34:55.181864 7fca4edfb700 10 mon.c@2(probing) e1 ms_handle_reset 0x3930dc0 10.1.0.133:6800/6980

mon.a (leader), mon.b (electing), mon.c (electing)
-----------------------------------------------------
root@host:~# pdsh -w node[2,26,49] tail -10 /var/log/ceph/ceph-mon*.log
node2: 2013-04-22 18:39:55.461146 7f92157f9700 10 mon.a@0(leader) e1 ms_verify_authorizer 10.1.0.5:0/2023269 client protocol 0
node2: 2013-04-22 18:39:55.461261 7f9216708700 10 mon.a@0(leader) e1 ms_verify_authorizer 10.1.0.5:0/1023269 client protocol 0
node2: 2013-04-22 18:39:55.563808 7f9217516700 1 -- 10.1.0.3:6789/0 >> :/0 pipe(0x224db80 sd=26 :6789 s=0 pgs=0 cs=0 l=0).accept sd=26 10.1.0.70:57673/0
node2: 2013-04-22 18:39:55.564166 7f9217516700 10 mon.a@0(leader) e1 ms_verify_authorizer 10.1.0.70:6806/24784 osd protocol 0
node2: 2013-04-22 18:39:55.565958 7f92139db700 1 -- 10.1.0.3:6789/0 >> :/0 pipe(0x224d900 sd=57 :6789 s=0 pgs=0 cs=0 l=0).accept sd=57 10.1.0.136:56270/0
node2: 2013-04-22 18:39:55.566215 7f92139db700 10 mon.a@0(leader) e1 ms_verify_authorizer 10.1.0.136:6800/24343 osd protocol 0
node2: 2013-04-22 18:39:55.600330 7f9213fe1700 1 -- 10.1.0.3:6789/0 >> :/0 pipe(0x224cf00 sd=60 :6789 s=0 pgs=0 cs=0 l=0).accept sd=60 10.1.0.130:56402/0
node2: 2013-04-22 18:39:55.600571 7f9213fe1700 10 mon.a@0(leader) e1 ms_verify_authorizer 10.1.0.130:0/11127 client protocol 0
node2: 2013-04-22 18:39:55.674463 7f9213ddf700 1 -- 10.1.0.3:6789/0 >> :/0 pipe(0x224d400 sd=61 :6789 s=0 pgs=0 cs=0 l=0).accept sd=61 10.1.0.8:37867/0
node2: 2013-04-22 18:39:55.674705 7f9213ddf700 10 mon.a@0(leader) e1 ms_verify_authorizer 10.1.0.8:6800/16946 osd protocol 0

node26: 2013-04-22 18:39:55.476086 7fc57e937700 5 mon.b@1(electing) e1 waitlisting message auth(proto 0 26 bytes epoch 1) v1 until we get in quorum
node26: 2013-04-22 18:39:55.479530 7fc571465700 1 -- 10.1.0.67:6789/0 >> :/0 pipe(0x1c871400 sd=55 :6789 s=0 pgs=0 cs=0 l=0).accept sd=55 10.1.0.131:52462/0
node26: 2013-04-22 18:39:55.479764 7fc571465700 10 mon.b@1(electing) e1 ms_verify_authorizer 10.1.0.131:6806/8491 osd protocol 0
node26: 2013-04-22 18:39:55.480258 7fc57e937700 1 -- 10.1.0.67:6789/0 <== osd.47 10.1.0.131:6806/8491 1 ==== auth(proto 0 27 bytes epoch 1) v1 ==== 57+0+0 (367315178 0 0) 0x33ae8200 con 0x334c1b80
node26: 2013-04-22 18:39:55.480283 7fc57e937700 20 mon.b@1(electing) e1 have connection
node26: 2013-04-22 18:39:55.480289 7fc57e937700 5 mon.b@1(electing) e1 waitlisting message auth(proto 0 27 bytes epoch 1) v1 until we get in quorum
node26: 2013-04-22 18:39:55.493848 7fc57e937700 10 mon.b@1(electing) e1 ms_handle_reset 0x334b7760 10.1.0.3:6803/13323
node26: 2013-04-22 18:39:55.569804 7fc57e937700 10 mon.b@1(electing) e1 ms_handle_reset 0x334b78c0 10.1.0.70:6806/24784
node26: 2013-04-22 18:39:55.606314 7fc57e937700 10 mon.b@1(electing) e1 ms_handle_reset 0x334b74a0 10.1.0.130:0/11127
node26: 2013-04-22 18:39:55.754041 7fc57e937700 10 mon.b@1(electing) e1 ms_handle_reset 0x2e76c840 10.1.0.3:6800/13243

node49: 2013-04-22 18:39:55.750414 7fca4b1ff700 10 mon.c@2(electing) e1 ms_verify_authorizer 10.1.0.3:6800/13243 osd protocol 0
node49: 2013-04-22 18:39:55.750760 7fca4edfb700 1 -- 10.1.0.130:6789/0 <== osd.0 10.1.0.3:6800/13243 1 ==== auth(proto 0 26 bytes epoch 1) v1 ==== 56+0+0 (2724431805 0 0) 0x7bd7c00 con 0xc36e6e0
node49: 2013-04-22 18:39:55.750786 7fca4edfb700 20 mon.c@2(electing) e1 have connection
node49: 2013-04-22 18:39:55.750792 7fca4edfb700 5 mon.c@2(electing) e1 waitlisting message auth(proto 0 26 bytes epoch 1) v1 until we get in quorum
node49: 2013-04-22 18:39:55.760823 7fca44077700 1 -- 10.1.0.130:6789/0 >> :/0 pipe(0x2e70c80 sd=32 :6789 s=0 pgs=0 cs=0 l=0).accept sd=32 10.1.0.71:44484/0
node49: 2013-04-22 18:39:55.761082 7fca44077700 10 mon.c@2(electing) e1 ms_verify_authorizer 10.1.0.71:6800/21789 osd protocol 0
node49: 2013-04-22 18:39:55.761556 7fca4edfb700 1 -- 10.1.0.130:6789/0 <== osd.33 10.1.0.71:6800/21789 1 ==== auth(proto 0 27 bytes epoch 1) v1 ==== 57+0+0 (124648215 0 0) 0x291de00 con 0xc36e580
node49: 2013-04-22 18:39:55.761581 7fca4edfb700 20 mon.c@2(electing) e1 have connection
node49: 2013-04-22 18:39:55.761587 7fca4edfb700 5 mon.c@2(electing) e1 waitlisting message auth(proto 0 27 bytes epoch 1) v1 until we get in quorum
node49: 2013-04-22 18:39:55.786654 7fca4edfb700 10 mon.c@2(electing) e1 ms_handle_reset 0xc36fe40 10.1.0.71:6806/21953

mon.a (leader), mon.b (leader), mon.c (peon)
-----------------------------------------------------
root@host:~# pdsh -w node[2,26,49] tail -10 /var/log/ceph/ceph-mon*.log

node2: 2013-04-22 18:40:06.958254 7f9212bcd700 1 -- 10.1.0.3:6789/0 >> :/0 pipe(0x32f9400 sd=29 :6789 s=0 pgs=0 cs=0 l=0).accept sd=29 10.1.0.69:56128/0
node2: 2013-04-22 18:40:06.958529 7f9212bcd700 10 mon.a@0(leader) e1 ms_verify_authorizer 10.1.0.69:6806/2526 osd protocol 0
node2: 2013-04-22 18:40:07.096353 7f9216809700 1 -- 10.1.0.3:6789/0 >> :/0 pipe(0x32f9900 sd=30 :6789 s=0 pgs=0 cs=0 l=0).accept sd=30 10.1.0.69:56129/0
node2: 2013-04-22 18:40:07.096649 7f9216809700 10 mon.a@0(leader) e1 ms_verify_authorizer 10.1.0.69:6800/2342 osd protocol 0
node2: 2013-04-22 18:40:07.310774 7f92134d6700 1 -- 10.1.0.3:6789/0 >> :/0 pipe(0x32f8a00 sd=31 :6789 s=0 pgs=0 cs=0 l=0).accept sd=31 10.1.0.134:32801/0
node2: 2013-04-22 18:40:07.310956 7f92134d6700 10 mon.a@0(leader) e1 ms_verify_authorizer 10.1.0.134:6800/1630 osd protocol 0
node2: 2013-04-22 18:40:07.424355 7f9217b1c700 1 -- 10.1.0.3:6789/0 >> :/0 pipe(0x32f8000 sd=34 :6789 s=0 pgs=0 cs=0 l=0).accept sd=34 10.1.0.134:32803/0
node2: 2013-04-22 18:40:07.424603 7f9217b1c700 10 mon.a@0(leader) e1 ms_verify_authorizer 10.1.0.134:6803/1712 osd protocol 0
node2: 2013-04-22 18:40:07.458204 7f9215cfe700 1 -- 10.1.0.3:6789/0 >> :/0 pipe(0x32f9680 sd=21 :6789 s=0 pgs=0 cs=0 l=0).accept sd=21 10.1.0.5:59294/0
node2: 2013-04-22 18:40:07.458401 7f9215cfe700 10 mon.a@0(leader) e1 ms_verify_authorizer 10.1.0.5:6800/10642 osd protocol 0

node26: 2013-04-22 18:40:07.262310 7fc57e937700 1 mon.b@1(leader).paxos(paxos active c 506332..506740) is_readable now=2013-04-22 18:40:07.262311 lease_expire=2013-04-22 18:40:07.189911 has v0 lc 506740
node26: 2013-04-22 18:40:07.262319 7fc57e937700 10 mon.b@1(leader).paxosservice(osdmap) dispatch osd_alive(want up_thru 1011 have 1012) v1 from osd.46 10.1.0.131:6803/8398
node26: 2013-04-22 18:40:07.301676 7fc572058700 1 -- 10.1.0.67:6789/0 >> :/0 pipe(0x29c8fb80 sd=27 :6789 s=0 pgs=0 cs=0 l=0).accept sd=27 10.1.0.68:49534/0
node26: 2013-04-22 18:40:07.301932 7fc572058700 10 mon.b@1(leader) e1 ms_verify_authorizer 10.1.0.68:6800/16059 osd protocol 0
node26: 2013-04-22 18:40:07.308522 7fc57e937700 10 mon.b@1(leader).paxosservice(osdmap) waiting for paxos -> readable (v1012)
node26: 2013-04-22 18:40:07.355180 7fc57e937700 1 mon.b@1(leader).paxos(paxos active c 506332..506740) is_readable now=2013-04-22 18:40:07.355188 lease_expire=2013-04-22 18:40:07.189911 has v0 lc 506740
node26: 2013-04-22 18:40:07.355223 7fc57e937700 10 mon.b@1(leader).paxosservice(osdmap) dispatch osd_alive(want up_thru 1011 have 1012) v1 from osd.49 10.1.0.132:6803/17340
node26: 2013-04-22 18:40:07.401330 7fc57e937700 10 mon.b@1(leader).paxosservice(osdmap) waiting for paxos -> readable (v1012)
node26: 2013-04-22 18:40:07.447476 7fc57e937700 1 mon.b@1(leader).paxos(paxos active c 506332..506740) is_readable now=2013-04-22 18:40:07.447484 lease_expire=2013-04-22 18:40:07.189911 has v0 lc 506740
node26: 2013-04-22 18:40:07.447513 7fc57e937700 10 mon.b@1(leader).paxosservice(osdmap) dispatch osd_alive(want up_thru 1011 have 1012) v1 from osd.9 10.1.0.6:6806/17276

node49: 2013-04-22 18:40:07.466119 7fca4edfb700 20 mon.c@2(peon) e1 have connection
node49: 2013-04-22 18:40:07.466125 7fca4edfb700 20 mon.c@2(peon) e1 ms_dispatch existing session MonSession: osd.53 10.1.0.133:6806/7166 is openallow rwx for osd.53 10.1.0.133:6806/7166
node49: 2013-04-22 18:40:07.466135 7fca4edfb700 20 mon.c@2(peon) e1 caps allow rwx
node49: 2013-04-22 18:40:07.466138 7fca4edfb700 10 mon.c@2(peon) e1 handle_subscribe mon_subscribe({monmap=2+,osd_pg_creates=0,osdmap=1013}) v2
node49: 2013-04-22 18:40:07.466145 7fca4edfb700 10 mon.c@2(peon) e1 check_sub monmap next 2 have 1
node49: 2013-04-22 18:40:07.466154 7fca4edfb700 5 mon.c@2(peon).osd e1013 send_incremental [1013..1013] to osd.53 10.1.0.133:6806/7166
node49: 2013-04-22 18:40:07.466208 7fca4edfb700 10 mon.c@2(peon).osd e1013 build_incremental [1013..1013]
node49: 2013-04-22 18:40:07.466265 7fca4edfb700 20 mon.c@2(peon).osd e1013 build_incremental inc 1013 164 bytes
node49: 2013-04-22 18:40:07.466271 7fca4edfb700 1 -- 10.1.0.130:6789/0 --> osd.53 10.1.0.133:6806/7166 -- osd_map(1013..1013 src has 512..1013) v3 -- ?+0 0xe016e00
node49: 2013-04-22 18:40:07.466296 7fca4edfb700 1 -- 10.1.0.130:6789/0 --> osd.53 10.1.0.133:6806/7166 -- mon_subscribe_ack(300s) v1 -- ?+0 0x7668000

Actions #1

Updated by Ian Colle almost 11 years ago

  • Assignee set to Greg Farnum
  • Priority changed from High to Urgent
Actions #2

Updated by Greg Farnum almost 11 years ago

  • Status changed from New to Need More Info

Got Kevin on irc and am waiting for logs of when this first happens, if possible. I'll go review some of the election code as well; I don't remember what our specific invariants are around leases and leaders (on first glance I'm not sure the two leader thing is a problem except insofar as the monitors clearly aren't communicating with each other).

Actions #3

Updated by Greg Farnum almost 11 years ago

  • Status changed from Need More Info to In Progress

I've got more digging to do to verify my diagnosis against the listed times, but so far what I'm seeing looks like the leader getting weighed down by large pgmap commits and piled-up incoming messages; then somebody else takes over, and the original leader takes a while to get back to where it can talk to anybody. I don't think the snipped logs above are from the same election sequence, and these issues should be resolved in the current "next".

But I'll check a bit more.
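
A rough way to gauge how heavy the pgmap commits and the monitor's backing store have become is something like the following (a sketch; the store.db path assumes the default mon data directory /var/lib/ceph/mon/ceph-<id>, so adjust for your layout):

root@host:~# ceph pg stat                               # current pgmap version and PG counts
root@host:~# du -sh /var/lib/ceph/mon/ceph-a/store.db   # size of mon.a's leveldb store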

Actions #4

Updated by Greg Farnum almost 11 years ago

  • Status changed from In Progress to Closed

Yeah, leveldb on mon.a went totally out to lunch — it tried to pass through a transaction and never finished, as best I can tell:

2013-04-22 14:00:24.889877 7f9218e21700 10 mon.a@0(leader).elector(634) bump_epoch 634 to 635

There should have been follow-on elector output there, but that thread never reappears in the log afterward. The messenger threads kept running but couldn't accomplish anything, and because the dispatch loop holds the big monitor lock, nothing could actually kick the monitor out of its leader state.
mon.b and mon.c seem to be doing what they're supposed to, luckily. I assume that running "ceph -s" had mon.a marked down or something? A number of things that made this more likely are fixed in the current next branch (leveldb tuning changes, turning off or down the debug output of proposals, which is taking up tons of space here, etc.), so I think we're safe for Cuttlefish.
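
One way to confirm this kind of wedge on a live monitor is to dump all thread backtraces from the running ceph-mon and look for a thread stuck inside leveldb while the dispatch thread waits on the monitor lock. A minimal sketch, assuming gdb and debug symbols are installed and a single ceph-mon per host:

root@host:~# gdb -batch -ex 'thread apply all bt' -p $(pidof ceph-mon) > /tmp/ceph-mon-threads.txt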

Actions #5

Updated by Mike Dawson almost 11 years ago

Greg, during this state ceph -s hangs for longer than I have waited (several minutes). All RBD volumes are stalled/unreachable as well.

Actions #6

Updated by Greg Farnum almost 11 years ago

Oh, yeah. Looks like there were a bunch of backed up messages, and the second leader was having as much trouble with leveldb as the first one was. You've got a larger pgmap than most of our test clusters do and so it doesn't fit inside the default config as nicely; Cuttlefish has more generous leveldb defaults and should be good to go.
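
For reference, the more generous leveldb behavior is controlled through mon config options, so a ceph.conf fragment along these lines could be used to experiment before upgrading. The option names and values below are my assumptions about the Cuttlefish-era settings; verify them against that release's config reference:

[mon]
    mon leveldb cache size = 536870912          ; larger block cache for the mon store
    mon leveldb write buffer size = 33554432    ; larger write buffer before compaction
    debug paxos = 0/0                           ; keep proposal debug output quiet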
