Bug #4793

mon/Monitor.cc: 1126: FAILED assert(!(sync_role & SYNC_ROLE_REQUESTER))

Added by Mike Dawson almost 11 years ago. Updated almost 11 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: Monitor
Target version:
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

While attempting to sync a new or lagging monitor on 0.60, I have seen mon/Monitor.cc: 1126: FAILED assert(!(sync_role & SYNC_ROLE_REQUESTER)). Another user (matt_ on IRC) has confirmed hitting this assert while having paxos quorum issues on 0.60.

root@node2:/var/log/ceph# tail -300 ceph-mon.a.log.1
  -141> 2013-04-20 17:35:04.307290 7f905a33d700 10 mon.a@0(probing sync( leader state none )) e1 ms_handle_reset 0x9c4f5340 10.1.0.8:0/1004379
  -140> 2013-04-20 17:35:04.307294 7f905a33d700 10 mon.a@0(probing sync( leader state none )) e1 ms_handle_reset 0x9c4f58c0 10.1.0.7:6806/16865
  -139> 2013-04-20 17:35:04.307298 7f905a33d700 10 mon.a@0(probing sync( leader state none )) e1 ms_handle_reset 0x9194a160 10.1.0.71:6806/32767
  -138> 2013-04-20 17:35:04.307301 7f905a33d700 10 mon.a@0(probing sync( leader state none )) e1 ms_handle_reset 0x9194ac60 10.1.0.67:0/1255
  -137> 2013-04-20 17:35:04.307305 7f905a33d700 10 mon.a@0(probing sync( leader state none )) e1 ms_handle_reset 0x9194af20 10.1.0.67:0/1271
  -136> 2013-04-20 17:35:04.307309 7f905a33d700 10 mon.a@0(probing sync( leader state none )) e1 ms_handle_reset 0x9194ba20 10.1.0.130:0/15508
  -135> 2013-04-20 17:35:04.307314 7f905a33d700 10 mon.a@0(probing sync( leader state none )) e1 ms_handle_reset 0xa6f4e000 10.1.0.67:0/1283
  -134> 2013-04-20 17:35:04.307319 7f905a33d700 10 mon.a@0(probing sync( leader state none )) e1 ms_handle_reset 0xa6f4e580 10.1.0.135:0/1028470
  -133> 2013-04-20 17:35:04.307323 7f905a33d700 10 mon.a@0(probing sync( leader state none )) e1 ms_handle_reset 0xa6f4eb00 10.1.0.135:0/2028470
  -132> 2013-04-20 17:35:04.307327 7f905a33d700 10 mon.a@0(probing sync( leader state none )) e1 ms_handle_reset 0xa6f4f340 10.1.0.8:6803/24818
  -131> 2013-04-20 17:35:04.307330 7f905a33d700 10 mon.a@0(probing sync( leader state none )) e1 ms_handle_reset 0xa6f4f8c0 10.1.0.130:0/15518
  -130> 2013-04-20 17:35:04.307334 7f905a33d700 10 mon.a@0(probing sync( leader state none )) e1 ms_handle_reset 0x9de8a6e0 10.1.0.6:6803/30723
  -129> 2013-04-20 17:35:04.307338 7f905a33d700 10 mon.a@0(probing sync( leader state none )) e1 ms_handle_reset 0x9de8a160 10.1.0.137:6803/5123
  -128> 2013-04-20 17:35:04.307342 7f905a33d700 10 mon.a@0(probing sync( leader state none )) e1 ms_handle_reset 0x9de8ac60 10.1.0.130:0/15528
  -127> 2013-04-20 17:35:04.307346 7f905a33d700 10 mon.a@0(probing sync( leader state none )) e1 ms_handle_reset 0x9de8b4a0 10.1.0.68:6806/30592
  -126> 2013-04-20 17:35:04.307351 7f905a33d700 10 mon.a@0(probing sync( leader state none )) e1 ms_handle_reset 0x91ba82c0 10.1.0.3:0/23812
  -125> 2013-04-20 17:35:04.307357 7f905a33d700 10 mon.a@0(probing sync( leader state none )) e1 ms_handle_reset 0x91ba8840 10.1.0.134:6806/4204
  -124> 2013-04-20 17:35:04.307361 7f905a33d700 10 mon.a@0(probing sync( leader state none )) e1 ms_handle_reset 0x91ba8dc0 10.1.0.4:6803/29678
  -123> 2013-04-20 17:35:04.307366 7f905a33d700 10 mon.a@0(probing sync( leader state none )) e1 ms_handle_reset 0x91ba9340 10.1.0.9:6806/17004
  -122> 2013-04-20 17:35:04.307372 7f905a33d700 10 mon.a@0(probing sync( leader state none )) e1 ms_handle_reset 0x91ba9b80 10.1.0.6:6806/31726
  -121> 2013-04-20 17:35:04.307375 7f905a33d700 10 mon.a@0(probing sync( leader state none )) e1 ms_handle_reset 0x91ba9e40 10.1.0.68:6803/30491
  -120> 2013-04-20 17:35:04.307379 7f905a33d700 10 mon.a@0(probing sync( leader state none )) e1 ms_handle_reset 0xa6f4ac60 10.1.0.69:6803/8122
  -119> 2013-04-20 17:35:04.307384 7f905a33d700 10 mon.a@0(probing sync( leader state none )) e1 ms_handle_reset 0xa6f4af20 10.1.0.130:6800/5691
  -118> 2013-04-20 17:35:04.307388 7f905a33d700 10 mon.a@0(probing sync( leader state none )) e1 ms_handle_reset 0xa6f4b1e0 10.1.0.4:6806/29925
  -117> 2013-04-20 17:35:04.307392 7f905a33d700 10 mon.a@0(probing sync( leader state none )) e1 ms_handle_reset 0xa6f4b760 10.1.0.6:6800/30645
  -116> 2013-04-20 17:35:04.307397 7f905a33d700 10 mon.a@0(probing sync( leader state none )) e1 ms_handle_reset 0x9d6a62c0 10.1.0.4:6800/29454
  -115> 2013-04-20 17:35:04.307401 7f905a33d700 10 mon.a@0(probing sync( leader state none )) e1 ms_handle_reset 0x9d6a6580 10.1.0.136:6806/16387
  -114> 2013-04-20 17:35:04.307405 7f905a33d700 10 mon.a@0(probing sync( leader state none )) e1 ms_handle_reset 0x9d6a6840 10.1.0.70:6806/17398
  -113> 2013-04-20 17:35:04.307411 7f905a33d700 10 mon.a@0(probing sync( leader state none )) e1 ms_handle_reset 0xa6f61a20 10.1.0.9:6800/16811
  -112> 2013-04-20 17:35:04.307415 7f905a33d700 10 mon.a@0(probing sync( leader state none )) e1 ms_handle_reset 0x9ab86000 10.1.0.72:6803/5685
  -111> 2013-04-20 17:35:04.307420 7f905a33d700 10 mon.a@0(probing sync( leader state none )) e1 ms_handle_reset 0x9ab86b00 10.1.0.3:6803/3002
  -110> 2013-04-20 17:35:04.307424 7f905a33d700 10 mon.a@0(probing sync( leader state none )) e1 ms_handle_reset 0x9ab87b80 10.1.0.5:6800/14631
  -109> 2013-04-20 17:35:04.307428 7f905a33d700 10 mon.a@0(probing sync( leader state none )) e1 ms_handle_reset 0x97e0c420 10.1.0.8:6800/24740
  -108> 2013-04-20 17:35:04.307432 7f905a33d700 10 mon.a@0(probing sync( leader state none )) e1 ms_handle_reset 0x97e0cf20 10.1.0.130:0/15439
  -107> 2013-04-20 17:35:04.307436 7f905a33d700 10 mon.a@0(probing sync( leader state none )) e1 ms_handle_reset 0x97e0d4a0 10.1.0.69:6806/8200
  -106> 2013-04-20 17:35:04.307441 7f905a33d700 10 mon.a@0(probing sync( leader state none )) e1 ms_handle_reset 0x97e0da20 10.1.0.70:6803/17296
  -105> 2013-04-20 17:35:04.307445 7f905a33d700 10 mon.a@0(probing sync( leader state none )) e1 ms_handle_reset 0xa6f5a2c0 10.1.0.72:6800/5606
  -104> 2013-04-20 17:35:04.307450 7f905a33d700 10 mon.a@0(probing sync( leader state none )) e1 ms_handle_reset 0xa6f5a840 10.1.0.135:6800/15336
  -103> 2013-04-20 17:35:04.307455 7f905a33d700  1 -- 10.1.0.3:6789/0 <== mon.1 10.1.0.67:6789/0 393371 ==== mon_health( service 1 op tell e 0 r 0 flags ) v1 ==== 62+0+0 (3701682562 0 0) 0x40543c0 con 0x2a83080
  -102> 2013-04-20 17:35:04.307463 7f905a33d700 20 mon.a@0(probing sync( leader state none )) e1 have connection
  -101> 2013-04-20 17:35:04.307465 7f905a33d700 20 mon.a@0(probing sync( leader state none )) e1 ms_dispatch existing session MonSession: mon.1 10.1.0.67:6789/0 is openallow * for mon.1 10.1.0.67:6789/0
  -100> 2013-04-20 17:35:04.307470 7f905a33d700 20 mon.a@0(probing sync( leader state none )) e1  caps allow *
   -99> 2013-04-20 17:35:04.307474 7f905a33d700 10 mon.a@0(probing sync( leader state none )).data_health(596) service_dispatch mon_health( service 1 op tell e 0 r 0 flags ) v1
   -98> 2013-04-20 17:35:04.307477 7f905a33d700  1 mon.a@0(probing sync( leader state none )).data_health(596) service_dispatch not in quorum -- drop message
   -97> 2013-04-20 17:35:04.307481 7f905a33d700  1 -- 10.1.0.3:6789/0 <== mon.2 10.1.0.130:6789/0 1645580482 ==== mon_probe(reply 2f2730c5-0504-4433-ae0b-331dd41d99a4 name c quorum 0,1,2 paxos( fc 429002 lc 429022 )) v4 ==== 567+0+0 (2281760352 0 0) 0x282a600 con 0x2a82dc0
   -96> 2013-04-20 17:35:04.307492 7f905a33d700 20 mon.a@0(probing sync( leader state none )) e1 have connection
   -95> 2013-04-20 17:35:04.307494 7f905a33d700 20 mon.a@0(probing sync( leader state none )) e1 ms_dispatch existing session MonSession: mon.2 10.1.0.130:6789/0 is openallow * for mon.2 10.1.0.130:6789/0
   -94> 2013-04-20 17:35:04.307499 7f905a33d700 20 mon.a@0(probing sync( leader state none )) e1  caps allow *
   -93> 2013-04-20 17:35:04.307502 7f905a33d700 10 mon.a@0(probing sync( leader state none )) e1 handle_probe mon_probe(reply 2f2730c5-0504-4433-ae0b-331dd41d99a4 name c quorum 0,1,2 paxos( fc 429002 lc 429022 )) v4
   -92> 2013-04-20 17:35:04.307509 7f905a33d700 10 mon.a@0(probing sync( leader state none )) e1 handle_probe_reply mon.2 10.1.0.130:6789/0mon_probe(reply 2f2730c5-0504-4433-ae0b-331dd41d99a4 name c quorum 0,1,2 paxos( fc 429002 lc 429022 )) v4
   -91> 2013-04-20 17:35:04.307515 7f905a33d700 10 mon.a@0(probing sync( leader state none )) e1  monmap is e1: 3 mons at {a=10.1.0.3:6789/0,b=10.1.0.67:6789/0,c=10.1.0.130:6789/0}
   -90> 2013-04-20 17:35:04.307531 7f905a33d700 10 mon.a@0(probing sync( leader state none )) e1  peer name is c
   -89> 2013-04-20 17:35:04.307535 7f905a33d700 10 mon.a@0(probing sync( leader state none )) e1  existing quorum 0,1,2
   -88> 2013-04-20 17:35:04.307537 7f905a33d700 10 mon.a@0(probing sync( leader state none )) e1  peer paxos version 429022 vs my version 428984 (too far ahead)
   -87> 2013-04-20 17:35:04.307541 7f905a33d700 10 mon.a@0(probing sync( leader state none )) e1 cancel_probe_timeout 0xa6ad1650
   -86> 2013-04-20 17:35:04.307544 7f905a33d700 10 mon.a@0(probing sync( leader state none )) e1 sync_start entity( mon.2 10.1.0.130:6789/0 )
   -85> 2013-04-20 17:35:04.307548 7f905a33d700 10 mon.a@0(probing sync( leader state none )) e1 sync_start we are acting as a leader to someone; destroy their dreams
   -84> 2013-04-20 17:35:04.307552 7f905a33d700 10 mon.a@0(probing sync( leader state none )) e1 reset_sync
   -83> 2013-04-20 17:35:04.386956 7f90559bf700  2 -- 10.1.0.3:6789/0 >> 10.1.0.135:6806/15524 pipe(0x1b4c9b80 sd=43 :6789 s=2 pgs=2621 cs=1 l=1).reader couldn't read tag, Success
   -82> 2013-04-20 17:35:04.387041 7f90559bf700  2 -- 10.1.0.3:6789/0 >> 10.1.0.135:6806/15524 pipe(0x1b4c9b80 sd=43 :6789 s=2 pgs=2621 cs=1 l=1).fault 0: Success
   -81> 2013-04-20 17:35:04.395992 7f90559bf700  1 -- 10.1.0.3:6789/0 >> :/0 pipe(0x1b4c8c80 sd=34 :6789 s=0 pgs=0 cs=0 l=0).accept sd=34 10.1.0.132:39671/0
   -80> 2013-04-20 17:35:04.396254 7f90559bf700 10 mon.a@0(synchronizing sync( requester state start )) e1 ms_verify_authorizer 10.1.0.132:6803/31996 osd protocol 0
   -79> 2013-04-20 17:35:04.396759 7f904d31b700  2 -- 10.1.0.3:6789/0 >> 10.1.0.67:0/1224 pipe(0x26ff680 sd=44 :6789 s=2 pgs=95 cs=1 l=1).reader couldn't read tag, Success
   -78> 2013-04-20 17:35:04.396790 7f904d31b700  2 -- 10.1.0.3:6789/0 >> 10.1.0.67:0/1224 pipe(0x26ff680 sd=44 :6789 s=2 pgs=95 cs=1 l=1).fault 0: Success
   -77> 2013-04-20 17:35:04.459945 7f9051793700  2 -- 10.1.0.3:6789/0 >> 10.1.0.135:6803/15419 pipe(0x6d709680 sd=24 :6789 s=2 pgs=2744 cs=1 l=1).reader couldn't read tag, Success
   -76> 2013-04-20 17:35:04.460035 7f9051793700  2 -- 10.1.0.3:6789/0 >> 10.1.0.135:6803/15419 pipe(0x6d709680 sd=24 :6789 s=2 pgs=2744 cs=1 l=1).fault 0: Success
   -75> 2013-04-20 17:35:04.472004 7f9051793700  1 -- 10.1.0.3:6789/0 >> :/0 pipe(0x1b4c9400 sd=24 :6789 s=0 pgs=0 cs=0 l=0).accept sd=24 10.1.0.134:42894/0
   -74> 2013-04-20 17:35:04.472324 7f9051793700 10 mon.a@0(synchronizing sync( requester state start )) e1 ms_verify_authorizer 10.1.0.134:6803/4103 osd protocol 0
   -73> 2013-04-20 17:35:04.506702 7f904d31b700  1 -- 10.1.0.3:6789/0 >> :/0 pipe(0x1b4c8a00 sd=36 :6789 s=0 pgs=0 cs=0 l=0).accept sd=36 10.1.0.133:51138/0
   -72> 2013-04-20 17:35:04.506983 7f904d31b700 10 mon.a@0(synchronizing sync( requester state start )) e1 ms_verify_authorizer 10.1.0.133:6800/18670 osd protocol 0
   -71> 2013-04-20 17:35:04.514085 7f905271d700  1 -- 10.1.0.3:6789/0 >> :/0 pipe(0x1b4c8000 sd=37 :6789 s=0 pgs=0 cs=0 l=0).accept sd=37 10.1.0.137:40819/0
   -70> 2013-04-20 17:35:04.514316 7f905271d700 10 mon.a@0(synchronizing sync( requester state start )) e1 ms_verify_authorizer 10.1.0.137:6800/5044 osd protocol 0
   -69> 2013-04-20 17:35:04.535680 7f9050884700  2 -- 10.1.0.3:6789/0 >> 10.1.0.7:6800/16705 pipe(0x26ff900 sd=22 :6789 s=2 pgs=2372 cs=1 l=1).reader couldn't read tag, Success
   -68> 2013-04-20 17:35:04.535766 7f9050884700  2 -- 10.1.0.3:6789/0 >> 10.1.0.7:6800/16705 pipe(0x26ff900 sd=22 :6789 s=2 pgs=2372 cs=1 l=1).fault 0: Success
   -67> 2013-04-20 17:35:04.672830 7f9051591700  2 -- 10.1.0.3:6789/0 >> 10.1.0.67:6806/23784 pipe(0x26fea00 sd=23 :6789 s=2 pgs=2362 cs=1 l=1).reader couldn't read tag, Success
   -66> 2013-04-20 17:35:04.672884 7f9051591700  2 -- 10.1.0.3:6789/0 >> 10.1.0.67:6806/23784 pipe(0x26fea00 sd=23 :6789 s=2 pgs=2362 cs=1 l=1).fault 0: Success
   -65> 2013-04-20 17:35:05.071029 7f9054331700  2 -- 10.1.0.3:6789/0 >> 10.1.0.70:6800/17218 pipe(0x6d709b80 sd=5 :6789 s=2 pgs=2533 cs=1 l=1).reader couldn't read tag, Success
   -64> 2013-04-20 17:35:05.071120 7f9054331700  2 -- 10.1.0.3:6789/0 >> 10.1.0.70:6800/17218 pipe(0x6d709b80 sd=5 :6789 s=2 pgs=2533 cs=1 l=1).fault 0: Success
   -63> 2013-04-20 17:35:05.190050 7f9054331700  1 -- 10.1.0.3:6789/0 >> :/0 pipe(0x1b4c9900 sd=5 :6789 s=0 pgs=0 cs=0 l=0).accept sd=5 10.1.0.132:39672/0
   -62> 2013-04-20 17:35:05.190327 7f9054331700 10 mon.a@0(synchronizing sync( requester state start )) e1 ms_verify_authorizer 10.1.0.132:6800/31916 osd protocol 0
   -61> 2013-04-20 17:35:05.374483 7f9051591700  1 -- 10.1.0.3:6789/0 >> :/0 pipe(0x1b4c8f00 sd=22 :6789 s=0 pgs=0 cs=0 l=0).accept sd=22 10.1.0.72:51607/0
   -60> 2013-04-20 17:35:05.374750 7f9051591700 10 mon.a@0(synchronizing sync( requester state start )) e1 ms_verify_authorizer 10.1.0.72:6806/5787 osd protocol 0
   -59> 2013-04-20 17:35:05.473245 7f904f8b0700  2 -- 10.1.0.3:6789/0 >> 10.1.0.133:6806/18842 pipe(0x4f59b80 sd=25 :6789 s=2 pgs=2451 cs=1 l=1).reader couldn't read tag, Success
   -58> 2013-04-20 17:35:05.473297 7f904f8b0700  2 -- 10.1.0.3:6789/0 >> 10.1.0.133:6806/18842 pipe(0x4f59b80 sd=25 :6789 s=2 pgs=2451 cs=1 l=1).fault 0: Success
   -57> 2013-04-20 17:35:05.531613 7f905399f700  2 -- 10.1.0.3:6789/0 >> 10.1.0.133:6803/18749 pipe(0x4f58a00 sd=26 :6789 s=2 pgs=2616 cs=1 l=1).reader couldn't read tag, Success
   -56> 2013-04-20 17:35:05.531669 7f905399f700  2 -- 10.1.0.3:6789/0 >> 10.1.0.133:6803/18749 pipe(0x4f58a00 sd=26 :6789 s=2 pgs=2616 cs=1 l=1).fault 0: Success
   -55> 2013-04-20 17:35:05.558114 7f905399f700  1 -- 10.1.0.3:6789/0 >> :/0 pipe(0x1b4c9680 sd=23 :6789 s=0 pgs=0 cs=0 l=0).accept sd=23 10.1.0.4:39717/0
   -54> 2013-04-20 17:35:05.558439 7f905399f700 10 mon.a@0(synchronizing sync( requester state start )) e1 ms_verify_authorizer 10.1.0.4:6806/29925 osd protocol 0
   -53> 2013-04-20 17:35:05.622034 7f904f8b0700  1 -- 10.1.0.3:6789/0 >> :/0 pipe(0x1b4c9180 sd=25 :6789 s=0 pgs=0 cs=0 l=0).accept sd=25 10.1.0.7:41548/0
   -52> 2013-04-20 17:35:05.622306 7f904f8b0700 10 mon.a@0(synchronizing sync( requester state start )) e1 ms_verify_authorizer 10.1.0.7:6803/16784 osd protocol 0
   -51> 2013-04-20 17:35:05.744325 7f9054533700  2 -- 10.1.0.3:6789/0 >> 10.1.0.8:6806/24896 pipe(0x4f58f00 sd=28 :6789 s=2 pgs=2613 cs=1 l=1).reader couldn't read tag, Success
   -50> 2013-04-20 17:35:05.744473 7f9054533700  2 -- 10.1.0.3:6789/0 >> 10.1.0.8:6806/24896 pipe(0x4f58f00 sd=28 :6789 s=2 pgs=2613 cs=1 l=1).fault 0: Success
   -49> 2013-04-20 17:35:05.770280 7f905de48700  2 -- 10.1.0.3:6789/0 >> 10.1.0.9:6803/16890 pipe(0x8e42c280 sd=32 :6789 s=2 pgs=2419 cs=1 l=1).reader couldn't read tag, Success
   -48> 2013-04-20 17:35:05.770330 7f905de48700  2 -- 10.1.0.3:6789/0 >> 10.1.0.9:6803/16890 pipe(0x8e42c280 sd=32 :6789 s=2 pgs=2419 cs=1 l=1).fault 0: Success
   -47> 2013-04-20 17:35:05.894104 7f9054735700  2 -- 10.1.0.3:6789/0 >> 10.1.0.8:0/1004379 pipe(0x8e42c780 sd=33 :6789 s=2 pgs=5402 cs=1 l=1).reader couldn't read tag, Success
   -46> 2013-04-20 17:35:05.894156 7f9054735700  2 -- 10.1.0.3:6789/0 >> 10.1.0.8:0/1004379 pipe(0x8e42c780 sd=33 :6789 s=2 pgs=5402 cs=1 l=1).fault 0: Success
   -45> 2013-04-20 17:35:05.932337 7f9054735700  1 -- 10.1.0.3:6789/0 >> :/0 pipe(0x1b4c8280 sd=26 :6789 s=0 pgs=0 cs=0 l=0).accept sd=26 10.1.0.7:41549/0
   -44> 2013-04-20 17:35:05.932591 7f9054735700 10 mon.a@0(synchronizing sync( requester state start )) e1 ms_verify_authorizer 10.1.0.7:6806/16865 osd protocol 0
   -43> 2013-04-20 17:35:06.039307 7f905a33d700  1 -- 10.1.0.3:6789/0 --> mon.2 10.1.0.130:6789/0 -- mon_sync( start ) v1 -- ?+0 0x5674dc0
   -42> 2013-04-20 17:35:06.039372 7f905a33d700 10 mon.a@0(synchronizing sync( requester state start )) e1 ms_handle_reset 0xa6f5b340 10.1.0.135:6806/15524
   -41> 2013-04-20 17:35:06.039396 7f905a33d700 10 mon.a@0(synchronizing sync( requester state start )) e1 ms_handle_reset 0xa6f5b600 10.1.0.67:0/1224
   -40> 2013-04-20 17:35:06.039404 7f905a33d700 10 mon.a@0(synchronizing sync( requester state start )) e1 ms_handle_reset 0xa0c35b80 10.1.0.135:6803/15419
   -39> 2013-04-20 17:35:06.039413 7f905a33d700 10 mon.a@0(synchronizing sync( requester state start )) e1 ms_handle_reset 0xa6f591e0 10.1.0.7:6800/16705
   -38> 2013-04-20 17:35:06.039420 7f905a33d700 10 mon.a@0(synchronizing sync( requester state start )) e1 ms_handle_reset 0xa6f594a0 10.1.0.67:6806/23784
   -37> 2013-04-20 17:35:06.039426 7f905a33d700 10 mon.a@0(synchronizing sync( requester state start )) e1 ms_handle_reset 0xa0c35600 10.1.0.70:6800/17218
   -36> 2013-04-20 17:35:06.039433 7f905a33d700 10 mon.a@0(synchronizing sync( requester state start )) e1 ms_handle_reset 0x4fb58160 10.1.0.133:6806/18842
   -35> 2013-04-20 17:35:06.039444 7f905a33d700 10 mon.a@0(synchronizing sync( requester state start )) e1 ms_handle_reset 0x4fb589a0 10.1.0.133:6803/18749
   -34> 2013-04-20 17:35:06.039452 7f905a33d700 10 mon.a@0(synchronizing sync( requester state start )) e1 ms_handle_reset 0x4fb58c60 10.1.0.8:6806/24896
   -33> 2013-04-20 17:35:06.039460 7f905a33d700 10 mon.a@0(synchronizing sync( requester state start )) e1 ms_handle_reset 0x4fb594a0 10.1.0.9:6803/16890
   -32> 2013-04-20 17:35:06.039466 7f905a33d700 10 mon.a@0(synchronizing sync( requester state start )) e1 ms_handle_reset 0x4fb59ce0 10.1.0.8:0/1004379
   -31> 2013-04-20 17:35:06.039473 7f905a33d700  1 -- 10.1.0.3:6789/0 <== mon.0 10.1.0.3:6789/0 0 ==== mon_sync( start_reply ) v1 ==== 0+0+0 (0 0 0) 0x13f68840 con 0x2652420
   -30> 2013-04-20 17:35:06.039480 7f905a33d700 20 mon.a@0(synchronizing sync( requester state start )) e1 have connection
   -29> 2013-04-20 17:35:06.039483 7f905a33d700 20 mon.a@0(synchronizing sync( requester state start )) e1 ms_dispatch existing session MonSession: mon.0 10.1.0.3:6789/0 is openallow * for mon.0 10.1.0.3:6789/0
   -28> 2013-04-20 17:35:06.039490 7f905a33d700 20 mon.a@0(synchronizing sync( requester state start )) e1  caps allow *
   -27> 2013-04-20 17:35:06.039492 7f905a33d700 10 mon.a@0(synchronizing sync( requester state start )) e1 handle_sync mon_sync( start_reply ) v1
   -26> 2013-04-20 17:35:06.039497 7f905a33d700 10 mon.a@0(synchronizing sync( requester state start )) e1 handle_sync_start_reply mon_sync( start_reply ) v1
   -25> 2013-04-20 17:35:06.039517 7f905a33d700 10 mon.a@0(synchronizing sync( requester state start )) e1 handle_sync_start_reply synchronizing from leader at mon.0 10.1.0.3:6789/0
   -24> 2013-04-20 17:35:06.039523 7f905a33d700 10 mon.a@0(synchronizing sync( requester state start )) e1 sync_send_heartbeat mon.0 10.1.0.3:6789/0 reply(0)
   -23> 2013-04-20 17:35:06.039528 7f905a33d700  1 -- 10.1.0.3:6789/0 --> mon.0 10.1.0.3:6789/0 -- mon_sync( heartbeat ) v1 -- ?+0 0x5202b00
   -22> 2013-04-20 17:35:06.039536 7f905a33d700 10 mon.a@0(synchronizing sync( requester state start )) e1 sync_start_chunks provider(mon.0 10.1.0.3:6789/0)
   -21> 2013-04-20 17:35:06.039543 7f905a33d700  1 -- 10.1.0.3:6789/0 --> mon.0 10.1.0.3:6789/0 -- mon_sync( start_chunks ) v1 -- ?+0 0x1b561340
   -20> 2013-04-20 17:35:06.039550 7f905a33d700  1 -- 10.1.0.3:6789/0 <== mon.1 10.1.0.67:6789/0 393372 ==== mon_probe(reply 2f2730c5-0504-4433-ae0b-331dd41d99a4 name b quorum 0,1,2 paxos( fc 429002 lc 429022 )) v4 ==== 567+0+0 (92848010 0 0) 0x282a900 con 0x2a83080
   -19> 2013-04-20 17:35:06.039562 7f905a33d700 20 mon.a@0(synchronizing sync( requester state chunks )) e1 have connection
   -18> 2013-04-20 17:35:06.039565 7f905a33d700 20 mon.a@0(synchronizing sync( requester state chunks )) e1 ms_dispatch existing session MonSession: mon.1 10.1.0.67:6789/0 is openallow * for mon.1 10.1.0.67:6789/0
   -17> 2013-04-20 17:35:06.039570 7f905a33d700 20 mon.a@0(synchronizing sync( requester state chunks )) e1  caps allow *
   -16> 2013-04-20 17:35:06.039572 7f905a33d700 10 mon.a@0(synchronizing sync( requester state chunks )) e1 handle_probe mon_probe(reply 2f2730c5-0504-4433-ae0b-331dd41d99a4 name b quorum 0,1,2 paxos( fc 429002 lc 429022 )) v4
   -15> 2013-04-20 17:35:06.039579 7f905a33d700 10 mon.a@0(synchronizing sync( requester state chunks )) e1 handle_probe_reply mon.1 10.1.0.67:6789/0mon_probe(reply 2f2730c5-0504-4433-ae0b-331dd41d99a4 name b quorum 0,1,2 paxos( fc 429002 lc 429022 )) v4
   -14> 2013-04-20 17:35:06.039585 7f905a33d700 10 mon.a@0(synchronizing sync( requester state chunks )) e1  monmap is e1: 3 mons at {a=10.1.0.3:6789/0,b=10.1.0.67:6789/0,c=10.1.0.130:6789/0}
   -13> 2013-04-20 17:35:06.039594 7f905a33d700  1 -- 10.1.0.3:6789/0 <== mon.0 10.1.0.3:6789/0 0 ==== mon_sync( heartbeat ) v1 ==== 0+0+0 (0 0 0) 0x5202b00 con 0x2652420
   -12> 2013-04-20 17:35:06.039600 7f905a33d700 20 mon.a@0(synchronizing sync( requester state chunks )) e1 have connection
   -11> 2013-04-20 17:35:06.039602 7f905a33d700 20 mon.a@0(synchronizing sync( requester state chunks )) e1 ms_dispatch existing session MonSession: mon.0 10.1.0.3:6789/0 is openallow * for mon.0 10.1.0.3:6789/0
   -10> 2013-04-20 17:35:06.039607 7f905a33d700 20 mon.a@0(synchronizing sync( requester state chunks )) e1  caps allow *
    -9> 2013-04-20 17:35:06.039609 7f905a33d700 10 mon.a@0(synchronizing sync( requester state chunks )) e1 handle_sync mon_sync( heartbeat ) v1
    -8> 2013-04-20 17:35:06.039614 7f905a33d700 10 mon.a@0(synchronizing sync( requester state chunks )) e1 handle_sync_heartbeat mon_sync( heartbeat ) v1
    -7> 2013-04-20 17:35:06.039617 7f905a33d700  1 mon.a@0(synchronizing sync( requester state chunks )) e1 handle_sync_heartbeat ignored stray message mon_sync( heartbeat ) v1
    -6> 2013-04-20 17:35:06.039621 7f905a33d700  1 -- 10.1.0.3:6789/0 <== mon.0 10.1.0.3:6789/0 0 ==== mon_sync( start_chunks ) v1 ==== 0+0+0 (0 0 0) 0x1b561340 con 0x2652420
    -5> 2013-04-20 17:35:06.039626 7f905a33d700 20 mon.a@0(synchronizing sync( requester state chunks )) e1 have connection
    -4> 2013-04-20 17:35:06.039628 7f905a33d700 20 mon.a@0(synchronizing sync( requester state chunks )) e1 ms_dispatch existing session MonSession: mon.0 10.1.0.3:6789/0 is openallow * for mon.0 10.1.0.3:6789/0
    -3> 2013-04-20 17:35:06.039634 7f905a33d700 20 mon.a@0(synchronizing sync( requester state chunks )) e1  caps allow *
    -2> 2013-04-20 17:35:06.039639 7f905a33d700 10 mon.a@0(synchronizing sync( requester state chunks )) e1 handle_sync mon_sync( start_chunks ) v1
    -1> 2013-04-20 17:35:06.039643 7f905a33d700 10 mon.a@0(synchronizing sync( requester state chunks )) e1 handle_sync_start_chunks mon_sync( start_chunks ) v1
     0> 2013-04-20 17:35:06.040463 7f905a33d700 -1 mon/Monitor.cc: In function 'void Monitor::handle_sync_start_chunks(MMonSync*)' thread 7f905a33d700 time 2013-04-20 17:35:06.039649
mon/Monitor.cc: 1126: FAILED assert(!(sync_role & SYNC_ROLE_REQUESTER))

 ceph version 0.60 (f26f7a39021dbf440c28d6375222e21c94fe8e5c)
 1: (Monitor::handle_sync_start_chunks(MMonSync*)+0x58e) [0x4b52fe]
 2: (Monitor::handle_sync(MMonSync*)+0x28b) [0x4b652b]
 3: (Monitor::_ms_dispatch(Message*)+0xc41) [0x4c1f21]
 4: (Monitor::ms_dispatch(Message*)+0x32) [0x4dcc12]
 5: (DispatchQueue::entry()+0x35b) [0x693c0b]
 6: (DispatchQueue::DispatchThread::entry()+0xd) [0x627d2d]
 7: (()+0x7f8e) [0x7f9061318f8e]
 8: (clone()+0x6d) [0x7f905ff5ee1d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
  20/20 mon
   0/10 monc
   0/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 hadoop
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-mon.a.log
--- end dump of recent events ---
2013-04-20 17:35:06.108980 7f905a33d700 -1 *** Caught signal (Aborted) **
 in thread 7f905a33d700

 ceph version 0.60 (f26f7a39021dbf440c28d6375222e21c94fe8e5c)
 1: /usr/bin/ceph-mon() [0x5869b0]
 2: (()+0xfbd0) [0x7f9061320bd0]
 3: (gsignal()+0x37) [0x7f905fe9c037]
 4: (abort()+0x148) [0x7f905fe9f698]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f90607a8e8d]
 6: (()+0x5ef76) [0x7f90607a6f76]
 7: (()+0x5efa3) [0x7f90607a6fa3]
 8: (()+0x5f1de) [0x7f90607a71de]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x43d) [0x6378cd]
 10: (Monitor::handle_sync_start_chunks(MMonSync*)+0x58e) [0x4b52fe]
 11: (Monitor::handle_sync(MMonSync*)+0x28b) [0x4b652b]
 12: (Monitor::_ms_dispatch(Message*)+0xc41) [0x4c1f21]
 13: (Monitor::ms_dispatch(Message*)+0x32) [0x4dcc12]
 14: (DispatchQueue::entry()+0x35b) [0x693c0b]
 15: (DispatchQueue::DispatchThread::entry()+0xd) [0x627d2d]
 16: (()+0x7f8e) [0x7f9061318f8e]
 17: (clone()+0x6d) [0x7f905ff5ee1d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
    -4> 2013-04-20 17:35:06.060714 7f9053f2d700  2 -- 10.1.0.3:6789/0 >> 10.1.0.67:0/1255 pipe(0x8e42d900 sd=45 :6789 s=2 pgs=95 cs=1 l=1).reader couldn't read tag, Success
    -3> 2013-04-20 17:35:06.060742 7f9053f2d700  2 -- 10.1.0.3:6789/0 >> 10.1.0.67:0/1255 pipe(0x8e42d900 sd=45 :6789 s=2 pgs=95 cs=1 l=1).fault 0: Success
    -2> 2013-04-20 17:35:06.078338 7f9057056700  2 -- 10.1.0.3:6789/0 >> 10.1.0.67:0/1271 pipe(0x8e42d680 sd=46 :6789 s=2 pgs=95 cs=1 l=1).reader couldn't read tag, Success
    -1> 2013-04-20 17:35:06.078368 7f9057056700  2 -- 10.1.0.3:6789/0 >> 10.1.0.67:0/1271 pipe(0x8e42d680 sd=46 :6789 s=2 pgs=95 cs=1 l=1).fault 0: Success
     0> 2013-04-20 17:35:06.108980 7f905a33d700 -1 *** Caught signal (Aborted) **
 in thread 7f905a33d700

 ceph version 0.60 (f26f7a39021dbf440c28d6375222e21c94fe8e5c)
 1: /usr/bin/ceph-mon() [0x5869b0]
 2: (()+0xfbd0) [0x7f9061320bd0]
 3: (gsignal()+0x37) [0x7f905fe9c037]
 4: (abort()+0x148) [0x7f905fe9f698]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f90607a8e8d]
 6: (()+0x5ef76) [0x7f90607a6f76]
 7: (()+0x5efa3) [0x7f90607a6fa3]
 8: (()+0x5f1de) [0x7f90607a71de]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x43d) [0x6378cd]
 10: (Monitor::handle_sync_start_chunks(MMonSync*)+0x58e) [0x4b52fe]
 11: (Monitor::handle_sync(MMonSync*)+0x28b) [0x4b652b]
 12: (Monitor::_ms_dispatch(Message*)+0xc41) [0x4c1f21]
 13: (Monitor::ms_dispatch(Message*)+0x32) [0x4dcc12]
 14: (DispatchQueue::entry()+0x35b) [0x693c0b]
 15: (DispatchQueue::DispatchThread::entry()+0xd) [0x627d2d]
 16: (()+0x7f8e) [0x7f9061318f8e]
 17: (clone()+0x6d) [0x7f905ff5ee1d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
  20/20 mon
   0/10 monc
   0/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 hadoop
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-mon.a.log
--- end dump of recent events ---
root@node2:/var/log/ceph# 

History

#1 Updated by Joao Eduardo Luis almost 11 years ago

  • Project changed from devops to Ceph
  • Category set to Monitor
  • Target version set to v0.61 - Cuttlefish

#2 Updated by Joao Eduardo Luis almost 11 years ago

  • Description updated (diff)

#3 Updated by Greg Farnum almost 11 years ago

  • Assignee set to Greg Farnum

No idea what's going on here, but I'm going to start looking into it.

#4 Updated by Sage Weil almost 11 years ago

ubuntu@teuthology:/var/lib/teuthworker/archive/teuthology-2013-04-23_11:03:00-rados-next-testing-basic/126

with logs

#5 Updated by Greg Farnum almost 11 years ago

Okay, this appears to be happening because the elected leader is too far behind, so it starts syncing, and the state machine seems to have a hole in that case. I'm seeing a lot of bounced messages between mon.a and mon.c, which it eventually gets past somehow before hitting this assert.

Joao suggests the right way to handle this is to prevent an out-of-quorum monitor from being elected leader. If that's feasible, or if we can make the monitors deterministically agree based on committed versions instead of IPs, that's probably the best solution. These are all bigger changes than I'd like to merge four days before a major version release, though...

#6 Updated by Joao Eduardo Luis almost 11 years ago

Huh, I forgot to mention that there is one case in which an out-of-quorum monitor must be elected the (sync) leader in order for another monitor to be able to sync: when we add a new monitor to a one-monitor cluster.

At that point, the only monitor able to provide the sync is out of quorum, and that's why we tolerate choosing an out-of-quorum monitor as the leader. Maybe a simple check for monmap.size() == 2 would be enough in this case? The code should guarantee that only the monitor with the lowest version attempts to sync from the other.

Does this make sense to anyone else besides me?
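
For illustration only, the check described above might look roughly like this. It is a sketch, not the actual Ceph code, and monmap_size, my_last_committed, and peer_last_committed are hypothetical names:

#include <cassert>
#include <cstdint>

// Sketch of the special case: in a 2-monitor map the sync provider is
// necessarily out of quorum, so that situation is only tolerated there,
// and only the monitor with the lower committed version may request a sync.
static bool may_sync_from_out_of_quorum_peer(uint32_t monmap_size,
                                             uint64_t my_last_committed,
                                             uint64_t peer_last_committed)
{
  if (monmap_size != 2)
    return false;                                  // special case only applies here
  return my_last_committed < peer_last_committed;  // lowest version syncs
}

int main()
{
  // A brand-new monitor (version 0) joining a one-monitor cluster may sync.
  assert(may_sync_from_out_of_quorum_peer(2, 0, 429022));
  // The up-to-date monitor must never request a sync from the new one.
  assert(!may_sync_from_out_of_quorum_peer(2, 429022, 0));
  return 0;
}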

#7 Updated by Greg Farnum almost 11 years ago

  • Status changed from New to In Progress

Okay, discussed this a little more and there was some confusion about whether we were discussing the cluster leader or the sync leader. :)
The sync code just wasn't designed with the thought that the cluster leader might be the one doing the sync, so we need to either prevent that state from occurring or modify the sync code. I'm looking at adding paxos commit versions to the election messages so a monitor just won't vote for anybody who's too far in the past.

#8 Updated by Greg Farnum almost 11 years ago

  • Status changed from In Progress to 12

Actually, we can't just send along the versions because then the voters need global state in order to respond to each individual message, which is a bigger Elector class/algorithm change than I want to be making right now. (I had initially thought we could do it on an individual message basis, but in fact that could lead to two old monitors and one new monitor electing an old monitor as leader and then syncing to the new monitor.)

So I'm moving on to explore whether we can either fix up the sync code or have a monitor that needs to be synced abdicate in the next election.

#9 Updated by Sage Weil almost 11 years ago

One thought: before the election they do the probe step... maybe a simple flag in the election that says "I think I'm eligible to be leader", based on what is learned (about self) during the probe, would be sufficient? If they turn out to be wrong and get elected, they can call a new election and not flag themselves the next time around.
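
As a very rough sketch of that idea, with hypothetical types and names rather than the real election message:

#include <cstdint>
#include <iostream>

// Hypothetical election proposal carrying the suggested eligibility flag.
struct ElectionProposal {
  uint64_t epoch;
  int      rank;
  bool     eligible_for_leader;  // based on what we learned about ourselves during probe
};

// A monitor that discovered during probing that it is too far behind marks
// itself ineligible; if it guessed wrong and gets elected anyway, it can call
// a new election and leave the flag unset the next time around.
static ElectionProposal make_proposal(uint64_t epoch, int rank, bool behind_on_paxos)
{
  return ElectionProposal{epoch, rank, !behind_on_paxos};
}

int main()
{
  ElectionProposal p = make_proposal(597, 0, /*behind_on_paxos=*/true);
  std::cout << "eligible: " << std::boolalpha << p.eligible_for_leader << "\n";
  return 0;
}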

#10 Updated by Greg Farnum almost 11 years ago

Ah, that would make self-abdication a more palatable solution indeed.

#11 Updated by Greg Farnum almost 11 years ago

  • Status changed from 12 to In Progress

Okay, so mon.a lost quorum leadership for about 13 seconds without noticing; it looks like it got stuck waiting for paxos_readable on the osdmap, hopefully just because a bunch of map changes had built up? (And possibly those forwarding multiplications that Sage fixed yesterday.)
Then it finally processes the election message that made mon.b leader 13 seconds earlier and responds; in response, they all go into the next election epoch, which results in mon.a becoming the leader. But meanwhile things have kept happening, and mon.a is now too far behind.

Normally this is handled by the probing phase: if a monitor is too far behind (as measured by max_paxos_join_drift), it does a sync instead of trying to join via an election. But here the monitor is processing an election request from somebody else, so it never goes through probing, because the last it heard it was in quorum! Considering now...
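
For reference, the probing-phase rule just described boils down to something like the following sketch (hypothetical names, not the actual probe code; the drift value stands in for the join-drift tunable mentioned above):

#include <cassert>
#include <cstdint>

// Sketch of the probing-phase decision: if a peer's committed paxos version
// is more than the allowed join drift ahead of ours, we must sync rather
// than try to join the quorum through an election.
static bool too_far_behind(uint64_t my_version,
                           uint64_t peer_version,
                           uint64_t max_paxos_join_drift)
{
  return peer_version > my_version + max_paxos_join_drift;
}

int main()
{
  // Values from the log above: the peer at 429022 vs our 428984 with a small
  // drift allowance means we go to sync instead of an election.
  assert(too_far_behind(428984, 429022, 10));
  assert(!too_far_behind(428984, 428990, 10));
  return 0;
}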

#12 Updated by Greg Farnum almost 11 years ago

Okay, given that setup, the only way to be participating in an election while too far behind is if we've been alive but fallen behind without noticing. So I think we can just add paxos versions to the election messages and use them on the receiving end to decide whether to defer to the other monitor. We can count on the peer to have already checked for itself whether it's behind, so if it's in the right election cycle it is up to date; we only need to make sure we believe we dominate it if we should be elected in the first place.
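
A minimal sketch of that deferral rule follows, with hypothetical names rather than the actual Elector code: each election message would carry the sender's last committed paxos version, and the receiver defers only to a peer whose version (or, on a tie, rank) dominates its own.

#include <cassert>
#include <cstdint>

// Hypothetical view of a peer as seen in an election message.
struct ElectionPeer {
  int      rank;            // lower rank normally wins the election
  uint64_t last_committed;  // paxos version added to the election message
};

// Sketch of the receiving-end check: defer to the proposer only if it
// dominates us -- newer committed data wins, and rank breaks ties.
static bool should_defer_to(const ElectionPeer& me, const ElectionPeer& peer)
{
  if (peer.last_committed != me.last_committed)
    return peer.last_committed > me.last_committed;
  return peer.rank < me.rank;
}

int main()
{
  ElectionPeer me{0, 428984};    // lowest rank, but stale (as in this bug)
  ElectionPeer peer{1, 429022};  // higher rank, but up to date
  assert(should_defer_to(me, peer));   // the stale monitor defers despite its rank
  assert(!should_defer_to(peer, me));
  return 0;
}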

#13 Updated by Greg Farnum almost 11 years ago

  • Status changed from In Progress to Fix Under Review

wip-4793. Waiting for it to build so I can kick off some teuthology tests.

#14 Updated by Greg Farnum almost 11 years ago

  • Status changed from Fix Under Review to 7

Sage reviewed this; unfortunately, schedule_suite and teuthology failed somehow, so the tests didn't actually run. :(

#15 Updated by Greg Farnum almost 11 years ago

  • Status changed from 7 to Resolved

It passed, and Sage merged it into next as commit 42ab1f4561cde4c724849c41a7929c93d89e89d9.
