Bug #49572

closed

MON_DOWN: mon.c fails to join quorum after un-blacklisting mon.a

Added by Sage Weil about 3 years ago. Updated about 3 years ago.

Status:
Duplicate
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/sage-2021-03-01_20:24:37-rados-wip-sage-testing-2021-03-01-1118-distro-basic-smithi/5924612

it looks like the sequence is something like

- mon.b is leader
- mon.a is un-blacklisted
- election is called
- each time mon.a sends its election propose to mon.c, mon.c is still in probing state.
- mon.c calls an election, then receives the propose from mon.b
- mon.a forms a quorum with a,b
- 5 seconds later mon.c triggers a new election
repeat?

a sample from mon.c:

2021-03-02T03:09:17.300+0000 7fc9f5cbb700  1 -- v1:172.21.15.78:6791/0 <== mon.0 v1:172.21.15.78:6789/0 3041 ==== election(11f0bdd5-5d76-4cbd-8331-0c5ee3ec9149 propose rel 17 e31) v9 ==== 932+0+0 (unknown 4188255721 0 0) 0x5580f05e73c0 con 0x5580ef556800
2021-03-02T03:09:17.300+0000 7fc9f5cbb700 20 mon.c@2(probing) e7 _ms_dispatch existing session 0x5580ef4a5600 for mon.0
2021-03-02T03:09:17.300+0000 7fc9f5cbb700 20 mon.c@2(probing) e7  entity  caps allow *
2021-03-02T03:09:17.300+0000 7fc9f5cbb700 20 is_capable service=mon command= read addr v1:172.21.15.78:6789/0 on cap allow *
2021-03-02T03:09:17.300+0000 7fc9f5cbb700 20  allow so far , doing grant allow *
2021-03-02T03:09:17.300+0000 7fc9f5cbb700 20  allow all
2021-03-02T03:09:17.300+0000 7fc9f5cbb700 20 is_capable service=mon command= exec addr v1:172.21.15.78:6789/0 on cap allow *
2021-03-02T03:09:17.300+0000 7fc9f5cbb700 20  allow so far , doing grant allow *
2021-03-02T03:09:17.300+0000 7fc9f5cbb700 20  allow all
...
2021-03-02T03:09:17.300+0000 7fc9f5cbb700  1 -- v1:172.21.15.78:6791/0 <== mon.0 v1:172.21.15.78:6789/0 3042 ==== mon_probe(reply 11f0bdd5-5d76-4cbd-8331-0c5ee3ec9149 name a paxos( fc 1 lc 628 ) mon_release quincy) v7 ==== 363+0+0 (unknown 2710756693 0 0) 0x5580f111fc00 con 0x5580ef556800
2021-03-02T03:09:17.300+0000 7fc9f4cb9700 20 Putting signature in client message(seq # 725): sig = 8974824467410395911
2021-03-02T03:09:17.300+0000 7fc9f5cbb700 20 mon.c@2(probing) e7 _ms_dispatch existing session 0x5580ef4a5600 for mon.0
2021-03-02T03:09:17.300+0000 7fc9f5cbb700 20 mon.c@2(probing) e7  entity  caps allow *
2021-03-02T03:09:17.300+0000 7fc9f5cbb700 20 is_capable service=mon command= read addr v1:172.21.15.78:6789/0 on cap allow *
2021-03-02T03:09:17.300+0000 7fc9f5cbb700 20  allow so far , doing grant allow *
2021-03-02T03:09:17.300+0000 7fc9f5cbb700 20  allow all
2021-03-02T03:09:17.300+0000 7fc9f5cbb700 10 mon.c@2(probing) e7 handle_probe mon_probe(reply 11f0bdd5-5d76-4cbd-8331-0c5ee3ec9149 name a paxos( fc 1 lc 628 ) mon_release quincy) v7
2021-03-02T03:09:17.300+0000 7fc9f5cbb700 10 mon.c@2(probing) e7 handle_probe_reply mon.0 v1:172.21.15.78:6789/0 mon_probe(reply 11f0bdd5-5d76-4cbd-8331-0c5ee3ec9149 name a paxos( fc 1 lc 628 ) mon_release quincy) v7
2021-03-02T03:09:17.300+0000 7fc9f5cbb700 10 mon.c@2(probing) e7  monmap is e7: 3 mons at {a=v1:172.21.15.78:6789/0,b=v1:172.21.15.78:6790/0,c=v1:172.21.15.78:6791/0}
2021-03-02T03:09:17.300+0000 7fc9f5cbb700 10 mon.c@2(probing) e7  peer name is a
2021-03-02T03:09:17.300+0000 7fc9f5cbb700 10 mon.c@2(probing) e7  mon.a is outside the quorum
2021-03-02T03:09:17.300+0000 7fc9f5cbb700 10 mon.c@2(probing) e7  outside_quorum now a,c, need 2
2021-03-02T03:09:17.300+0000 7fc9f5cbb700 10 mon.c@2(probing) e7  that's enough to form a new quorum, calling election
2021-03-02T03:09:17.300+0000 7fc9f5cbb700 10 mon.c@2(probing) e7 start_election
...
2021-03-02T03:09:17.300+0000 7fc9f5cbb700  0 log_channel(cluster) log [INF] : mon.c calling monitor election
...
2021-03-02T03:09:17.300+0000 7fc9f5cbb700  1 -- v1:172.21.15.78:6791/0 <== mon.1 v1:172.21.15.78:6790/0 736 ==== election(11f0bdd5-5d76-4cbd-8331-0c5ee3ec9149 propose rel 17 e31) v9 ==== 932+0+0 (unknown 927397702 0 0) 0x5580f064bf80 con 0x5580ef555c00
2021-03-02T03:09:17.300+0000 7fc9f5cbb700 20 mon.c@2(electing) e7 _ms_dispatch existing session 0x5580ef4a53c0 for mon.1
2021-03-02T03:09:17.300+0000 7fc9f5cbb700 20 mon.c@2(electing) e7  entity  caps allow *
2021-03-02T03:09:17.300+0000 7fc9f5cbb700 20 is_capable service=mon command= read addr v1:172.21.15.78:6790/0 on cap allow *
2021-03-02T03:09:17.300+0000 7fc9f5cbb700 20  allow so far , doing grant allow *
2021-03-02T03:09:17.300+0000 7fc9f5cbb700 20  allow all
2021-03-02T03:09:17.300+0000 7fc9f5cbb700 20 is_capable service=mon command= exec addr v1:172.21.15.78:6790/0 on cap allow *
2021-03-02T03:09:17.300+0000 7fc9f5cbb700 20  allow so far , doing grant allow *
2021-03-02T03:09:17.300+0000 7fc9f5cbb700 20  allow all
2021-03-02T03:09:17.300+0000 7fc9f5cbb700  5 mon.c@2(electing).elector(31) handle_propose from mon.1
2021-03-02T03:09:17.300+0000 7fc9f5cbb700 10 mon.c@2(electing).elector(31) handle_propose required features 2449958755906961412 mon_feature_t([kraken,luminous,mimic,osdmap-prune,nautilus,octopus,pacific,elector-pinging,quincy]), peer features 4540138303579357183 mon_feature_t([kraken,luminous,mimic,osdmap-prune,nautilus,octopus,pacific,elector-pinging,quincy])
2021-03-02T03:09:17.300+0000 7fc9f5cbb700 10 paxos.2).electionLogic(31) propose from rank=1,score=2; my score=2; currently acked -1,score=-1
2021-03-02T03:09:17.300+0000 7fc9f5cbb700  5 paxos.2).electionLogic(31) defer to 1, disallowed_leaders=
...
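To make the probing -> electing flip in a sample like this easier to spot, here is a minimal sketch that pulls monitor state changes out of log lines. It assumes the `mon.<name>@<rank>(<state>)` prefix seen in the excerpt above; the regex and the `state_transitions` helper are illustrative only and are not part of Ceph.

```python
import re

# Assumed line layout (taken from the excerpt above):
#   <timestamp> <thread id> <log level> mon.<name>@<rank>(<state>) ...
MON_STATE_RE = re.compile(r"^(\S+)\s+\S+\s+\d+\s+mon\.(\w+)@(-?\d+)\((\w+)\)")


def state_transitions(lines):
    """Yield (timestamp, mon, old_state, new_state) whenever a mon's state changes."""
    last = {}  # mon name -> most recently seen state
    for line in lines:
        m = MON_STATE_RE.match(line)
        if not m:
            continue  # messenger lines, cap checks, "..." markers, etc.
        ts, mon, _rank, state = m.groups()
        prev = last.get(mon)
        if prev is not None and prev != state:
            yield ts, mon, prev, state
        last[mon] = state


# Two lines copied from the mon.c sample above.
sample = [
    "2021-03-02T03:09:17.300+0000 7fc9f5cbb700 10 mon.c@2(probing) e7 start_election",
    "2021-03-02T03:09:17.300+0000 7fc9f5cbb700 20 mon.c@2(electing) e7 "
    "_ms_dispatch existing session 0x5580ef4a53c0 for mon.1",
]
transitions = list(state_transitions(sample))
for ts, mon, old, new in transitions:
    print(f"{ts} mon.{mon}: {old} -> {new}")
```

Run against the full mon.c log, a repeating probing/electing pattern every few seconds would be consistent with the "repeat?" step in the sequence above.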


Related issues 1 (0 open, 1 closed)

Is duplicate of RADOS - Bug #47654: test_mon_pg: mon fails to join quorum to due election strategy mismatch (Resolved, Greg Farnum)

Actions #1

Updated by Neha Ojha about 3 years ago

  • Status changed from New to Duplicate

This is the same as https://tracker.ceph.com/issues/47654

2021-03-02T03:09:22.696+0000 7fb8f131e700  1 -- v1:172.21.15.78:6790/0 <== mon.2 v1:172.21.15.78:6791/0 742 ==== election(11f0bdd5-5d76-4cbd-8331-0c5ee3ec9149 propose rel 17 e33) v9 ==== 932+0+0 (unknown 2884026088 0 0) 0x5579d5505200 con 0x5579d2609c00
2021-03-02T03:09:22.696+0000 7fb8f131e700 20 mon.b@1(electing) e8 _ms_dispatch existing session 0x5579d2559180 for mon.2
2021-03-02T03:09:22.696+0000 7fb8f131e700 20 mon.b@1(electing) e8  entity  caps allow *
2021-03-02T03:09:22.696+0000 7fb8f131e700 20 is_capable service=mon command= read addr v1:172.21.15.78:6791/0 on cap allow *
2021-03-02T03:09:22.696+0000 7fb8f131e700 20  allow so far , doing grant allow *
2021-03-02T03:09:22.696+0000 7fb8f131e700 20  allow all
2021-03-02T03:09:22.696+0000 7fb8f131e700 20 is_capable service=mon command= exec addr v1:172.21.15.78:6791/0 on cap allow *
2021-03-02T03:09:22.696+0000 7fb8f131e700 20  allow so far , doing grant allow *
2021-03-02T03:09:22.696+0000 7fb8f131e700 20  allow all
2021-03-02T03:09:22.696+0000 7fb8f131e700  5 mon.b@1(electing).elector(33) dispatch somehow got an Election message with different strategy ^C from local 1; dropping for now to let race resolve
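
A quick way to check whether another run hits the same thing is to count these dropped proposes per monitor. The sketch below assumes the message wording matches the mon.b line above exactly; `count_strategy_mismatches` and the regex are hypothetical helpers, not something shipped with Ceph.

```python
import re
from collections import Counter

# Assumed wording, copied from the mon.b log line above.
MISMATCH_RE = re.compile(
    r"mon\.(\w+)@\d+\(\w+\)\.elector\((\d+)\) dispatch somehow got an Election "
    r"message with different strategy (.+?) from local (\S+);"
)


def count_strategy_mismatches(lines):
    """Return a Counter of dropped election messages keyed by the mon that dropped them."""
    drops = Counter()
    for line in lines:
        m = MISMATCH_RE.search(line)
        if m:
            drops[m.group(1)] += 1
    return drops


# The line from mon.b quoted above.
sample = [
    "2021-03-02T03:09:22.696+0000 7fb8f131e700  5 mon.b@1(electing).elector(33) "
    "dispatch somehow got an Election message with different strategy ^C from "
    "local 1; dropping for now to let race resolve",
]
drops = count_strategy_mismatches(sample)
print(dict(drops))
```

A non-zero count on the quorum members while one mon keeps calling elections would point at the strategy mismatch tracked in #47654.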
Actions #2

Updated by Neha Ojha about 3 years ago

  • Is duplicate of Bug #47654: test_mon_pg: mon fails to join quorum to due election strategy mismatch added