Bug #48790

open

rados/multimon: MON_DOWN in mon_election/connectivity with mon_clock_no_skews

Added by Neha Ojha over 3 years ago. Updated over 3 years ago.

Status: New
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

rados/multimon/{clusters/9 mon_election/connectivity msgr-failures/many msgr/async-v1only no_pools objectstore/bluestore-low-osd-mem-target rados supported-random-distro$/{ubuntu_latest} tasks/mon_clock_no_skews}

2021-01-07T02:01:10.280 INFO:teuthology.orchestra.run.smithi132.stdout:2021-01-07T02:00:26.957803+0000 mon.b (mon.0) 18 : cluster [WRN] Health check failed: 3/9 mons down, quorum b,a,e,d,f,i (MON_DOWN)

/a/nojha-2021-01-07_00:06:49-rados-master-distro-basic-smithi/5760916

Actions #1

Updated by Neha Ojha over 3 years ago

  • Subject changed from rados/multimon: MON_DOWN in cluster log with mon_clock_no_skews to rados/multimon: MON_DOWN in mon_election/connectivity with mon_clock_no_skews

My first impression is that this is related to election_strategy connectivity.

All mons except rank 7 are in quorum here (8 of 9):

2021-01-07T02:00:21.926+0000 7f6e23fa6700 10 mon.c@2(peon) e1 lose_election, epoch 8 leader is mon1 quorum is 0,1,2,3,4,5,6,8 features are 4540138297136906239 mon_features are mon_feature_t([kraken,luminous,mimic,osdmap-prune,nautilus,octopus,pacific,elector-pinging]) min_mon_release pacific

3 MONs are detected down here:

2021-01-07T02:00:26.957803+0000 mon.b (mon.0) 18 : cluster [WRN] Health check failed: 3/9 mons down, quorum b,a,e,d,f,i (MON_DOWN)

but the test passes because all mons have formed quorum by the time it checks:

2021-01-07T02:00:36.919 INFO:tasks.mon_clock_skew_check.ceph_manager:quorum is size 9
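
As a quick cross-check of the quorum lines quoted above, here is a minimal Python sketch that lists which members are absent from a reported quorum. The helper name missing_members is made up for illustration. The rank range 0..8 follows from the cluster having 9 mons; the assumption that the mons are named a through i is mine, since the excerpts only show some of the names.

```python
def missing_members(quorum, all_members):
    """Return the members of all_members absent from a comma-separated quorum string."""
    present = {m.strip() for m in quorum.split(",") if m.strip()}
    return [str(m) for m in all_members if str(m) not in present]


# Ranks from the lose_election line at epoch 8 (9 mons, ranks 0..8).
print(missing_members("0,1,2,3,4,5,6,8", range(9)))

# Names from the MON_DOWN health warning.
# ASSUMPTION: the nine mons are named a..i; the log excerpts do not list them all.
print(missing_members("b,a,e,d,f,i", "abcdefghi"))
```

Under that naming assumption, the second call reports c, g and h, which is consistent with mon.c being one of the mons marked down.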

Looking at mon.c's log (mon.c was one of the 3 mons marked down), we see:

2021-01-07T02:00:21.926+0000 7f6e23fa6700 10 mon.c@2(peon) e1 lose_election, epoch 8 leader is mon1 quorum is 0,1,2,3,4,5,6,8 features are 4540138297136906239 mon_features are mon_feature_t([kraken,luminous,mimic,osdmap-prune,nautilus,octopus,pacific,elector-pinging]) min_mon_release pacific -> makes sense
2021-01-07T02:00:21.926+0000 7f6e23fa6700 10 mon.c@2(peon).paxos(paxos recovering c 1..3) peon_init -- i am a peon

2021-01-07T02:00:21.926+0000 7f6e23fa6700 20 mon.c@2(peon) e1 do_stretch_mode_election_work
2021-01-07T02:00:21.942+0000 7f6e23fa6700  1 -- v1:172.21.15.160:6789/0 <== mon.1 v1:172.21.15.132:6789/0 26 ==== election(1076525c-5b9f-4efa-80eb-f3b66af6f844 propose rel 16 e11) v9 ==== 3488+0+0 (unknown 4272849824 0 0) 0x56313fa04d80 con 0x56313ec2bc00
2021-01-07T02:00:21.942+0000 7f6e23fa6700 10 paxos.2).electionLogic(11) propose from rank=1,score=7; my score=6; currently acked -1,score=-1
2021-01-07T02:00:21.942+0000 7f6e23fa6700  5 paxos.2).electionLogic(11) defer to 1, disallowed_leaders= -> so far so good

2021-01-07T02:00:21.962+0000 7f6e23fa6700  1 -- v1:172.21.15.160:6789/0 <== mon.0 v1:172.21.15.16:6789/0 49 ==== election(1076525c-5b9f-4efa-80eb-f3b66af6f844 propose rel 16 e13) v9 ==== 3556+0+0 (unknown 42469048 0 0) 0x56313f96d840 con 0x56313ec2c800

2021-01-07T02:00:21.962+0000 7f6e23fa6700 10 paxos.2).electionLogic(13) propose from rank=0,score=7; my score=6; currently acked 1,score=7
2021-01-07T02:00:21.962+0000 7f6e23fa6700  5 paxos.2).electionLogic(13) Bumping epoch and starting new election; acked 1 should defer to 0 but there is score disagreement! -> not sure why this happened

2021-01-07T02:00:21.962+0000 7f6e23fa6700  1 -- v1:172.21.15.160:6789/0 send_to--> mon v1:172.21.15.16:6789/0 -- election(1076525c-5b9f-4efa-80eb-f3b66af6f844 propose rel 16 e15) v9 -- ?+0 0x56313fa04b40

2021-01-07T02:00:26.962+0000 7f6e23fa6700 10 paxos.2).electionLogic(21) propose from rank=0,score=8; my score=8; currently acked 7,score=8
2021-01-07T02:00:26.962+0000 7f6e23fa6700  5 paxos.2).electionLogic(21) Bumping epoch and starting new election; acked 7 should defer to 0 but there is score disagreement!

Finally:

2021-01-07T02:00:26.986+0000 7f6e23fa6700 10 mon.c@2(peon) e1 lose_election, epoch 26 leader is mon0 quorum is 0,1,2,3,4,5,6,7,8 features are 4540138297136906239 mon_features are mon_feature_t([kraken,luminous,mimic,osdmap-prune,nautilus,octopus,pacific,elector-pinging]) min_mon_release pacific

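There seems to be a period of more than 5 seconds (from lose_election at epoch 8, 02:00:21.926, to lose_election at epoch 26, 02:00:26.986) during which mon.c is caught in some sort of election storm.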
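
To measure that window from a mon log rather than by eye, a rough Python sketch along these lines could help. It is only an illustration: the regular expression is fitted to the lose_election lines quoted in this ticket, the function names parse_lose_election and storm_window are hypothetical, and the sample lines are cut off after the quorum list for brevity.

```python
import re
from datetime import datetime

# Fitted to the "lose_election" lines quoted in this ticket; other log
# formats or debug levels may need a different pattern.
LOSE_RE = re.compile(
    r"^(?P<ts>\S+)\s.*\blose_election, epoch (?P<epoch>\d+) "
    r"leader is mon(?P<leader>\d+) quorum is (?P<quorum>[\d,]+)"
)


def parse_lose_election(line):
    """Return (timestamp, epoch, leader_rank, quorum_ranks), or None if no match."""
    m = LOSE_RE.search(line)
    if m is None:
        return None
    ts = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%S.%f%z")
    quorum = [int(r) for r in m["quorum"].split(",")]
    return ts, int(m["epoch"]), int(m["leader"]), quorum


def storm_window(first_line, last_line):
    """Return (seconds elapsed, epochs advanced) between two lose_election lines."""
    t0, e0, _, _ = parse_lose_election(first_line)
    t1, e1, _, _ = parse_lose_election(last_line)
    return (t1 - t0).total_seconds(), e1 - e0


# Sample lines copied from mon.c's log above, truncated after the quorum list.
first = ("2021-01-07T02:00:21.926+0000 7f6e23fa6700 10 mon.c@2(peon) e1 "
         "lose_election, epoch 8 leader is mon1 quorum is 0,1,2,3,4,5,6,8")
last = ("2021-01-07T02:00:26.986+0000 7f6e23fa6700 10 mon.c@2(peon) e1 "
        "lose_election, epoch 26 leader is mon0 quorum is 0,1,2,3,4,5,6,7,8")

seconds, epochs = storm_window(first, last)
print(f"window: {seconds:.3f} s, epochs advanced: {epochs}")
```

The result can be compared against the "> 5 secs" estimate; the number of epochs advanced gives a rough sense of how many times the election was restarted in that window.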

Actions #2

Updated by Neha Ojha over 3 years ago

  • Assignee set to Greg Farnum

Greg, could you please take a look and see if my theory makes sense?
