Bug #43584

MON_DOWN during mon_join process

Added by Sage Weil about 4 years ago. Updated 3 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 100%
Source: Development
Tags: backport_processed
Backport: pacific, octopus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/sage-2020-01-12_21:37:03-rados-wip-sage-testing-2020-01-12-0621-distro-basic-smithi/4660691

2020-01-12T22:21:05.302+0000 7f1744ff9700  1 -- v1:172.21.15.168:6789/0 send_to--> mon v1:172.21.15.173:6789/0 -- mon_join(b v1:172.21.15.168:6789/0) v2 -- ?+0 0x557b302ac240

the leader bootstraps,
2020-01-12T22:21:05.309+0000 7f1744ff9700  1 -- v1:172.21.15.168:6789/0 <== mon.0 v1:172.21.15.173:6789/0 6 ==== election(84f14cf6-3589-11ea-99da-001a4aab830c propose rel 15 e11) v8 ==== 353+0+0 (unknown 1008848899 0 0) 0x557b30148c00 con 0x557b2f1f7180

then the joiner bootstraps,
2020-01-12T22:21:07.302+0000 7f17477fe700 10 mon.b@-1(probing) e2 bootstrap

and gets the new monmap and bootstraps again,
2020-01-12T22:21:07.302+0000 7f1744ff9700 10 mon.b@-1(probing) e2 handle_probe mon_probe(reply 84f14cf6-3589-11ea-99da-001a4aab830c name c paxos( fc 1 lc 114 ) mon_release octopus) v7
2020-01-12T22:21:07.302+0000 7f1744ff9700 10 mon.b@-1(probing) e2 handle_probe_reply mon.1 v1:172.21.15.173:6790/0 mon_probe(reply 84f14cf6-3589-11ea-99da-001a4aab830c name c paxos( fc 1 lc 114 ) mon_release octopus) v7
2020-01-12T22:21:07.302+0000 7f1744ff9700 10 mon.b@-1(probing) e2  monmap is e2: 2 mons at {a=v1:172.21.15.173:6789/0,c=v1:172.21.15.173:6790/0}
2020-01-12T22:21:07.302+0000 7f1744ff9700 10 mon.b@-1(probing) e2  got newer/committed monmap epoch 3, mine was 2
2020-01-12T22:21:07.302+0000 7f1744ff9700 10 mon.b@-1(probing) e3 bootstrap

but misses out on the first election cycle, resulting in a MON_DOWN from the leader.
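
For anyone reading similar logs, here is a rough sketch of pulling the monitor name, rank, state, and monmap epoch out of the per-line prefix shown above (e.g. `mon.b@-1(probing) e2`). The regular expression is inferred only from the lines in this excerpt, and the helper names `parse_mon_line` and `bootstrap_epochs` are made up for illustration; other log lines may not match.

```python
import re

# Prefix format inferred from the excerpt above, e.g.
#   "mon.b@-1(probing) e2 bootstrap"
# name=b, rank=-1, state=probing, monmap epoch=2. This is an assumption
# based on these sample lines, not a full description of the log format.
PREFIX = re.compile(
    r"mon\.(?P<name>\w+)@(?P<rank>-?\d+)\((?P<state>\w+)\)"
    r" e(?P<epoch>\d+)\s+(?P<msg>.*)"
)


def parse_mon_line(line):
    """Return a dict of prefix fields, or None if the line does not match."""
    m = PREFIX.search(line)
    if m is None:
        return None
    return {
        "name": m.group("name"),
        "rank": int(m.group("rank")),
        "state": m.group("state"),
        "epoch": int(m.group("epoch")),
        "msg": m.group("msg").strip(),
    }


def bootstrap_epochs(lines):
    """Return the monmap epoch recorded at each 'bootstrap' line, in order."""
    epochs = []
    for line in lines:
        rec = parse_mon_line(line)
        if rec is not None and rec["msg"] == "bootstrap":
            epochs.append(rec["epoch"])
    return epochs
```

Run against the joiner's lines above, `bootstrap_epochs` would report one bootstrap at epoch 2 and a second at epoch 3, which is the double bootstrap that makes mon.b miss the first election.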

Related issues

Duplicated by RADOS - Bug #52724: octopus: 1/3 mons down, quorum a,c (MON_DOWN)" in cluster log' Duplicate
Copied to RADOS - Backport #52746: octopus: MON_DOWN during mon_join process Rejected
Copied to RADOS - Backport #52747: pacific: MON_DOWN during mon_join process Resolved

History

#1 Updated by Greg Farnum about 4 years ago

  • Priority changed from High to Normal

I'm pretty sure this is a test issue, since we don't make guarantees about monitor elections, especially on first boot-up. Let's see if it comes up again.

#2 Updated by Sage Weil almost 3 years ago

/a/sage-2021-02-26_22:19:00-rados-wip-sage-testing-2021-02-26-1412-distro-basic-smithi/5917141

I think this is a general problem with adding mons: it's relatively easy to miss the first election and trigger a harmless MON_DOWN.

We could
- only issue a MON_DOWN if quorum is incomplete for some period of time (20 seconds?).
- suppress MON_DOWN for a similar amount of time just after a new monmap version is published.

The second option seems relatively unobtrusive and ought to capture most of these cases?
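
A minimal sketch of the second option, assuming a simple timestamp comparison. This is not the code from the actual fix; the class `MonDownCheck`, its fields, and the 20-second default (borrowed from the guess in the first option) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MonDownCheck:
    """Hypothetical health check that stays quiet right after a monmap change."""

    grace_seconds: float = 20.0                 # assumed value, not a real option
    last_monmap_change: Optional[float] = None  # time the newest monmap was published

    def note_monmap_change(self, now: float) -> None:
        # Call whenever a new monmap version is published.
        self.last_monmap_change = now

    def should_warn(self, monmap_size: int, quorum_size: int, now: float) -> bool:
        # Full quorum: nothing to report.
        if quorum_size >= monmap_size:
            return False
        # Inside the grace window after a monmap change: suppress the warning,
        # since a newly added mon may simply have missed the first election.
        if (
            self.last_monmap_change is not None
            and now - self.last_monmap_change < self.grace_seconds
        ):
            return False
        return True
```

With a monmap published at t=100 and a 2-of-3 quorum, `should_warn(3, 2, now=105.0)` returns False, while the same call at `now=121.0` returns True. A mon that is still missing once the window has passed is reported as before.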

#3 Updated by Sage Weil over 2 years ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 42366

#4 Updated by Kefu Chai over 2 years ago

  • Status changed from Fix Under Review to Resolved

#5 Updated by Neha Ojha over 2 years ago

  • Status changed from Resolved to Pending Backport
  • Backport set to pacific, octopus

#6 Updated by Backport Bot over 2 years ago

  • Copied to Backport #52746: octopus: MON_DOWN during mon_join process added

#7 Updated by Backport Bot over 2 years ago

  • Copied to Backport #52747: pacific: MON_DOWN during mon_join process added

#8 Updated by Laura Flores over 1 year ago

  • Duplicated by Bug #52724: octopus: 1/3 mons down, quorum a,c (MON_DOWN)" in cluster log' added

#9 Updated by Laura Flores over 1 year ago

/a/yuriw-2022-06-14_20:42:00-rados-wip-yuri2-testing-2022-06-14-0949-octopus-distro-default-smithi/6878197

#10 Updated by Laura Flores over 1 year ago

/a/yuriw-2022-07-19_23:25:12-rados-wip-yuri2-testing-2022-07-15-0755-pacific-distro-default-smithi/6939512

#11 Updated by Backport Bot over 1 year ago

  • Tags set to backport_processed

#12 Updated by Laura Flores over 1 year ago

/a/yuriw-2022-10-05_20:44:57-rados-wip-yuri4-testing-2022-10-05-0917-pacific-distro-default-smithi/7055594

#13 Updated by Konstantin Shalygin 3 months ago

  • Status changed from Pending Backport to Resolved
  • Assignee set to Sage Weil
  • % Done changed from 0 to 100
  • Source set to Development
