Project

General

Profile

Bug #24445

Monitor election continues for 10 minutes and Ceph cluster at inaccessible state

Added by Suvendu Mitra about 1 year ago. Updated 4 months ago.

Status:
Need More Info
Priority:
Normal
Assignee:
-
Category:
Monitor
Target version:
-
Start date:
06/07/2018
Due date:
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

Ceph version - 12.2.2, Blue store
Election starts on 2018-06-05 03:12:13 and End at 2018-06-05 03:22:22.
Controller-1 several time leader selected but still continue election
Some portion relevant log from all controller. Log cann't be attached as size of log more than 1MB. If you need more info please ask. ============
Controller-1 =============
2018-06-05 03:12:13.808945 7f2d59ebb700 0 log_channel(cluster) log [INF] : mon.controller-1 calling new monitor election
2018-06-05 03:12:13.809005 7f2d59ebb700 1 mon.controller-1@0(electing).elector(75) init, last seen epoch 75, mid-election, bumping
2018-06-05 03:12:13.817176 7f2d59ebb700 0 log_channel(cluster) log [INF] : mon.controller-1@0 won leader election with quorum 0,1,2
2018-06-05 03:12:15.320861 7f2d59ebb700 0 log_channel(cluster) log [INF] : mon.controller-1 calling new monitor election
2018-06-05 03:12:15.320935 7f2d59ebb700 1 mon.controller-1@0(electing).elector(79) init, last seen epoch 79, mid-election, bumping
2018-06-05 03:12:16.252411 7f2d59ebb700 0 log_channel(cluster) log [INF] : mon.controller-1@0 won leader election with quorum 0,1,2
2018-06-05 03:12:17.756912 7f2d59ebb700 0 log_channel(cluster) log [INF] : mon.controller-1 calling new monitor election
2018-06-05 03:12:17.756974 7f2d59ebb700 1 mon.controller-1@0(electing).elector(83) init, last seen epoch 83, mid-election, bumping
2018-06-05 03:12:18.835632 7f2d59ebb700 0 log_channel(cluster) log [INF] : mon.controller-1@0 won leader election with quorum 0,1,2
2018-06-05 03:12:20.351715 7f2d59ebb700 0 log_channel(cluster) log [INF] : mon.controller-1 calling new monitor election
2018-06-05 03:12:20.351771 7f2d59ebb700 1 mon.controller-1@0(electing).elector(87) init, last seen epoch 87, mid-election, bumping
2018-06-05 03:12:21.275099 7f2d59ebb700 0 log_channel(cluster) log [INF] : mon.controller-1@0 won leader election with quorum 0,1,2
2018-06-05 03:12:22.791736 7f2d59ebb700 0 log_channel(cluster) log [INF] : mon.controller-1 calling new monitor election
....
2018-06-05 03:21:00.909635 7f2d59ebb700 0 log_channel(cluster) log [INF] : mon.controller-1 calling new monitor election
2018-06-05 03:21:00.909726 7f2d59ebb700 1 mon.controller-1@0(electing).elector(939) init, last seen epoch 939, mid-election, bumping
2018-06-05 03:21:01.835448 7f2d59ebb700 0 log_channel(cluster) log [INF] : mon.controller-1@0 won leader election with quorum 0,1,2
2018-06-05 03:21:03.339673 7f2d59ebb700 0 log_channel(cluster) log [INF] : mon.controller-1 calling new monitor election
2018-06-05 03:21:03.339767 7f2d59ebb700 1 mon.controller-1@0(electing).elector(943) init, last seen epoch 943, mid-election, bumping
2018-06-05 03:21:04.286083 7f2d59ebb700 0 log_channel(cluster) log [INF] : mon.controller-1@0 won leader election with quorum 0,1,2
2018-06-05 03:21:05.790830 7f2d59ebb700 0 log_channel(cluster) log [INF] : mon.controller-1 calling new monitor election
2018-06-05 03:21:05.790952 7f2d59ebb700 1 mon.controller-1@0(electing).elector(947) init, last seen epoch 947, mid-election, bumping
2018-06-05 03:21:06.736903 7f2d59ebb700 0 log_channel(cluster) log [INF] : mon.controller-1@0 won leader election with quorum 0,1,2
2018-06-05 03:21:08.263558 7f2d59ebb700 0 log_channel(cluster) log [INF] : mon.controller-1 calling new monitor election
2018-06-05 03:21:08.263647 7f2d59ebb700 1 mon.controller-1@0(electing).elector(951) init, last seen epoch 951, mid-election, bumping
2018-06-05 03:21:09.064754 7f2d59ebb700 0 log_channel(cluster) log [INF] : mon.controller-1@0 won leader election with quorum 0,1,2
2018-06-05 03:21:10.586801 7f2d59ebb700 0 log_channel(cluster) log [INF] : mon.controller-1 calling new monitor election
2018-06-05 03:21:10.586882 7f2d59ebb700 1 mon.controller-1@0(electing).elector(955) init, last seen epoch 955, mid-election, bumping
2018-06-05 03:21:12.601758 7f2d5c6c0700 0 log_channel(cluster) log [INF] : mon.controller-1@0 won leader election with quorum 0,2
2018-06-05 03:22:22.027875 7f2d59ebb700 0 log_channel(cluster) log [INF] : mon.controller-1 calling new monitor election
2018-06-05 03:22:22.042367 7f2d59ebb700 0 log_channel(cluster) log [INF] : mon.controller-1 calling new monitor election
2018-06-05 03:22:22.042459 7f2d59ebb700 1 mon.controller-1@0(electing).elector(961) init, last seen epoch 961, mid-election, bumping
2018-06-05 03:22:22.057284 7f2d59ebb700 0 log_channel(cluster) log [INF] : mon.controller-1@0 won leader election with quorum 0,1,2

============
Controller-2 =============
2018-06-05 03:21:03.476421 7f49e3bbf700 1 mon.controller-2@1(electing).elector(943) init, last seen epoch 943, mid-election, bumping
2018-06-05 03:21:05.902118 7f49e3bbf700 0 log_channel(cluster) log [INF] : mon.controller-2 calling new monitor election
2018-06-05 03:21:05.902244 7f49e3bbf700 1 mon.controller-2@1(electing).elector(947) init, last seen epoch 947, mid-election, bumping
2018-06-05 03:21:08.275410 7f49e3bbf700 0 log_channel(cluster) log [INF] : mon.controller-2 calling new monitor election
2018-06-05 03:21:08.275514 7f49e3bbf700 1 mon.controller-2@1(electing).elector(951) init, last seen epoch 951, mid-election, bumping
2018-06-05 03:21:13.544838 7f49e3bbf700 0 log_channel(cluster) log [INF] : mon.controller-2 calling new monitor election
2018-06-05 03:21:13.544932 7f49e3bbf700 1 mon.controller-2@1(electing).elector(955) init, last seen epoch 955, mid-election, bumping
2018-06-05 03:21:16.562804 7f49e63c4700 1 mon.controller-2@1(electing).elector(959) init, last seen epoch 959, mid-election, bumping
2018-06-05 03:21:19.349413 7f49e63c4700 0 log_channel(cluster) log [INF] : mon.controller-2@1 won leader election with quorum 1,2
2018-06-05 03:22:22.042163 7f49e3bbf700 0 log_channel(cluster) log [INF] : mon.controller-2 calling new monitor election

============
Controller-3 =============
2018-06-04 03:17:11.255774 7f6da2621700 1 mon.controller-3@2(peon).paxos(paxos updating c 142067..142734) lease_timeout -- calling new election
2018-06-05 03:12:13.088705 7f6da2621700 1 mon.controller-3@2(peon).paxos(paxos active c 221383..221999) lease_timeout -- calling new election
2018-06-05 03:12:15.318580 7f6da2621700 1 mon.controller-3@2(peon).paxos(paxos recovering c 221383..221999) lease_timeout -- calling new election
2018-06-05 03:12:17.753885 7f6da2621700 1 mon.controller-3@2(peon).paxos(paxos recovering c 221383..221999) lease_timeout -- calling new election
2018-06-05 03:12:20.337063 7f6da2621700 1 mon.controller-3@2(peon).paxos(paxos updating c 221383..221999) lease_timeout -- calling new election
2018-06-05 03:12:22.776650 7f6da2621700 1 mon.controller-3@2(peon).paxos(paxos updating c 221383..221999) lease_timeout -- calling new election
2018-06-05 03:12:25.266439 7f6da2621700 1 mon.controller-3@2(peon).paxos(paxos updating c 221383..221999) lease_timeout -- calling new election
2018-06-05 03:12:27.750021 7f6da2621700 1 mon.controller-3@2(peon).paxos(paxos recovering c 221383..221999) lease_timeout -- calling new election
2018-06-05 03:12:30.200396 7f6da2621700 1 mon.controller-3@2(peon).paxos(paxos updating c 221383..221999) lease_timeout -- calling new election
2018-06-05 03:12:32.851216 7f6da2621700 1 mon.controller-3@2(peon).paxos(paxos recovering c 221383..221999) lease_timeout -- calling new election
2018-06-05 03:12:35.287178 7f6da2621700 1 mon.controller-3@2(peon).paxos(paxos recovering c 221383..221999) lease_timeout -- calling new election
2018-06-05 03:12:37.627028 7f6da2621700 1 mon.controller-3@2(peon).paxos(paxos updating c 221383..221999) lease_timeout -- calling new election
2018-06-05 03:12:40.065024 7f6da2621700 1 mon.controller-3@2(peon).paxos(paxos updating c 221383..221999) lease_timeout -- calling new election
2018-06-05 03:12:42.545764 7f6da2621700 1 mon.controller-3@2(peon).paxos(paxos recovering c 221383..221999) lease_timeout -- calling new election
2018-06-05 03:12:45.126900 7f6da2621700 1 mon.controller-3@2(peon).paxos(paxos recovering c 221383..221999) lease_timeout -- calling new election
2018-06-05 03:12:47.667329 7f6da2621700 1 mon.controller-3@2(peon).paxos(paxos updating c 221383..221999) lease_timeout -- calling new election
.............

History

#1 Updated by Joao Eduardo Luis about 1 year ago

  • Status changed from New to Need More Info

The monitors are calling new elections because the paxos lease is timing out:

Controller-3 =============
2018-06-04 03:17:11.255774 7f6da2621700 1 mon.controller-3@2(peon).paxos(paxos updating c 142067..142734) lease_timeout -- calling new election
2018-06-05 03:12:13.088705 7f6da2621700 1 mon.controller-3@2(peon).paxos(paxos active c 221383..221999) lease_timeout -- calling new election
2018-06-05 03:12:15.318580 7f6da2621700 1 mon.controller-3@2(peon).paxos(paxos recovering c 221383..221999) lease_timeout -- calling new election
2018-06-05 03:12:17.753885 7f6da2621700 1 mon.controller-3@2(peon).paxos(paxos recovering c 221383..221999) lease_timeout -- calling new election

This is typically due to either network issues, or overload on the monitors, and does not necessarily mean there's a bug.

#2 Updated by Joao Eduardo Luis about 1 year ago

  • Category set to Monitor
  • Source set to other

#3 Updated by Joao Eduardo Luis about 1 year ago

To ascertain if there's any misbehavior on the monitors, please run the monitors with `debug ms = 1`, and check when the paxos lease messages and their acks are sent, and received. You may want to play with the paxos lease timeout configurables, to tolerate slower networks or overloaded monitors.

#4 Updated by Suvendu Mitra about 1 year ago

Thanks for the update we will run test again with
""
[global]
debug ms = 1/5
""
Regrading paxos lease timeout I can't find any parameter in ceph.conf do you meant "mon lease" and "mon lease ack timeout factor"

#5 Updated by Swami Reddy 4 months ago

Hello - we are using 12.2.4 version and hitting same bug today.
Are there any work-around for this?

Is it fixed ?

#6 Updated by Swami Reddy 4 months ago

Affected Versions:
v12.2.4

#7 Updated by Joao Eduardo Luis 4 months ago

hi Swami,

please refer to my former comment (https://tracker.ceph.com/issues/24445#note-3) and provide the information I asked. The original reporter never got back to us, so I can only presume their issues went away.

Also available in: Atom PDF