Bug #3835: mon: timecheck: hits FAILED assert(m->epoch == timecheck_epoch) when monitors are severely lagged - Ceph - Ceph

Actions

Copy link

Bug #3835

closed

mon: timecheck: hits FAILED assert(m->epoch == timecheck_epoch) when monitors are severely lagged

Added by Joao Eduardo Luis over 11 years ago. Updated over 11 years ago.

Status:

Resolved

Priority:

High

Assignee:

Sage Weil

Category:

Monitor

Target version:

% Done:

Source:

Development

Tags:

Backport:

Regression:

Severity:

2 - major

Reviewed:

Affected Versions:

ceph-qa-suite:

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

The leader schedules a new 'ping' to the monitors in the quorum as soon as the pings are all sent.

This allows for a new ping to be scheduled before the pongs are received, thus hitting the assert.

The solution is to schedule a new ping only after all the pongs have been gathered. If a monitors fails and doesn't answer to a ping, then we will wait for a new election to happen to clean up the state (as is currently done); this would however mean that we would go without an updated latency for the failed monitor until the quorum is updated, but that's okay given that we would clean up the time check statistics anyway after the election.

~~Also, unrelated to the bug, but I'll take this opportunity to fix it: the timecheck epochs are being incremented by two instead of by one. This poses absolutely no problem, but it's worth fixing.~~

Update: timecheck epochs are increased by two because they are mapped to election epochs. For a moment there I was confusing 'timecheck epochs' with 'timecheck rounds': 'epochs' map to election epochs, 'rounds' map to the round within each election epoch.

The problem appears to be that, a severely lagged monitor can reply to an older timecheck epoch simply because it is not aware of any elections yet, not is it aware that it is not in the quorum. See this transcript:

07:24:35.872807 7f3b937fe700 10 mon.c@0(leader) e1 timecheck start timecheck epoch 916 round 1
07:24:35.872812 7f3b937fe700 10 mon.c@0(leader) e1 timecheck send time_check( ping e 916 r 1 ) v1 to mon.1 10.214.136.16:6789/0
07:24:35.872816 7f3b937fe700  1 -- 10.214.136.8:6789/0 --> mon.1 10.214.136.16:6789/0 -- time_check( ping e 916 r 1 ) v1 -- ?+0 0x7f3b796b3110
07:24:35.872822 7f3b937fe700 10 mon.c@0(leader) e1 timecheck send time_check( ping e 916 r 1 ) v1 to mon.2 10.214.136.18:6789/0
07:24:35.872826 7f3b937fe700  1 -- 10.214.136.8:6789/0 --> mon.2 10.214.136.18:6789/0 -- time_check( ping e 916 r 1 ) v1 -- ?+0 0x7f3b781b6e50

[snip]

07:24:35.873849 7f3b937fe700  1 -- 10.214.136.8:6789/0 <== mon.2 10.214.136.18:6789/0 496413298 ==== time_check( pong e 916 r 1 ts 2013-01-17 07:24:35.872635 ) v1 ==== 36+0+0 (4081874320 0 0) 0x7f3b7c002ad0 con 0x1beb7e0

[snip]

07:24:50.759128 7f3b93fff700  5 mon.c@0(electing).elector(917) election timer expired
07:24:50.759146 7f3b93fff700 10 mon.c@0(electing).elector(917) bump_epoch 917 to 918
07:24:50.759150 7f3b93fff700 15 store(/var/lib/ceph/mon/ceph-c) set_int election_epoch = 918
07:24:50.846152 7f3b93fff700  1 -- 10.214.136.8:6789/0 --> mon.2 10.214.136.18:6789/0 -- election(107416f0-9ef5-403a-ae3b-f907bab551e3 victory 918) v3 -- ?+0 0x7f3b6c064b70
07:24:50.846189 7f3b93fff700 10 mon.c@0(leader) e1 win_election, epoch 918 quorum is 0,2 features are 268435455
07:24:50.846198 7f3b93fff700  0 log [INF] : mon.c@0 won leader election with quorum 0,2
[snip]
07:24:51.089980 7f3b93fff700 10 mon.c@0(leader) e1 timecheck
07:24:51.089982 7f3b93fff700 10 mon.c@0(leader) e1 timecheck start timecheck epoch 918 round 1
07:24:51.089987 7f3b93fff700 10 mon.c@0(leader) e1 timecheck send time_check( ping e 918 r 1 ) v1 to mon.2 10.214.136.18:6789/0
07:24:51.089991 7f3b93fff700  1 -- 10.214.136.8:6789/0 --> mon.2 10.214.136.18:6789/0 -- time_check( ping e 918 r 1 ) v1 -- ?+0 0x7f3b6c73e430
07:24:51.090000 7f3b93fff700 10 mon.c@0(leader) e1 timecheck setting up next event and timeout

[snip]

07:25:03.766806 7f3b937fe700  1 -- 10.214.136.8:6789/0 <== mon.1 10.214.136.16:6789/0 902699968 ==== time_check( pong e 916 r 1 ts 2013-01-17 07:25:03.509240 ) v1 ==== 36+0+0 (2051345575 0 0) 0x7f3b74000c90 con 0x1bead70
07:25:03.766813 7f3b937fe700 20 mon.c@0(leader) e1 have connection
07:25:03.766815 7f3b937fe700 20 mon.c@0(leader) e1 ms_dispatch existing session MonSession: mon.1 10.214.136.16:6789/0 is openallow * for mon.1 10.214.136.16:6789/0
07:25:03.766818 7f3b937fe700 20 mon.c@0(leader) e1  caps allow *
07:25:03.766820 7f3b937fe700 10 mon.c@0(leader) e1 handle_timecheck time_check( pong e 916 r 1 ts 2013-01-17 07:25:03.509240 ) v1
07:25:03.766824 7f3b937fe700 10 mon.c@0(leader) e1 handle_timecheck_leader time_check( pong e 916 r 1 ts 2013-01-17 07:25:03.509240 ) v1
07:25:03.789544 7f3b937fe700 -1 mon/Monitor.cc: In function 'void Monitor::handle_timecheck_leader(MTimeCheck*)' thread 7f3b937fe700 time 2013-01-17 07:25:03.766828
mon/Monitor.cc: 2346: FAILED assert(m->epoch == timecheck_epoch)

See,

we initially send the timecheck to both monitors 1 and 2;
mon.2 replies soon after;
eventually a new election is triggered due to a collect timeout (mon.1's at fault here);
the election finished with quorum 0,2;
a new timecheck epoch starts, bumping the epoch to the new election epoch (e 918);
a ping is sent to mon.2;
mon.1 finally answers to timecheck epoch 916, roughly 30 seconds (and an election) after.