Bug #3587

closed

mon: election doesn't finish during heavy mon thrashing

Added by Joao Eduardo Luis over 11 years ago. Updated over 11 years ago.

Status: Resolved
Priority: High
Assignee: Joao Eduardo Luis
Category: Monitor
Target version:
% Done: 0%
Source: Development
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I was trying to trigger #3495 by running the following restart loops:

$ while [ 1 ]; do ./init-ceph restart mon.a ; sleep 30 ; done
$ while [ 1 ]; do ./init-ceph restart mds.a ; sleep 2 ; done
$ while [ 1 ]; do ./init-ceph restart osd.1 ; sleep 2 ; done

At some point, mon.a got stuck electing (I only noticed this after canceling its restart loop). My suspicion is that it happened after #3495 was triggered on mon.b during, or right before, an election cycle.

I've attached both mon.a's and mon.b's logs. mon.b's log contains the stack trace from #3495, but it may still be useful for investigating what happened, in case its failure had anything to do with the endless election cycle.


Files

mon.a.log (6.25 MB), Joao Eduardo Luis, 12/07/2012 04:19 AM
mon.b.log (4.26 MB), Joao Eduardo Luis, 12/07/2012 04:19 AM
mon.c.log (4.42 MB), Joao Eduardo Luis, 12/07/2012 05:18 AM
Actions #1

Updated by Joao Eduardo Luis over 11 years ago

  • Subject changed from mon: election doesn't finish during heavy osd/mds thrashing to mon: election doesn't finish during heavy mon thrashing
Actions #2

Updated by Joao Eduardo Luis over 11 years ago

This is caused by the fact that, from the other monitors' point of view, mon.a never left the quorum, so they simply ignore its election proposals as being 'old'.
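To make the failure mode concrete, here is a minimal Python sketch of the check described above. It is a toy model, not the monitor's actual code: the class `ToyPeer`, the method `handle_propose`, and the returned strings are all hypothetical names, and the branch for a sender outside the quorum is an assumption (the report only states what happens when the sender is still considered a quorum member).

```python
from dataclasses import dataclass, field


@dataclass
class ToyPeer:
    """Toy stand-in for a monitor that is already part of a quorum."""

    epoch: int                                # election epoch this peer has reached
    quorum: set = field(default_factory=set)  # names the peer believes are in quorum

    def handle_propose(self, sender: str, proposal_epoch: int) -> str:
        if proposal_epoch < self.epoch:
            if sender in self.quorum:
                # The peer never saw the sender leave the quorum, so a lower
                # epoch just looks like a delayed ('old') message.
                return "ignored"
            # Assumption: a sender outside the quorum is not silently dropped.
            return "considered"
        # Proposal is at least as recent as what the peer knows about.
        self.epoch = max(self.epoch, proposal_epoch)
        return "accepted"


# mon.a restarts and proposes with epoch 1 while its peers are at epoch 6.
peer = ToyPeer(epoch=6, quorum={"mon.a", "mon.b", "mon.c"})
print(peer.handle_propose("mon.a", 1))  # stale proposal from a quorum member
print(peer.handle_propose("mon.a", 6))  # same proposal with a matching epoch
```

In this model, the restarted monitor keeps proposing with an epoch its peers consider stale, so the election never completes on its side.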

In addition, the elector class writes its election epoch to the store each time it is bumped, but never reads it back. This means that the monitor always starts with an election epoch of 1, regardless of the last election it has seen. In this particular case, reading the stored election epoch would help: it would match the epoch held by the remaining monitors, so the election proposal would go through. This is a corner case, and it should be guaranteed that, whenever it happens, the other monitors have the same election epoch as mon.a; otherwise, a new quorum would have been formed without mon.a in it, and we wouldn't stumble upon this situation.
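The write-but-never-read behaviour can be sketched as follows. This is an illustrative Python model only: `ToyElector`, the `restore` flag, and the `election_epoch` key are hypothetical, a plain dict stands in for the monitor store, and nothing here is meant to describe what the actual fix does.

```python
class ToyElector:
    """Toy elector that persists its election epoch to a dict-backed store."""

    EPOCH_KEY = "election_epoch"  # hypothetical store key

    def __init__(self, store: dict, restore: bool):
        self.store = store
        if restore:
            # Read back the last persisted epoch, defaulting to 1 on first boot.
            self.epoch = store.get(self.EPOCH_KEY, 1)
        else:
            # Behaviour described in this report: always start from 1.
            self.epoch = 1

    def bump_epoch(self, new_epoch: int) -> None:
        if new_epoch <= self.epoch:
            raise ValueError("election epoch must only move forward")
        self.epoch = new_epoch
        self.store[self.EPOCH_KEY] = new_epoch  # written on every bump


store = {}
ToyElector(store, restore=False).bump_epoch(6)   # epoch 6 reaches the store

after_restart_buggy = ToyElector(store, restore=False)
after_restart_fixed = ToyElector(store, restore=True)
print(after_restart_buggy.epoch, after_restart_fixed.epoch)
```

With `restore=True`, the restarted elector resumes at the epoch its peers already hold, which is the condition under which the proposal would no longer be treated as 'old'.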

I'm also attaching mon.c's log, as it turned out to give the most insight into the matter.

Actions #3

Updated by Joao Eduardo Luis over 11 years ago

  • Status changed from New to Fix Under Review

Haven't been able to reproduce the bug since commit e6c15e73543593fc55ba3846197fb7f83f949bb7 from wip-3587.

Actions #4

Updated by Sage Weil over 11 years ago

  • Status changed from Fix Under Review to Resolved