Bug #3587
closed
mon: election doesn't finish during heavy mon thrashing
Description
I was trying to trigger #3495 using:
$ while [ 1 ]; do ./init-ceph restart mon.a ; sleep 30 ; done
$ while [ 1 ]; do ./init-ceph restart mds.a ; sleep 2 ; done
$ while [ 1 ]; do ./init-ceph restart osd.1 ; sleep 2 ; done
At some point, mon.a got stuck electing (I only noticed this after canceling its restart loop). My suspicion is that it happened after #3495 was triggered on mon.b during, or right before, an election cycle.
I've attached both mon.a's and mon.b's logs. mon.b's log contains the stack trace from #3495; it may help in investigating what happened, in case that failure had anything to do with the infinite election cycle.
Files
Updated by Joao Eduardo Luis over 11 years ago
- Subject changed from mon: election doesn't finish during heavy osd/mds thrashing to mon: election doesn't finish during heavy mon thrashing
Updated by Joao Eduardo Luis over 11 years ago
This is caused by the fact that, from the other monitors' point of view, mon.a never left the quorum, so they simply ignore its election proposals as being 'old'.
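The behaviour described above can be modelled roughly as follows. This is a simplified sketch, not the actual monitor code: the function name, its parameters, and the epoch values are hypothetical and chosen only to illustrate why a freshly restarted monitor's proposals get dropped.

```python
def peer_ignores_proposal(peer_epoch, proposal_epoch, proposer, quorum):
    """Simplified model (an assumption, not Ceph's real logic).

    A peer drops a proposal as 'old' when it carries a lower election
    epoch than the peer's own AND the peer still believes the proposer
    is a member of the current quorum.
    """
    if proposal_epoch >= peer_epoch:
        # Proposal is current or newer: the peer takes part in the election.
        return False
    # Lower epoch: stale only if the proposer supposedly never left.
    return proposer in quorum


# Hypothetical values: the peers are at epoch 12 and still list mon.a in
# the quorum, while the restarted mon.a proposes with epoch 1.
quorum = {"mon.a", "mon.b", "mon.c"}
ignored = peer_ignores_proposal(
    peer_epoch=12, proposal_epoch=1, proposer="mon.a", quorum=quorum
)
```

In this model, mon.a keeps proposing with a low epoch and the peers keep discarding the proposals, which would match the never-ending election seen in the logs.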
Also, the elector class writes its election epoch to the store each time it is bumped, but never reads it back. This means that the monitor will always start with an election epoch of 1, regardless of the last election it has seen. For this particular case, reading the election epoch would help, as it is the same as that of the remaining monitors, and the election proposal would then go through. This is a corner case, and it should be guaranteed that, when it happens, the other monitors will always have the same election epoch as mon.a; otherwise, it would mean that a new quorum had been formed without mon.a in it, and we wouldn't stumble upon this situation.
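A minimal sketch of the write-but-never-read problem, assuming a plain dict stands in for the monitor store. The class, the key name, and the `restore_epoch` flag are hypothetical and only illustrate the idea; they do not reflect the real elector implementation or the actual fix.

```python
class ElectorModel:
    """Toy elector that persists its election epoch (illustration only)."""

    EPOCH_KEY = "election_epoch"  # hypothetical store key

    def __init__(self, store, restore_epoch):
        self.store = store
        if restore_epoch:
            # Proposed behaviour: resume from the last persisted epoch.
            self.epoch = store.get(self.EPOCH_KEY, 1)
        else:
            # Behaviour described in this ticket: always start at 1.
            self.epoch = 1

    def bump_epoch(self, new_epoch):
        """Move to a newer epoch and persist it to the store."""
        if new_epoch <= self.epoch:
            raise ValueError("epoch must increase")
        self.epoch = new_epoch
        self.store[self.EPOCH_KEY] = new_epoch


store = {}
before_restart = ElectorModel(store, restore_epoch=False)
before_restart.bump_epoch(12)  # hypothetical epoch shared with the peers

# Restart without reading the store: the epoch falls back to 1.
buggy = ElectorModel(store, restore_epoch=False)

# Restart that reads the store: the epoch matches the peers again.
fixed = ElectorModel(store, restore_epoch=True)
```

With the epoch restored, the restarted monitor's proposal would carry the same epoch as the rest of the quorum instead of 1, so the peers would no longer have a reason to treat it as 'old'.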
Also attaching mon.c's log, as it turned out to offer the most insight into the matter.
Updated by Joao Eduardo Luis over 11 years ago
- Status changed from New to Fix Under Review
Haven't been able to reproduce the bug since commit e6c15e73543593fc55ba3846197fb7f83f949bb7 from wip-3587.
Updated by Sage Weil over 11 years ago
- Status changed from Fix Under Review to Resolved