Bug #5176

closed

leveldb: Compaction makes things time-out yielding spurious elections

Added by Sylvain Munaut almost 11 years ago. Updated almost 11 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: Monitor
Target version: -
% Done: 0%
Source: other
Tags:
Backport: cuttlefish
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

It seems that compaction can take a few seconds (despite running on 10k SAS disks) and can cause peons to not renew the lease on time.

The problem is made worse by a logic issue in the mon.

Once the compaction has run and taken some time, it may end up "propose_queued", and this cancels the "lease_renew" timeout. The problem is that this does not actually trigger an immediate renew: the actual renew only happens at the end of the update cycle, which itself takes a few seconds, and by then the lease will have expired.

Before cancelling the lease_renew timeout, the mon should check whether there is enough time left for an update cycle, and if not, trigger a lease renew immediately. This would give leveldb much more margin to do its thing without breaking quorum and forcing elections for nothing (potentially ejecting a mon and raising a HEALTH_WARN, which in turn triggers any monitoring system you might be using).
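
To make the suggested check concrete, here is a rough sketch in Python. It is not the mon code; the function name, the parameters and the 0.5 s safety margin are all made up for illustration, and the real logic would live in the C++ Paxos code.

```python
def should_renew_before_propose(now, lease_expire, expected_cycle, safety_margin=0.5):
    """Decide whether to renew the lease right away instead of waiting
    for the end of the update cycle.

    All names and the default margin are hypothetical, not Ceph's.
    Times are in seconds on the same clock.
    """
    # Time remaining before the peons consider the lease expired.
    time_left = lease_expire - now
    # If a full update cycle (plus some margin) does not fit in the
    # remaining time, the renew at the end of the cycle would come too late.
    return time_left < expected_cycle + safety_margin


# Lease granted at t=0 for 5 s; an update cycle is assumed to take about 2 s.
# Compaction stalled until t=3: only 2 s left, so renew first.
late = should_renew_before_propose(now=3.0, lease_expire=5.0, expected_cycle=2.0)
# No stall, t=1: 4 s left, so the normal cycle fits.
early = should_renew_before_propose(now=1.0, lease_expire=5.0, expected_cycle=2.0)
```

With a check like this, the lease_renew timeout would only be cancelled when the function returns False; otherwise the renew would be sent first and the proposal handled afterwards.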
