Bug #5176

leveldb: Compaction makes things time-out yielding spurious elections

Added by Sylvain Munaut almost 11 years ago. Updated almost 11 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Sage Weil
Category:
Monitor
Target version:
-
% Done:
0%

Source:
other
Tags:
Backport:
cuttlefish
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

It seems that compaction can take a few seconds (despite running on 10k SAS disks) and can cause peons to not renew the lease on time.

The problem is made worse by a logic issue in the mon.

Once the compaction has run and taken some time, the monitor may end up in "propose_queued", which cancels the "lease_renew" timeout. The problem is that this does not actually trigger an immediate renew; the actual renew only happens at the end of the update cycle, which takes a few seconds by itself, and by then the lease will have expired.

Before cancelling the lease_renew timeout, it should check whether there is enough time left for a full update cycle or whether it should trigger a lease renew immediately. This would give leveldb much more margin to do its thing without breaking quorum and forcing elections for nothing (and potentially ejecting a mon, triggering a HEALTH_WARN, and in turn triggering whatever monitoring system you might be using).
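
For illustration, here is a minimal sketch of the check proposed above, written as self-contained C++ with placeholder names (LeaseState, expected_update_time, extend_lease and the other callbacks are assumptions for this sketch, not the actual Paxos members):

    // Hypothetical sketch only; not the real Ceph Paxos code.
    #include <chrono>
    #include <functional>

    using Clock = std::chrono::steady_clock;

    struct LeaseState {
      Clock::time_point lease_expire;          // when the current lease runs out
      Clock::duration   expected_update_time;  // rough cost of one update cycle
    };

    // Called when a proposal gets queued (e.g. right after a slow compaction).
    // Instead of blindly cancelling the lease_renew timeout, check whether there
    // is still enough margin for a full update cycle; if not, renew the lease
    // immediately before starting the proposal.
    void propose_queued(const LeaseState& s,
                        const std::function<void()>& extend_lease,
                        const std::function<void()>& cancel_lease_renew_timeout,
                        const std::function<void()>& propose_pending)
    {
      if (s.lease_expire - Clock::now() < s.expected_update_time) {
        extend_lease();  // not enough time left: renew now so the peons don't expire
      }
      cancel_lease_renew_timeout();
      propose_pending();  // the normal renew still happens at the end of the cycle
    }

The point is simply that the renew-or-not decision is taken before the lease_renew timeout is cancelled, so a slow compaction can no longer leave the lease to expire silently.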

History

#1 Updated by Anonymous almost 11 years ago

  • Priority changed from Normal to High

#2 Updated by Sage Weil almost 11 years ago

  • Status changed from New to Fix Under Review

wip-5176

#3 Updated by Sage Weil almost 11 years ago

  • Status changed from Fix Under Review to 7
  • Priority changed from High to Urgent
  • Backport set to cuttlefish

Sylvain, I have a wip-5176 branch that makes us compact in a background thread, and over smaller ranges. Can you give it a try? I'm also pushing wip-5176-cuttlefish on top of the latest cuttlefish, if that's what you are running.
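
As a rough illustration of that approach (a simplified sketch only, not the code on the wip-5176 branch; the class and member names here are made up), compaction work can be queued as small key ranges and drained by a dedicated thread, so the Paxos path never has to wait on leveldb directly:

    // Simplified sketch of background compaction over small ranges.
    #include <condition_variable>
    #include <deque>
    #include <mutex>
    #include <string>
    #include <thread>
    #include <utility>

    #include "leveldb/db.h"

    class CompactionQueue {
      leveldb::DB* db_;
      std::deque<std::pair<std::string, std::string>> ranges_;  // [begin, end) keys
      std::mutex lock_;
      std::condition_variable cond_;
      bool stop_ = false;
      std::thread thread_;

     public:
      explicit CompactionQueue(leveldb::DB* db)
        : db_(db), thread_([this] { run(); }) {}

      ~CompactionQueue() {
        { std::lock_guard<std::mutex> l(lock_); stop_ = true; }
        cond_.notify_one();
        thread_.join();
      }

      // Called from the trim path: queue one small range instead of
      // compacting the whole store synchronously.
      void queue(std::string begin, std::string end) {
        std::lock_guard<std::mutex> l(lock_);
        ranges_.emplace_back(std::move(begin), std::move(end));
        cond_.notify_one();
      }

     private:
      // Runs in its own thread so compaction never blocks the monitor's
      // main (Paxos) path.
      void run() {
        std::unique_lock<std::mutex> l(lock_);
        while (!stop_) {
          if (ranges_.empty()) { cond_.wait(l); continue; }
          auto r = ranges_.front();
          ranges_.pop_front();
          l.unlock();
          leveldb::Slice b(r.first), e(r.second);
          db_->CompactRange(&b, &e);  // compact just this small range
          l.lock();
        }
      }
    };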

#4 Updated by Sage Weil almost 11 years ago

  • Assignee set to Sage Weil

#5 Updated by Sage Weil almost 11 years ago

Sylvain reports:

(09:32:12 AM) sagewk: tnt: did you get a chance to try the mon wip branch by chance?
(09:32:34 AM) tnt: sagewk: yes, it's been running for a few hours now.
(09:32:57 AM) tnt: sagewk: the space usage seems bounded, it doesn't grow out of control like if you disable compact on trim.
(09:33:43 AM) tnt: it can get a bit bigger at times than previously but it gets back down progressively in a few minutes.
(09:36:07 AM) tnt: sagewk: IO rate is lower, but not all that much. about 20-30% or so.
(09:40:31 AM) tnt: sagewk: the good news though is that I haven't had any spurious elections due to timeouts since I deployed it.
(09:42:32 AM) tnt: I'll be adding two osds to the cluster tonight like I did yesterday and see how it goes. Yesterday it yielded a lot of IO on the mon which caused various things to time out and a bunch of issues ...

#6 Updated by Sage Weil almost 11 years ago

  • Status changed from 7 to Pending Backport

#7 Updated by Sage Weil almost 11 years ago

  • Status changed from Pending Backport to Resolved

#8 Updated by Sylvain Munaut almost 11 years ago

FYI, I just upgraded from wip-5176 to 0.61.3 and those spurious elections are back.

#9 Updated by Sage Weil almost 11 years ago

  • Status changed from Resolved to Need More Info

Can you capture a debug mon = 20, debug paxos = 20, debug ms = 1 log that includes an election and send us the set of logs (for all 3 mons)?
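
For reference, those levels can be set in the [mon] section of ceph.conf, e.g.:

    [mon]
        debug mon = 20
        debug paxos = 20
        debug ms = 1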

#10 Updated by Sylvain Munaut almost 11 years ago

I can try to do this tomorrow.

But in the meantime I played with the paxos trimming values and made it go away.

At first, I tried just setting --paxos_trim_min 100 --paxos_trim_max 300.
But that didn't make it go away.
Then I tried adding --paxos_service_trim_min 250 --paxos_service_trim_max 500 and this did the trick.
Not sure if all values are needed or just some ...
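
For reference, the equivalent ceph.conf options (assuming they go in the [mon] section) would be:

    [mon]
        paxos trim min = 100
        paxos trim max = 300
        paxos service trim min = 250
        paxos service trim max = 500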

My guess would be that compacting larger blocks 'locks' the DB for longer, so you hit the same original issue as before. It's now 'async' and so doesn't block the thread itself, but AFAIK there is still a mutex inside leveldb, and if anything tries to do IO on the DB, it will block until the compaction is over.

#11 Updated by Sage Weil almost 11 years ago

  • Status changed from Need More Info to Resolved

Sylvain Munaut wrote:

I can try to do this tomorrow.

But in the meantime I played with the paxos trimming values and made it go away.

At first, I tried just setting --paxos_trim_min 100 --paxos_trim_max 300.
But that didn't make it go away.
Then I tried adding --paxos_service_trim_min 250 --paxos_service_trim_max 500 and this did the trick.
Not sure if all values are needed or just some ...

My guess would be that compacting larger blocks 'locks' the DB for longer, so you hit the same original issue as before. It's now 'async' and so doesn't block the thread itself, but AFAIK there is still a mutex inside leveldb, and if anything tries to do IO on the DB, it will block until the compaction is over.

Jim Schutt also reported better behavior with smaller trim intervals. I'll adjust the defaults down to 250/500. Thanks!

If you do get a chance to try this later, let us know.
