Bug #4895 (closed): leveldb: mon workload makes store.db grow without bound

Added by Sage Weil almost 11 years ago. Updated almost 11 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: leveldb
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This is confirmed to happen for mikedawson with a 32KB block size when explicit compaction is disabled (mon compact on trim = false).

He is going to upload a tarball of his (growing) store shortly.
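
For reference, a minimal sketch of the setting being described and a way to watch the store size over time (the option name is the one quoted here; the paths assume a default mon data location and are only an example):

# ceph.conf, [mon] section -- disabling explicit compaction on paxos trim,
# as described above:
#   [mon]
#       mon compact on trim = false
#
# watch the monitor store size over time:
du -sh /var/lib/ceph/mon/*/store.db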


Files

mon-log-growing.bz2 (1.12 MB), Sylvain Munaut, 05/21/2013 11:16 PM
Actions #1

Updated by Sage Weil almost 11 years ago

  • Priority changed from Urgent to High
Actions #2

Updated by Ian Colle almost 11 years ago

  • Subject changed from levedb: mon workload makes store.db grow without bound to leveldb: mon workload makes store.db grow without bound
Actions #3

Updated by Sage Weil almost 11 years ago

  • Status changed from 12 to 7

We're hoping the #4851 forking fix will address this too.. mikedawson is testing!

Actions #4

Updated by Sage Weil almost 11 years ago

  • Status changed from 7 to In Progress
  • Assignee set to Sage Weil
  • Priority changed from High to Urgent

mikedawson reports that with higher load this continues to happen even with 'mon compact on trim = true'.

Actions #5

Updated by Sage Weil almost 11 years ago

  • Status changed from In Progress to Need More Info

Please reproduce with wip-mon-trace:

- restart ceph-mon with --mon-compact-on-start
- stop ceph-mon
- verify it is initially smallish
- make copy of the store
- start ceph-mon while generating a trace (use option --mon-debug-dump-transactions)
- wait for it to grow significantly
- tar up original store + trace

so we can reproduce the failed compaction locally (a rough shell sketch of these steps follows).
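
Assuming a single local monitor with id "a", default data paths, and sysvinit-style service scripts (adjust to your own setup; only the two --mon-* options come from this ticket, everything else here is an assumption):

# 1. restart the monitor once with compaction on start, then stop it again
service ceph stop mon.a
ceph-mon -i a --mon-compact-on-start
service ceph stop mon.a
# 2. verify the store is initially smallish and keep a copy of it
du -sh /var/lib/ceph/mon/ceph-a/store.db
cp -a /var/lib/ceph/mon/ceph-a /root/mon.a.before
# 3. start the monitor again while generating a transaction trace
ceph-mon -i a --mon-debug-dump-transactions
# 4. once store.db has grown significantly, tar up the original store plus the trace
#    (the .tdump location is an assumption -- use wherever your mon writes it)
tar czf 4895-repro.tar.gz /root/mon.a.before /var/log/ceph/*.tdump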

Actions #6

Updated by Sage Weil almost 11 years ago

  • Status changed from Need More Info to In Progress
Actions #7

Updated by Sage Weil almost 11 years ago

  • Status changed from In Progress to Need More Info
Actions #8

Updated by Sage Weil almost 11 years ago

  • Priority changed from Urgent to High
Actions #9

Updated by Sylvain Munaut almost 11 years ago

I'd like to report that I'm also seeing this happening...

All the mons grew to > 9 GB overnight (from the usual ~200 MB), and this took the cluster down. Fortunately there were still a few hundred megabytes of space left on the drive, and restarting the mons brought the store back to ~200 MB (at least for 2 of the 3 mons; the third is just doing nothing anymore AFAICT), but this came really close to taking the whole prod cluster down.

I'm using 0.61.2 and I don't have any "mon compact on trim" option set, so I think it defaults to "true".
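
A quick way to double-check what the running monitor thinks that value is, assuming the default admin socket path and that this version supports the "config show" admin socket command:

# ask the running mon for its effective config and filter for the compaction options
ceph --admin-daemon /var/run/ceph/ceph-mon.a.asok config show | grep compact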

Actions #10

Updated by Nigel Williams almost 11 years ago

I'm on 0.61.2 as well and I can report that I hit this issue too. tnt suggested the MON restart, and that fixed it.

07:39 < nigwil> root@ceph0:~# du -sh /var/lib/ceph/mon
07:39 < nigwil> 4.7G /var/lib/ceph/mon

after restart:

root@ceph0:/var/log/ceph# du -sh /var/lib/ceph/mon
62M /var/lib/ceph/mon
root@ceph0:/var/log/ceph# ceph health detail
HEALTH_OK
root@ceph0:/var/log/ceph# ceph -v
ceph version 0.61.2 (fea782543a844bb277ae94d3391788b76c5bee60)

Actions #11

Updated by Joao Eduardo Luis almost 11 years ago

On IRC, Florian Wiessner has been mentioning some strange behaviors that may be related to leveldb growth/compaction and are worth investigating:

  • While compacting, monitor elections tend to be triggered.
  • His store appears to "grow like hell" when the monitors are not on a dedicated FS. Using SSDs makes the problem go away altogether.
<mrjack> yeah
<mrjack> i have a theory
<mrjack> i only see that on our own ceph cluster
<mrjack> not on the two other ceph clusters i manage
<mrjack> and there i think the problem is that the mon-directory is on a fs that has other IO going on...
<mrjack> so i get elections frequently when the mon cannot write quick enough..
<mrjack> i have it on ext3 currently.. i can see kjournald taking 99%IO with iotop...
<mrjack> i'll replace these disks with ssds in the near future... 
<mrjack> on the other installation for my customer, we use ssds for mon and journal... there the journal [sic, store.db] is not getting bigger >60MB
<mrjack> i see from that customers cluster... monitor election epochs...
<mrjack> before the ssd upgrade (three days ago) there have been about 32k elections
<mrjack> now since we decided to move mon to ssd.. there has not been a single election since then
<mrjack> ah ok there have been elections this night
<mrjack> 4 elections
<mrjack> when the cluster was heavy loaded doing backup cronjobs
<mrjack> but only 4 since ssd upgrade on 2013-05-14
<mrjack> before it was like election time every 5 minutes
Actions #12

Updated by Mike Dawson almost 11 years ago

When this state occurs, leveldb compacts on trim as expected, but the store is either too large or growing fast enough that leveldb compaction is insufficient. As Florian has observed, I imagine the situation is worse for monitors backed by storage without high IOPS or throughput, but SSDs are certainly NOT curative. I've seen this issue many times with SSDs that can do ~40K IOPS and ~250MB/s.

After your leveldb stores have grown past the point of no return, the only cure seems to be getting the monitors to compact on startup before any writes triggered by the OSDs occur. For me, this means stopping all MON and OSD processes, then starting the MONs, waiting for them to compact down to a few hundred MB, and only then starting the OSDs.

If store.db grows past some unknown size where leveldb compaction cannot keep up, it grows without bound until I intervene and shut the cluster down completely. In this state, my store.db grows at a pretty consistent ~1.5GB/hour.
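
A rough sketch of that recovery sequence, assuming sysvinit-style service scripts and a mon id of "a" (substitute your own service management and daemon ids; "mon compact on start" is the option discussed in this ticket, the rest is an assumption):

# stop all OSDs and MONs so no new writes arrive while compacting
service ceph -a stop osd
service ceph -a stop mon
# start the monitors with compaction on start
# (either set "mon compact on start = true" in the [mon] section of ceph.conf,
#  or pass it on the command line as below)
ceph-mon -i a --mon-compact-on-start
# wait until the store has shrunk to a few hundred MB...
du -sh /var/lib/ceph/mon/ceph-a/store.db
# ...then bring the OSDs back
service ceph -a start osd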

Actions #13

Updated by Sage Weil almost 11 years ago

  • Priority changed from High to Urgent

If anybody who sees this can generate a leveldb trace file (the procedure is described above), I think that will help.

The most promising theory in my mind right now is that this is actually load-related. Having the trace will let us confirm or disprove that, and even if it is load-related, hopefully we'll be able to reproduce it by hammering the disk in parallel while replaying the trace.

Thanks!

Actions #14

Updated by Mike Dawson almost 11 years ago

Sage,

This weekend my monitor leveldb store.db's started growing again. Although I don't know exactly when it started, I noticed it after I lost an OSD to a drive failure. When I responded, the cluster was re-balancing under heavy load. The monitor store.db's were ~12GB each, iirc. I let the OSDs rebalance and get back to HEALTH_OK before starting the transaction dump collection process. By that time, store.db was up to ~17GB, iirc. I let it run for ~6 hours before collecting results.

On cephdrop, you'll see a mikedawson-tdump dir with the tdump file, the store.db before, and the store.db after. The tdump was over 12GB and the store.db grew ~5GB during that time.

Afterwards, I was able to compact the stores on startup by stopping all OSDs and MONs, then starting the MONs with "mon compact on start = true". I had to restart the mons two or three times to get the three stores below 400MB each.

Actions #15

Updated by Sage Weil almost 11 years ago

Mike: it looks like your store is growing not because of leveldb but because paxos isn't trimming. Can you generate a mon log with debug mon = 20, debug paxos = 20, debug ms = 1?
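
A sketch of one way to enable that logging, either persistently in ceph.conf or at runtime on a running monitor; the injectargs form and the mon id are assumptions about the particular setup:

# persistent: add to the [mon] section of ceph.conf and restart the mon
#   [mon]
#       debug mon = 20
#       debug paxos = 20
#       debug ms = 1
#
# or inject into a running monitor:
ceph tell mon.a injectargs '--debug-mon 20 --debug-paxos 20 --debug-ms 1'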

Actions #16

Updated by Sylvain Munaut almost 11 years ago

@Sage Weil: Looks like you're right.

My mon started growing again this morning (pretty much at 6am on the dot, the time of the log rotate... I wonder if it's just a coincidence).

I collected the log (attached), and you can see that first_commit is stuck at 28232852 (so there are nearly 12k commits being stored).

Actions #17

Updated by Sylvain Munaut almost 11 years ago

Never mind the comment about the log rotate; I got confused between UTC and local time...

Actions #18

Updated by Sylvain Munaut almost 11 years ago

Just a quick update for those following the bug and not on IRC:

joao found out that when there is an election while a trim is pending, the monitor erroneously discards all pending proposals without finishing them, so the 'going_to_trim' flag stays stuck at true and it never submits another trim again.

Actions #19

Updated by Sage Weil almost 11 years ago

  • Status changed from Need More Info to 7

wip-4895-cuttlefish has a backport of the proposed fix. Anyone experiencing growth, please test!

Actions #20

Updated by Sylvain Munaut almost 11 years ago

I've been testing this for the last 5 days and I haven't seen any uncontrolled/fast growth of the mon store like I used to see. (For reference, the longest I've ever gone on 0.61 without seeing this before this patch was ~48 hours.)

Actions #21

Updated by Sage Weil almost 11 years ago

  • Status changed from 7 to Resolved

Awesome. Sylvain, can you try setting 'mon compact on trim = false' and seeing if it continues to not grow? The original problem is hopefully the paxos trim issue and not an error in leveldb's compaction logic...
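
For anyone repeating this test, a minimal sketch, assuming a mon id of "a" and default paths; the injectargs form is an assumption, and setting the option in the [mon] section of ceph.conf followed by a mon restart works just as well:

# turn off explicit compaction on trim on a running monitor...
ceph tell mon.a injectargs '--mon-compact-on-trim=false'
# ...and keep an eye on the store size for a while (every 10 minutes here)
while true; do date; du -sh /var/lib/ceph/mon/ceph-a/store.db; sleep 600; done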

Actions #22

Updated by Sylvain Munaut almost 11 years ago

I just disabled compact-on-trim, and it doesn't look good :( It grew about 1GB in 2 hours.
On the plus side, there is much less IO.

Actions #23

Updated by Sylvain Munaut almost 11 years ago

See https://code.google.com/p/leveldb/issues/detail?id=158 and the discussion https://groups.google.com/forum/#!msg/leveldb/yL6h1mAOc20/vLU64RylIdMJ

Basically, deleted entries that sit in "unused" zones of the keyspace will never get compacted.
