rgw - Bug #19446: rgw: heavy memory leak when multisite sync fail on 10.2.6
https://tracker.ceph.com/issues/19446

2017-04-02 15:25 - Haroboro Ha (original report):
runtime environment: Ubuntu 14.04
radosgw version: 10.2.6-1trusty

2017-04-03 09:40 - Nathan Cutler:
- Project changed from Ceph to rgw
- Category deleted (22)
- Target version deleted (v10.2.7)
- Release set to jewel
- Affected Versions v10.2.6 added

2017-04-05 15:29 - Casey Bodley:
- Assignee set to Casey Bodley

2017-04-15 17:35 - Gabriel Wicke:
I am seeing a leak with active-active multisite operation on 10.2.7 as well, and some of the log entries match the ones reported by Haroboro:
<pre>
2017-04-14 23:52:59.644911 7f3603fd6700 0 ERROR: a sync operation returned error
2017-04-14 23:52:59.644937 7f3603fd6700 0 ERROR: a sync operation returned error
2017-04-14 23:52:59.644948 7f3603fd6700 0 ERROR: a sync operation returned error
2017-04-14 23:52:59.644960 7f3603fd6700 0 ERROR: failure in sync, backing out (sync_status=-2)
2017-04-14 23:52:59.651205 7f3603fd6700 0 WARNING: skipping data log entry for missing bucket 72f6b176c6ff44e99eb93ee11d2d5bf6_6:6924a779-a3f3-4e8c-a5e3-666cd5f0d97c.34099.12:36
</pre>
During heavy load, the "ERROR: a sync operation returned error" entries are very frequent, at a rate of a few dozen per second. This correlates with fairly rapid memory growth of the radosgw process.
In contrast with Haroboro's report, I have no errors related to RGWBackoffControlCR.
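A minimal sketch of the kind of shell loop that can track the error rate and radosgw RSS together (the log path and the process name are assumptions; adjust per deployment):

<pre>
# Sample the cumulative sync-error count and total radosgw RSS once a minute.
LOG=/var/log/ceph/ceph-client.rgw.log   # assumed log path
while sleep 60; do
    errs=$(grep -c 'ERROR: a sync operation returned error' "$LOG")
    rss_kb=$(ps -o rss= -C radosgw | awk '{s+=$1} END {print s}')
    echo "$(date -u '+%F %T') errors=$errs rss_kb=$rss_kb"
done
</pre>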
2017-04-15 17:42 - Gabriel Wicke:

- File rgw-error-warn.log added

Here are the last 10k ERROR|WARN lines of the rgw log on one of the nodes.
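For reference, an extract like that can be produced along these lines (a sketch; the log path is an assumption):

<pre>
grep -E 'ERROR|WARN' /var/log/ceph/ceph-client.rgw.log | tail -n 10000 > rgw-error-warn.log
</pre>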
2017-04-17 19:36 - Gabriel Wicke:

To rule out libcurl issues, I upgraded to 7.52.1-4 from Debian sid. This did not resolve the issue.
The vast majority of the heap growth seemed to happen while processing replicated operations for a bucket that had just been deleted by swift-bench. This would correlate with the "WARNING: skipping data log entry for missing bucket" and "ERROR: failure in sync, backing out (sync_status=-2)" log entries.
2017-04-17 21:17 - Gabriel Wicke:

- File debug_rgw_deletes.log.xz added
- File debug_rgw_post_bucket_drop.log.xz added
- File debug_rgw_puts.log.xz added

Some more detail on the test.

swift-bench.conf:
<pre>
[bench]
auth = http://ceph2-1/auth/v1.0
user = benchmark:swift
key = XXXXXX
auth_version = 1.0
</pre>
Benchmark loop:
<pre>
while true; do swift-bench -c 32 -C1 -l 1000 -s 200000 -n 10000 -g 100 ./swift-bench.conf; done
</pre>
The benchmark is run against the radosgw instance in the secondary (non-master) zone; the logs are from the radosgw instance in the master zone. Swapping the roles did not seem to make a difference. Both the master and the secondary radosgw instances grow beyond 1 GB RSS.
2017-04-17 21:52 - Gabriel Wicke:

- File debug_20_rgw_deletes.log.xz added
- File debug_20_rgw_post_bucket_delete_puts.log.xz added

It looks like the dynamic log level change didn't work the previous time, so here are actual level 20 log snippets.
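For the record, the rgw debug level can be raised at runtime through the daemon's admin socket, along these lines (a sketch; the socket name is an assumption based on the host naming above):

<pre>
# Raise rgw logging to 20 on a running daemon (assumed socket name):
ceph daemon /var/run/ceph/ceph-client.rgw.ceph2-1.asok config set debug_rgw 20/20

# Confirm the change actually took effect before capturing logs:
ceph daemon /var/run/ceph/ceph-client.rgw.ceph2-1.asok config get debug_rgw
</pre>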
2017-04-20 00:50 - Gabriel Wicke:

I did a bit more testing with valgrind. While memcheck didn't yield anything consistent, a massif run over ~30 minutes showed this allocation chain using over 40% of memory:
<p>->40.68% (40,051,440B) 0x51E7C89: RGWDataSyncShardCR::incremental_sync() (in /usr/lib/librgw.so.2.0.0)
| ->40.68% (40,051,440B) 0x51EA08B: RGWDataSyncShardCR::operate() (in /usr/lib/librgw.so.2.0.0)
| ->40.68% (40,051,440B) 0x50F89FC: RGWCoroutinesStack::operate(RGWCoroutinesEnv*) (in /usr/lib/librgw.so.2.0.0)
| ->40.68% (40,051,440B) 0x50FBE75: RGWCoroutinesManager::run(std::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&) (in /usr/lib/librgw.so.2.0.0)
| ->40.68% (40,051,440B) 0x50FDDCC: RGWCoroutinesManager::run(RGWCoroutine*) (in /usr/lib/librgw.so.2.0.0)
| ->40.68% (40,051,440B) 0x51C754F: RGWRemoteDataLog::run_sync(int, rgw_data_sync_status&) (in /usr/lib/librgw.so.2.0.0)
| ->40.68% (40,051,440B) 0x529FD47: RGWDataSyncProcessorThread::process() (in /usr/lib/librgw.so.2.0.0)
| ->40.68% (40,051,440B) 0x523C0F4: RGWRadosThread::Worker::entry() (in /usr/lib/librgw.so.2.0.0)
| ->40.68% (40,051,440B) 0xF1A7492: start_thread (pthread_create.c:333)
| ->40.68% (40,051,440B) 0xFF3F93D: clone (clone.S:97)</p> rgw - Bug #19446: rgw: heavy memory leak when multisite sync fail on 10.2.6https://tracker.ceph.com/issues/19446?journal_id=898902017-04-20T00:51:23ZGabriel Wicke
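For anyone reproducing this, a massif run against radosgw would look roughly like the following (a sketch; the instance name and arguments are assumptions, not the exact invocation used here):

<pre>
# Run the gateway in the foreground under massif; stop it after the test window.
# Assumed instance name: client.rgw.ceph2-1
valgrind --tool=massif --time-unit=ms \
    /usr/bin/radosgw -f --cluster ceph --name client.rgw.ceph2-1

# Render the snapshot tree from the output file valgrind leaves behind:
ms_print massif.out.<pid>
</pre>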
2017-04-20 03:13 - Gabriel Wicke:

- File massif-3.txt added

Complete output from a ~2-hour massif run is attached.
The top allocation groups are all the same, with similar ratios between them. Overall, it looks like sync coroutines or threads are not collected/freed in a timely manner during continuous operation. They are freed on shutdown, which is why the memcheck results were inconclusive.
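One way to inspect still-reachable allocations mid-run, before shutdown frees them, is memcheck's gdbserver monitor interface; a sketch, under the same instance-name assumption as above:

<pre>
# Start the gateway under memcheck with the embedded gdbserver enabled:
valgrind --tool=memcheck --vgdb=yes /usr/bin/radosgw -f --name client.rgw.ceph2-1

# From another shell, while the process is running and its RSS is growing,
# report all currently reachable allocations without shutting the daemon down:
vgdb leak_check full reachable any
</pre>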
2017-04-21 18:54 - Casey Bodley:

Testing https://github.com/ceph/ceph/pull/14714 as a potential fix.
2017-05-04 18:18 - Casey Bodley:

- Related to Bug #19861: multisite: memory leak on failed lease in RGWDataSyncShardCR (https://tracker.ceph.com/issues/19861) added
2017-05-04 20:20 - Nathan Cutler:

- Backport set to jewel, kraken
2017-05-04 20:34 - Nathan Cutler:

If this ends up being fixed by https://github.com/ceph/ceph/pull/14714, that is the same fix as for #19861; in that case, only one of the two tracker issues needs to have the backport field populated.
2017-05-05 13:38 - Casey Bodley:

Nathan Cutler wrote:

> If this ends up being fixed by https://github.com/ceph/ceph/pull/14714, that is the same fix as for #19861; in that case, only one of the two tracker issues needs to have the backport field populated.

Right, thanks. We haven't confirmed that https://github.com/ceph/ceph/pull/14714 addresses all of the leaks, so I'm not comfortable closing this one; that's why I added the separate backport ticket.
2017-06-06 18:12 - Yehuda Sadeh:

- Subject changed from "heavy memory leak when multisite sync fail on 10.2.6" to "rgw: heavy memory leak when multisite sync fail on 10.2.6"
2017-06-22 17:57 - Orit Wasserman:

- Status changed from New to Resolved