Bug #19446

rgw: heavy memory leak when multisite sync fail on 10.2.6

Added by Haroboro Ha 9 months ago. Updated 6 months ago.

Status:
Resolved
Priority:
High
Assignee:
Target version:
-
Start date:
04/02/2017
Due date:
% Done:

0%

Source:
Tags:
Backport:
jewel, kraken
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
jewel
Needs Doc:
No

Description

A heavy memory leak was found when site sync failed in an active-active multisite cluster (2 sites). In about 2 hours, radosgw consumed about 60 GB of memory.

Each time the following log entries occurred, the radosgw RSS grew further.

2017-04-02 22:56:10.621369 7fa580fe1700 0 ERROR: a sync operation returned error
2017-04-02 22:56:14.772360 7fa580fe1700 0 ERROR: failure in sync, backing out (sync_status=-5)
2017-04-02 22:56:14.782136 7fa580fe1700 0 ERROR: failed to log sync failure in error repo: retcode=0
2017-04-02 22:56:14.782207 7fa580fe1700 0 rgw meta sync: ERROR: RGWBackoffControlCR called coroutine returned -5
2017-04-02 22:57:53.071968 7fa580fe1700 0 ERROR: failed to read remote data log info: ret=-5
2017-04-02 22:57:53.165198 7fa580fe1700 0 rgw meta sync: ERROR: RGWBackoffControlCR called coroutine returned -5
2017-04-02 22:58:44.120891 7fa580fe1700 0 ERROR: failed to read remote data log info: ret=-5
2017-04-02 22:58:44.124339 7fa580fe1700 0 rgw meta sync: ERROR: RGWBackoffControlCR called coroutine returned -5
2017-04-02 22:59:25.327494 7fa580fe1700 0 ERROR: failed to read remote data log info: ret=-5
2017-04-02 22:59:25.406766 7fa580fe1700 0 rgw meta sync: ERROR: RGWBackoffControlCR called coroutine returned -5
2017-04-02 23:00:35.433212 7fa580fe1700 0 ERROR: failed to fetch remote data log info: ret=-5
2017-04-02 23:00:35.442030 7fa580fe1700 0 rgw meta sync: ERROR: RGWBackoffControlCR called coroutine returned -5
2017-04-02 23:00:55.583137 7fa580fe1700 0 ERROR: failed to read remote data log info: ret=-5
2017-04-02 23:00:55.587294 7fa580fe1700 0 rgw meta sync: ERROR: RGWBackoffControlCR called coroutine returned -5
2017-04-02 23:02:12.050433 7fa58bff7700 0 store->fetch_remote_obj() returned r=-2
2017-04-02 23:02:12.050548 7fa580fe1700 0 ERROR: a sync operation returned error
2017-04-02 23:02:12.052318 7fa584fe9700 0 store->fetch_remote_obj() returned r=-2
2017-04-02 23:02:12.052447 7fa580fe1700 0 ERROR: a sync operation returned error
2017-04-02 23:02:12.054184 7fa5977fe700 0 store->fetch_remote_obj() returned r=-2
2017-04-02 23:02:12.054292 7fa580fe1700 0 ERROR: a sync operation returned error
2017-04-02 23:02:12.056197 7fa58cff9700 0 store->fetch_remote_obj() returned r=-2
2017-04-02 23:02:12.056307 7fa580fe1700 0 ERROR: a sync operation returned error
2017-04-02 23:02:12.064099 7fa5897f2700 0 store->fetch_remote_obj() returned r=-2
2017-04-02 23:02:12.064239 7fa580fe1700 0 ERROR: a sync operation returned error
2017-04-02 23:02:13.126223 7fa580fe1700 0 ERROR: failure in sync, backing out (sync_status=-2)
2017-04-02 23:02:13.233178 7fa580fe1700 0 WARNING: skipping data log entry for missing bucket aspen:1d0e03f4-f7fc-4ee6-a956-b66483526e3d.4741.4
2017-04-02 23:02:27.769720 7fa5887f0700 0 store->fetch_remote_obj() returned r=-2
2017-04-02 23:02:27.772560 7fa58ffff700 0 store->fetch_remote_obj() returned r=-2
2017-04-02 23:02:27.788009 7fa580fe1700 0 ERROR: a sync operation returned error
2017-04-02 23:02:27.788023 7fa580fe1700 0 ERROR: a sync operation returned error
2017-04-02 23:02:27.963422 7fa580fe1700 0 ERROR: failure in sync, backing out (sync_status=-2)
2017-04-02 23:02:28.093612 7fa580fe1700 0 WARNING: skipping data log entry for missing bucket aspen:1d0e03f4-f7fc-4ee6-a956-b66483526e3d.4741.4

rgw-error-warn.log View - ERROR|WARN in ceph-client.rgw.ceph-1.log (842 KB) Gabriel Wicke, 04/15/2017 05:41 PM

debug_rgw_deletes.log.xz - radosgw log in swift-bench delete phase (39.3 KB) Gabriel Wicke, 04/17/2017 09:12 PM

debug_rgw_post_bucket_drop.log.xz - radosgw log after deletion of previous bucket, with swift-bench pushing new puts for new bucket (88.8 KB) Gabriel Wicke, 04/17/2017 09:13 PM

debug_rgw_puts.log.xz - radosgw log in swift-bench put phase (55.6 KB) Gabriel Wicke, 04/17/2017 09:13 PM

debug_20_rgw_deletes.log.xz - 10 second snapshot of debug 20 log during deletions (333 KB) Gabriel Wicke, 04/17/2017 09:50 PM

debug_20_rgw_post_bucket_delete_puts.log.xz - 10 second snapshot of debug 20 log during puts, after previous bucket was deleted (475 KB) Gabriel Wicke, 04/17/2017 09:50 PM

massif-3.txt View (461 KB) Gabriel Wicke, 04/20/2017 03:07 AM


Related issues

Related to rgw - Bug #19861: multisite: memory leak on failed lease in RGWDataSyncShardCR Resolved 05/04/2017

History

#1 Updated by Haroboro Ha 9 months ago

runtime environment: ubuntu 14.04
radosgw version: 10.2.6-1trusty

#2 Updated by Nathan Cutler 9 months ago

  • Project changed from Ceph to rgw
  • Category deleted (radosgw)
  • Target version deleted (v10.2.7)
  • Affected Versions v10.2.6 added
  • Release jewel added

#3 Updated by Casey Bodley 8 months ago

  • Assignee set to Casey Bodley

#4 Updated by Gabriel Wicke 8 months ago

I am seeing a leak with active-active multisite operation on 10.2.7 as well, and some of the log entries match the ones reported by Haroboro:

2017-04-14 23:52:59.644911 7f3603fd6700 0 ERROR: a sync operation returned error
2017-04-14 23:52:59.644937 7f3603fd6700 0 ERROR: a sync operation returned error
2017-04-14 23:52:59.644948 7f3603fd6700 0 ERROR: a sync operation returned error
2017-04-14 23:52:59.644960 7f3603fd6700 0 ERROR: failure in sync, backing out (sync_status=-2)
2017-04-14 23:52:59.651205 7f3603fd6700 0 WARNING: skipping data log entry for missing bucket 72f6b176c6ff44e99eb93ee11d2d5bf6_6:6924a779-a3f3-4e8c-a5e3-666cd5f0d97c.34099.12:36

During heavy load, the "ERROR: a sync operation returned error" entries are very frequent, at a rate of a few dozen per second. This correlates with fairly rapid memory growth of the radosgw process.

In contrast with Haroboro's report, I have no errors related to RGWBackoffControlCR.

#5 Updated by Gabriel Wicke 8 months ago

Here are the last 10k ERROR|WARN lines of the rgw log on one of the nodes.

#6 Updated by Gabriel Wicke 8 months ago

To rule out libcurl issues I upgraded to 7.52.1-4 from Debian sid. This did not resolve the issue.

The vast majority of the heap growth seemed to happen while processing replicated operations for a bucket that was just deleted by swift-bench. This would correlate with the "WARNING: skipping data log entry for missing bucket" and "ERROR: failure in sync, backing out (sync_status=-2)" log entries.

#7 Updated by Gabriel Wicke 8 months ago

Some more detail on the test:
swift-bench.conf:
[bench]
auth = http://ceph2-1/auth/v1.0
user = benchmark:swift
key = XXXXXX
auth_version = 1.0

benchmark loop:
while true; do swift-bench -c 32 -C1 -l 1000 -s 200000 -n 10000 -g 100 ./swift-bench.conf; done

The benchmark is run against the radosgw instance in the secondary (non-master) zone. Logs are from the radosgw instance in the master zone. Swapping the roles did not seem to make a difference; both master and secondary radosgw instances grow beyond 1 GB RSS.
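To quantify the RSS growth during a benchmark run like this, one can sample the process's resident set size over time. A minimal sketch (Linux-only, reading /proc; the pid, interval, and sample count are illustrative, not part of the original report):

```python
import time


def read_rss_kb(pid):
    """Return the resident set size of `pid` in kB, parsed from /proc/<pid>/status."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # value is reported in kB
    raise RuntimeError("VmRSS not found")


def watch(pid, interval=60, samples=120):
    """Print one RSS sample per interval; a steadily climbing series suggests a leak."""
    for _ in range(samples):
        print(time.strftime("%H:%M:%S"), read_rss_kb(pid), "kB")
        time.sleep(interval)
```

Running `watch(<radosgw pid>)` alongside the swift-bench loop would show whether RSS climbs monotonically while the sync errors are logged.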

#8 Updated by Gabriel Wicke 8 months ago

It looks like the dynamic log level change didn't work the previous time, so here are actual level 20 log snippets.

#9 Updated by Gabriel Wicke 8 months ago

I did a bit more testing with valgrind. While memcheck didn't yield anything consistent, a massif run over ~30 minutes showed this allocation chain using over 40% of memory:

->40.68% (40,051,440B) 0x51E7C89: RGWDataSyncShardCR::incremental_sync() (in /usr/lib/librgw.so.2.0.0) | ->40.68% (40,051,440B) 0x51EA08B: RGWDataSyncShardCR::operate() (in /usr/lib/librgw.so.2.0.0) | ->40.68% (40,051,440B) 0x50F89FC: RGWCoroutinesStack::operate(RGWCoroutinesEnv*) (in /usr/lib/librgw.so.2.0.0) | ->40.68% (40,051,440B) 0x50FBE75: RGWCoroutinesManager::run(std::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&) (in /usr/lib/librgw.so.2.0.0) | ->40.68% (40,051,440B) 0x50FDDCC: RGWCoroutinesManager::run(RGWCoroutine*) (in /usr/lib/librgw.so.2.0.0) | ->40.68% (40,051,440B) 0x51C754F: RGWRemoteDataLog::run_sync(int, rgw_data_sync_status&) (in /usr/lib/librgw.so.2.0.0) | ->40.68% (40,051,440B) 0x529FD47: RGWDataSyncProcessorThread::process() (in /usr/lib/librgw.so.2.0.0) | ->40.68% (40,051,440B) 0x523C0F4: RGWRadosThread::Worker::entry() (in /usr/lib/librgw.so.2.0.0) | ->40.68% (40,051,440B) 0xF1A7492: start_thread (pthread_create.c:333) | ->40.68% (40,051,440B) 0xFF3F93D: clone (clone.S:97)

#10 Updated by Gabriel Wicke 8 months ago

Same with hopefully better formatting:

->40.68% (40,051,440B) 0x51E7C89: RGWDataSyncShardCR::incremental_sync() (in /usr/lib/librgw.so.2.0.0)
| ->40.68% (40,051,440B) 0x51EA08B: RGWDataSyncShardCR::operate() (in /usr/lib/librgw.so.2.0.0)
|   ->40.68% (40,051,440B) 0x50F89FC: RGWCoroutinesStack::operate(RGWCoroutinesEnv*) (in /usr/lib/librgw.so.2.0.0)
|     ->40.68% (40,051,440B) 0x50FBE75: RGWCoroutinesManager::run(std::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&) (in /usr/lib/librgw.so.2.0.0)
|       ->40.68% (40,051,440B) 0x50FDDCC: RGWCoroutinesManager::run(RGWCoroutine*) (in /usr/lib/librgw.so.2.0.0)
|         ->40.68% (40,051,440B) 0x51C754F: RGWRemoteDataLog::run_sync(int, rgw_data_sync_status&) (in /usr/lib/librgw.so.2.0.0)
|           ->40.68% (40,051,440B) 0x529FD47: RGWDataSyncProcessorThread::process() (in /usr/lib/librgw.so.2.0.0)
|             ->40.68% (40,051,440B) 0x523C0F4: RGWRadosThread::Worker::entry() (in /usr/lib/librgw.so.2.0.0)
|               ->40.68% (40,051,440B) 0xF1A7492: start_thread (pthread_create.c:333)
|                 ->40.68% (40,051,440B) 0xFF3F93D: clone (clone.S:97)

#11 Updated by Gabriel Wicke 8 months ago

Complete output from a ~2 hours massif run attached.

The top allocation groups are all the same, with similar ratios between them. Overall it looks like sync coroutines or threads are not collected / freed in a timely manner during continuous operation. They are freed on shutdown, which is why memcheck results were inconclusive.
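The pattern this massif output suggests, where per-entry child work is spawned during incremental sync but only reclaimed at shutdown, can be illustrated abstractly. All names below are hypothetical stand-ins, not RGW's actual classes; the point is only that an error path which backs out without draining spawned work leaks it across repeated sync attempts:

```python
class Stack:
    """Stand-in for a spawned child coroutine stack."""
    pass


class Manager:
    """Toy manager: spawned stacks stay registered until explicitly drained."""
    def __init__(self):
        self.live = []

    def spawn(self):
        self.live.append(Stack())

    def drain_all(self):
        self.live.clear()


def sync_shard_leaky(mgr, entries):
    """Backs out on error without draining, so mgr.live grows per attempt."""
    for _ in entries:
        mgr.spawn()
        return -2  # simulated "failure in sync, backing out"
    mgr.drain_all()
    return 0


def sync_shard_drained(mgr, entries):
    """Drains spawned stacks on every exit path, including the error path."""
    try:
        for _ in entries:
            mgr.spawn()
            return -2  # same simulated error
    finally:
        mgr.drain_all()
```

Under repeated failing sync attempts, the leaky variant accumulates one live stack per call while the drained variant returns to zero, matching the observation that memory is only reclaimed on shutdown.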

#12 Updated by Casey Bodley 8 months ago

testing https://github.com/ceph/ceph/pull/14714 as a potential fix

#13 Updated by Casey Bodley 7 months ago

  • Related to Bug #19861: multisite: memory leak on failed lease in RGWDataSyncShardCR added

#14 Updated by Nathan Cutler 7 months ago

  • Backport set to jewel, kraken

#15 Updated by Nathan Cutler 7 months ago

If this ends up being fixed by https://github.com/ceph/ceph/pull/14714 that is the same fix as for #19861 - in that case only one of the two tracker issues needs to have the backport field populated.

#16 Updated by Casey Bodley 7 months ago

Nathan Cutler wrote:

If this ends up being fixed by https://github.com/ceph/ceph/pull/14714 that is the same fix as for #19861 - in that case only one of the two tracker issues needs to have the backport field populated.

right, thanks. we haven't confirmed that https://github.com/ceph/ceph/pull/14714 addresses all leaks, so i'm not comfortable closing this one - that's why i added the separate backport ticket

#17 Updated by Yehuda Sadeh 6 months ago

  • Subject changed from heavy memory leak when multisite sync fail on 10.2.6 to rgw: heavy memory leak when multisite sync fail on 10.2.6

#18 Updated by Orit Wasserman 6 months ago

  • Status changed from New to Resolved
