Bug #19446
closedrgw: heavy memory leak when multisite sync fail on 10.2.6
0%
Description
heavry memory leak was found when the site sync failed in an active-active multisite clusters (2 sites). In about 2 hours, it will consume about 60GB memory.
when the following log occured every time, the radosgw RSS increased again and again.
2017-04-02 22:56:10.621369 7fa580fe1700 0 ERROR: a sync operation returned error
2017-04-02 22:56:14.772360 7fa580fe1700 0 ERROR: failure in sync, backing out (sync_status=-5)
2017-04-02 22:56:14.782136 7fa580fe1700 0 ERROR: failed to log sync failure in error repo: retcode=0
2017-04-02 22:56:14.782207 7fa580fe1700 0 rgw meta sync: ERROR: RGWBackoffControlCR called coroutine returned 5>fetch_remote_obj() returned r=-2
2017-04-02 22:57:53.071968 7fa580fe1700 0 ERROR: failed to read remote data log info: ret=-5
2017-04-02 22:57:53.165198 7fa580fe1700 0 rgw meta sync: ERROR: RGWBackoffControlCR called coroutine returned -5
2017-04-02 22:58:44.120891 7fa580fe1700 0 ERROR: failed to read remote data log info: ret=-5
2017-04-02 22:58:44.124339 7fa580fe1700 0 rgw meta sync: ERROR: RGWBackoffControlCR called coroutine returned -5
2017-04-02 22:59:25.327494 7fa580fe1700 0 ERROR: failed to read remote data log info: ret=-5
2017-04-02 22:59:25.406766 7fa580fe1700 0 rgw meta sync: ERROR: RGWBackoffControlCR called coroutine returned -5
2017-04-02 23:00:35.433212 7fa580fe1700 0 ERROR: failed to fetch remote data log info: ret=-5
2017-04-02 23:00:35.442030 7fa580fe1700 0 rgw meta sync: ERROR: RGWBackoffControlCR called coroutine returned -5
2017-04-02 23:00:55.583137 7fa580fe1700 0 ERROR: failed to read remote data log info: ret=-5
2017-04-02 23:00:55.587294 7fa580fe1700 0 rgw meta sync: ERROR: RGWBackoffControlCR called coroutine returned -5
2017-04-02 23:02:12.050433 7fa58bff7700 0 store
2017-04-02 23:02:12.050548 7fa580fe1700 0 ERROR: a sync operation returned error
2017-04-02 23:02:12.052318 7fa584fe9700 0 store->fetch_remote_obj() returned r=-2
2017-04-02 23:02:12.052447 7fa580fe1700 0 ERROR: a sync operation returned error
2017-04-02 23:02:12.054184 7fa5977fe700 0 store->fetch_remote_obj() returned r=-2
2017-04-02 23:02:12.054292 7fa580fe1700 0 ERROR: a sync operation returned error
2017-04-02 23:02:12.056197 7fa58cff9700 0 store->fetch_remote_obj() returned r=-2
2017-04-02 23:02:12.056307 7fa580fe1700 0 ERROR: a sync operation returned error
2017-04-02 23:02:12.064099 7fa5897f2700 0 store->fetch_remote_obj() returned r=-2
2017-04-02 23:02:12.064239 7fa580fe1700 0 ERROR: a sync operation returned error
2017-04-02 23:02:13.126223 7fa580fe1700 0 ERROR: failure in sync, backing out (sync_status=-2)
2017-04-02 23:02:13.233178 7fa580fe1700 0 WARNING: skipping data log entry for missing bucket aspen:1d0e03f4-f7fc-4ee6-a956-b66483526e3d.4741.4
2017-04-02 23:02:27.769720 7fa5887f0700 0 store->fetch_remote_obj() returned r=-2
2017-04-02 23:02:27.772560 7fa58ffff700 0 store->fetch_remote_obj() returned r=-2
2017-04-02 23:02:27.788009 7fa580fe1700 0 ERROR: a sync operation returned error
2017-04-02 23:02:27.788023 7fa580fe1700 0 ERROR: a sync operation returned error
2017-04-02 23:02:27.963422 7fa580fe1700 0 ERROR: failure in sync, backing out (sync_status=-2)
2017-04-02 23:02:28.093612 7fa580fe1700 0 WARNING: skipping data log entry for missing bucket aspen:1d0e03f4-f7fc-4ee6-a956-b66483526e3d.4741.4
Files
Updated by Haroboro Ha about 7 years ago
runtime environment: ubuntu 14.04
radosgw version: 10.2.6-1trusty
Updated by Nathan Cutler about 7 years ago
- Project changed from Ceph to rgw
- Category deleted (
22) - Target version deleted (
v10.2.7) - Release set to jewel
- Affected Versions v10.2.6 added
Updated by Gabriel Wicke about 7 years ago
I am seeing a leak with active-active multi site operation on 10.2.7 as well, and some of the log entries match the ones reported by Haroboro:
2017-04-14 23:52:59.644911 7f3603fd6700 0 ERROR: a sync operation returned error
2017-04-14 23:52:59.644937 7f3603fd6700 0 ERROR: a sync operation returned error
2017-04-14 23:52:59.644948 7f3603fd6700 0 ERROR: a sync operation returned error
2017-04-14 23:52:59.644960 7f3603fd6700 0 ERROR: failure in sync, backing out (sync_status=-2)
2017-04-14 23:52:59.651205 7f3603fd6700 0 WARNING: skipping data log entry for missing bucket 72f6b176c6ff44e99eb93ee11d2d5bf6_6:6924a779-a3f3-4e8c-a5e3-666cd5f0d97c.34099.12:36
During heavy load, the "ERROR: a sync operation returned error" entries are very frequent, at a rate of a few dozen per second. This correlates with fairly rapid memory growth of the radosgw process.
In contrast with Haroboro's report, I have no errors related to RGWBackoffControlCR.
Updated by Gabriel Wicke about 7 years ago
- File rgw-error-warn.log rgw-error-warn.log added
Here are the last 10k ERROR|WARN lines of the rgw log on one of the nodes.
Updated by Gabriel Wicke about 7 years ago
To rule out libcurl issues I upgraded to 7.52.1-4 from Debian sid. This did not resolve the issue.
The vast majority of the heap growth seemed to happen while processing replicated operations for a bucket that was just deleted by swift-bench. This would correlate with the "WARNING: skipping data log entry for missing bucket" and "ERROR: failure in sync, backing out (sync_status=-2)" log entries,
Updated by Gabriel Wicke about 7 years ago
- File debug_rgw_deletes.log.xz debug_rgw_deletes.log.xz added
- File debug_rgw_post_bucket_drop.log.xz debug_rgw_post_bucket_drop.log.xz added
- File debug_rgw_puts.log.xz debug_rgw_puts.log.xz added
Some more detail on the test:
swift-bench.conf:
[bench]
auth = http://ceph2-1/auth/v1.0
user = benchmark:swift
key = XXXXXX
auth_version = 1.0
benchmark loop:
while true; do swift-bench -c 32 -C1 -l 1000 -s 200000 -n 10000 -g 100 ./swift-bench.conf; done
Benchmark is run against radosgw instance on secondary (non-master) zone. Logs are from radosgw instance on master zone. Swapping the roles did not seem to make a difference. Both master and slave radosgw instances are growing in RSS beyond 1G.
Updated by Gabriel Wicke about 7 years ago
- File debug_20_rgw_deletes.log.xz debug_20_rgw_deletes.log.xz added
- File debug_20_rgw_post_bucket_delete_puts.log.xz debug_20_rgw_post_bucket_delete_puts.log.xz added
It looks like the dynamic log level change didn't work the previous time, so here are actual level 20 log snippets.
Updated by Gabriel Wicke about 7 years ago
I did a bit more testing with valgrind. While memcheck didn't yield anything consistent, a massif run over ~30 minutes showed this allocation chain using over 40% of memory:
->40.68% (40,051,440B) 0x51E7C89: RGWDataSyncShardCR::incremental_sync() (in /usr/lib/librgw.so.2.0.0) | ->40.68% (40,051,440B) 0x51EA08B: RGWDataSyncShardCR::operate() (in /usr/lib/librgw.so.2.0.0) | ->40.68% (40,051,440B) 0x50F89FC: RGWCoroutinesStack::operate(RGWCoroutinesEnv*) (in /usr/lib/librgw.so.2.0.0) | ->40.68% (40,051,440B) 0x50FBE75: RGWCoroutinesManager::run(std::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&) (in /usr/lib/librgw.so.2.0.0) | ->40.68% (40,051,440B) 0x50FDDCC: RGWCoroutinesManager::run(RGWCoroutine*) (in /usr/lib/librgw.so.2.0.0) | ->40.68% (40,051,440B) 0x51C754F: RGWRemoteDataLog::run_sync(int, rgw_data_sync_status&) (in /usr/lib/librgw.so.2.0.0) | ->40.68% (40,051,440B) 0x529FD47: RGWDataSyncProcessorThread::process() (in /usr/lib/librgw.so.2.0.0) | ->40.68% (40,051,440B) 0x523C0F4: RGWRadosThread::Worker::entry() (in /usr/lib/librgw.so.2.0.0) | ->40.68% (40,051,440B) 0xF1A7492: start_thread (pthread_create.c:333) | ->40.68% (40,051,440B) 0xFF3F93D: clone (clone.S:97)
Updated by Gabriel Wicke about 7 years ago
Same with hopefully better formatting:
->40.68% (40,051,440B) 0x51E7C89: RGWDataSyncShardCR::incremental_sync() (in /usr/lib/librgw.so.2.0.0) | ->40.68% (40,051,440B) 0x51EA08B: RGWDataSyncShardCR::operate() (in /usr/lib/librgw.so.2.0.0) | ->40.68% (40,051,440B) 0x50F89FC: RGWCoroutinesStack::operate(RGWCoroutinesEnv*) (in /usr/lib/librgw.so.2.0.0) | ->40.68% (40,051,440B) 0x50FBE75: RGWCoroutinesManager::run(std::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&) (in /usr/lib/librgw.so.2.0.0) | ->40.68% (40,051,440B) 0x50FDDCC: RGWCoroutinesManager::run(RGWCoroutine*) (in /usr/lib/librgw.so.2.0.0) | ->40.68% (40,051,440B) 0x51C754F: RGWRemoteDataLog::run_sync(int, rgw_data_sync_status&) (in /usr/lib/librgw.so.2.0.0) | ->40.68% (40,051,440B) 0x529FD47: RGWDataSyncProcessorThread::process() (in /usr/lib/librgw.so.2.0.0) | ->40.68% (40,051,440B) 0x523C0F4: RGWRadosThread::Worker::entry() (in /usr/lib/librgw.so.2.0.0) | ->40.68% (40,051,440B) 0xF1A7492: start_thread (pthread_create.c:333) | ->40.68% (40,051,440B) 0xFF3F93D: clone (clone.S:97)
Updated by Gabriel Wicke about 7 years ago
- File massif-3.txt massif-3.txt added
Complete output from a ~2 hours massif run attached.
The top allocation groups are all the same, with similar ratios between them. Overall it looks like sync coroutines or threads are not collected / freed in a timely manner during continuous operation. They are freed on shutdown, which is why memcheck results were inconclusive.
Updated by Casey Bodley about 7 years ago
testing https://github.com/ceph/ceph/pull/14714 as a potential fix
Updated by Casey Bodley almost 7 years ago
- Related to Bug #19861: multisite: memory leak on failed lease in RGWDataSyncShardCR added
Updated by Nathan Cutler almost 7 years ago
If this ends up being fixed by https://github.com/ceph/ceph/pull/14714 that is the same fix as for #19861 - in that case only one of the two tracker issues needs to have the backport field populated.
Updated by Casey Bodley almost 7 years ago
Nathan Cutler wrote:
If this ends up being fixed by https://github.com/ceph/ceph/pull/14714 that is the same fix as for #19861 - in that case only one of the two tracker issues needs to have the backport field populated.
right, thanks. we haven't confirmed that https://github.com/ceph/ceph/pull/14714 addresses all leaks, so i'm not comfortable closing this one - that's why i added the separate backport ticket
Updated by Yehuda Sadeh almost 7 years ago
- Subject changed from heavy memory leak when multisite sync fail on 10.2.6 to rgw: heavy memory leak when multisite sync fail on 10.2.6
Updated by Orit Wasserman almost 7 years ago
- Status changed from New to Resolved