Bug #19446

rgw: heavy memory leak when multisite sync fail on 10.2.6

Added by Haroboro Ha 9 months ago. Updated 6 months ago.

Status:
Resolved
Priority:
High
Assignee:
Target version:
-
Start date:
04/02/2017
Due date:
% Done:

0%

Source:
Tags:
Backport:
jewel, kraken
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
jewel
Needs Doc:
No

Description

A heavy memory leak was found when site sync failed in an active-active multisite cluster (2 sites). In about 2 hours, radosgw consumed about 60 GB of memory.

Each time the following log entries occurred, the radosgw RSS grew further.

2017-04-02 22:56:10.621369 7fa580fe1700 0 ERROR: a sync operation returned error
2017-04-02 22:56:14.772360 7fa580fe1700 0 ERROR: failure in sync, backing out (sync_status=-5)
2017-04-02 22:56:14.782136 7fa580fe1700 0 ERROR: failed to log sync failure in error repo: retcode=0
2017-04-02 22:56:14.782207 7fa580fe1700 0 rgw meta sync: ERROR: RGWBackoffControlCR called coroutine returned -5
2017-04-02 22:57:53.071968 7fa580fe1700 0 ERROR: failed to read remote data log info: ret=-5
2017-04-02 22:57:53.165198 7fa580fe1700 0 rgw meta sync: ERROR: RGWBackoffControlCR called coroutine returned -5
2017-04-02 22:58:44.120891 7fa580fe1700 0 ERROR: failed to read remote data log info: ret=-5
2017-04-02 22:58:44.124339 7fa580fe1700 0 rgw meta sync: ERROR: RGWBackoffControlCR called coroutine returned -5
2017-04-02 22:59:25.327494 7fa580fe1700 0 ERROR: failed to read remote data log info: ret=-5
2017-04-02 22:59:25.406766 7fa580fe1700 0 rgw meta sync: ERROR: RGWBackoffControlCR called coroutine returned -5
2017-04-02 23:00:35.433212 7fa580fe1700 0 ERROR: failed to fetch remote data log info: ret=-5
2017-04-02 23:00:35.442030 7fa580fe1700 0 rgw meta sync: ERROR: RGWBackoffControlCR called coroutine returned -5
2017-04-02 23:00:55.583137 7fa580fe1700 0 ERROR: failed to read remote data log info: ret=-5
2017-04-02 23:00:55.587294 7fa580fe1700 0 rgw meta sync: ERROR: RGWBackoffControlCR called coroutine returned -5
2017-04-02 23:02:12.050433 7fa58bff7700 0 store->fetch_remote_obj() returned r=-2
2017-04-02 23:02:12.050548 7fa580fe1700 0 ERROR: a sync operation returned error
2017-04-02 23:02:12.052318 7fa584fe9700 0 store->fetch_remote_obj() returned r=-2
2017-04-02 23:02:12.052447 7fa580fe1700 0 ERROR: a sync operation returned error
2017-04-02 23:02:12.054184 7fa5977fe700 0 store->fetch_remote_obj() returned r=-2
2017-04-02 23:02:12.054292 7fa580fe1700 0 ERROR: a sync operation returned error
2017-04-02 23:02:12.056197 7fa58cff9700 0 store->fetch_remote_obj() returned r=-2
2017-04-02 23:02:12.056307 7fa580fe1700 0 ERROR: a sync operation returned error
2017-04-02 23:02:12.064099 7fa5897f2700 0 store->fetch_remote_obj() returned r=-2
2017-04-02 23:02:12.064239 7fa580fe1700 0 ERROR: a sync operation returned error
2017-04-02 23:02:13.126223 7fa580fe1700 0 ERROR: failure in sync, backing out (sync_status=-2)
2017-04-02 23:02:13.233178 7fa580fe1700 0 WARNING: skipping data log entry for missing bucket aspen:1d0e03f4-f7fc-4ee6-a956-b66483526e3d.4741.4
2017-04-02 23:02:27.769720 7fa5887f0700 0 store->fetch_remote_obj() returned r=-2
2017-04-02 23:02:27.772560 7fa58ffff700 0 store->fetch_remote_obj() returned r=-2
2017-04-02 23:02:27.788009 7fa580fe1700 0 ERROR: a sync operation returned error
2017-04-02 23:02:27.788023 7fa580fe1700 0 ERROR: a sync operation returned error
2017-04-02 23:02:27.963422 7fa580fe1700 0 ERROR: failure in sync, backing out (sync_status=-2)
2017-04-02 23:02:28.093612 7fa580fe1700 0 WARNING: skipping data log entry for missing bucket aspen:1d0e03f4-f7fc-4ee6-a956-b66483526e3d.4741.4

rgw-error-warn.log View - ERROR|WARN in ceph-client.rgw.ceph-1.log (842 KB) Gabriel Wicke, 04/15/2017 05:41 PM

debug_rgw_deletes.log.xz - radosgw log in swift-bench delete phase (39.3 KB) Gabriel Wicke, 04/17/2017 09:12 PM

debug_rgw_post_bucket_drop.log.xz - radosgw log after deletion of previous bucket, with swift-bench pushing new puts for new bucket (88.8 KB) Gabriel Wicke, 04/17/2017 09:13 PM

debug_rgw_puts.log.xz - radosgw log in swift-bench put phase (55.6 KB) Gabriel Wicke, 04/17/2017 09:13 PM

debug_20_rgw_deletes.log.xz - 10 second snapshot of debug 20 log during deletions (333 KB) Gabriel Wicke, 04/17/2017 09:50 PM

debug_20_rgw_post_bucket_delete_puts.log.xz - 10 second snapshot of debug 20 log during puts, after previous bucket was deleted (475 KB) Gabriel Wicke, 04/17/2017 09:50 PM

massif-3.txt View (461 KB) Gabriel Wicke, 04/20/2017 03:07 AM


Related issues

Related to rgw - Bug #19861: multisite: memory leak on failed lease in RGWDataSyncShardCR Resolved 05/04/2017

History

#1 Updated by Haroboro Ha 9 months ago

runtime environment: ubuntu 14.04
radosgw version: 10.2.6-1trusty

#2 Updated by Nathan Cutler 9 months ago

  • Project changed from Ceph to rgw
  • Category deleted (radosgw)
  • Target version deleted (v10.2.7)
  • Affected Versions v10.2.6 added
  • Release jewel added

#3 Updated by Casey Bodley 8 months ago

  • Assignee set to Casey Bodley

#4 Updated by Gabriel Wicke 8 months ago

I am seeing a leak with active-active multisite operation on 10.2.7 as well, and some of the log entries match the ones reported by Haroboro:

2017-04-14 23:52:59.644911 7f3603fd6700 0 ERROR: a sync operation returned error
2017-04-14 23:52:59.644937 7f3603fd6700 0 ERROR: a sync operation returned error
2017-04-14 23:52:59.644948 7f3603fd6700 0 ERROR: a sync operation returned error
2017-04-14 23:52:59.644960 7f3603fd6700 0 ERROR: failure in sync, backing out (sync_status=-2)
2017-04-14 23:52:59.651205 7f3603fd6700 0 WARNING: skipping data log entry for missing bucket 72f6b176c6ff44e99eb93ee11d2d5bf6_6:6924a779-a3f3-4e8c-a5e3-666cd5f0d97c.34099.12:36

During heavy load, the "ERROR: a sync operation returned error" entries are very frequent, at a rate of a few dozen per second. This correlates with fairly rapid memory growth of the radosgw process.

In contrast with Haroboro's report, I have no errors related to RGWBackoffControlCR.

#5 Updated by Gabriel Wicke 8 months ago

Here are the last 10k ERROR|WARN lines of the rgw log on one of the nodes.

#6 Updated by Gabriel Wicke 8 months ago

To rule out libcurl issues I upgraded to 7.52.1-4 from Debian sid. This did not resolve the issue.

The vast majority of the heap growth seemed to happen while processing replicated operations for a bucket that was just deleted by swift-bench. This would correlate with the "WARNING: skipping data log entry for missing bucket" and "ERROR: failure in sync, backing out (sync_status=-2)" log entries.

#7 Updated by Gabriel Wicke 8 months ago

Some more detail on the test:
swift-bench.conf:
[bench]
auth = http://ceph2-1/auth/v1.0
user = benchmark:swift
key = XXXXXX
auth_version = 1.0

benchmark loop:
while true; do swift-bench -c 32 -C1 -l 1000 -s 200000 -n 10000 -g 100 ./swift-bench.conf; done

The benchmark is run against the radosgw instance in the secondary (non-master) zone. Logs are from the radosgw instance in the master zone. Swapping the roles did not seem to make a difference; both master and secondary radosgw instances grow beyond 1 GB RSS.
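To quantify the RSS growth during a benchmark run like this, one can sample the process's resident set size over time. A minimal sketch (Linux-only, reading /proc; the pid, interval, and sample count are illustrative, not part of the original report):

```python
import time


def read_rss_kb(pid):
    """Return the resident set size of `pid` in kB, parsed from /proc/<pid>/status."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # value is reported in kB
    raise RuntimeError("VmRSS not found")


def watch(pid, interval=60, samples=120):
    """Print one RSS sample per interval; a steadily climbing series suggests a leak."""
    for _ in range(samples):
        print(time.strftime("%H:%M:%S"), read_rss_kb(pid), "kB")
        time.sleep(interval)
```

Running `watch(<radosgw pid>)` alongside the swift-bench loop would show whether RSS climbs monotonically while the sync errors are logged.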

#8 Updated by Gabriel Wicke 8 months ago

It looks like the dynamic log level change didn't work the previous time, so here are actual level 20 log snippets.

#9 Updated by Gabriel Wicke 8 months ago

I did a bit more testing with valgrind. While memcheck didn't yield anything consistent, a massif run over ~30 minutes showed this allocation chain using over 40% of memory:

->40.68% (40,051,440B) 0x51E7C89: RGWDataSyncShardCR::incremental_sync() (in /usr/lib/librgw.so.2.0.0) | ->40.68% (40,051,440B) 0x51EA08B: RGWDataSyncShardCR::operate() (in /usr/lib/librgw.so.2.0.0) | ->40.68% (40,051,440B) 0x50F89FC: RGWCoroutinesStack::operate(RGWCoroutinesEnv*) (in /usr/lib/librgw.so.2.0.0) | ->40.68% (40,051,440B) 0x50FBE75: RGWCoroutinesManager::run(std::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&) (in /usr/lib/librgw.so.2.0.0) | ->40.68% (40,051,440B) 0x50FDDCC: RGWCoroutinesManager::run(RGWCoroutine*) (in /usr/lib/librgw.so.2.0.0) | ->40.68% (40,051,440B) 0x51C754F: RGWRemoteDataLog::run_sync(int, rgw_data_sync_status&) (in /usr/lib/librgw.so.2.0.0) | ->40.68% (40,051,440B) 0x529FD47: RGWDataSyncProcessorThread::process() (in /usr/lib/librgw.so.2.0.0) | ->40.68% (40,051,440B) 0x523C0F4: RGWRadosThread::Worker::entry() (in /usr/lib/librgw.so.2.0.0) | ->40.68% (40,051,440B) 0xF1A7492: start_thread (pthread_create.c:333) | ->40.68% (40,051,440B) 0xFF3F93D: clone (clone.S:97)

#10 Updated by Gabriel Wicke 8 months ago

Same with hopefully better formatting:

->40.68% (40,051,440B) 0x51E7C89: RGWDataSyncShardCR::incremental_sync() (in /usr/lib/librgw.so.2.0.0)
| ->40.68% (40,051,440B) 0x51EA08B: RGWDataSyncShardCR::operate() (in /usr/lib/librgw.so.2.0.0)
|   ->40.68% (40,051,440B) 0x50F89FC: RGWCoroutinesStack::operate(RGWCoroutinesEnv*) (in /usr/lib/librgw.so.2.0.0)
|     ->40.68% (40,051,440B) 0x50FBE75: RGWCoroutinesManager::run(std::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&) (in /usr/lib/librgw.so.2.0.0)
|       ->40.68% (40,051,440B) 0x50FDDCC: RGWCoroutinesManager::run(RGWCoroutine*) (in /usr/lib/librgw.so.2.0.0)
|         ->40.68% (40,051,440B) 0x51C754F: RGWRemoteDataLog::run_sync(int, rgw_data_sync_status&) (in /usr/lib/librgw.so.2.0.0)
|           ->40.68% (40,051,440B) 0x529FD47: RGWDataSyncProcessorThread::process() (in /usr/lib/librgw.so.2.0.0)
|             ->40.68% (40,051,440B) 0x523C0F4: RGWRadosThread::Worker::entry() (in /usr/lib/librgw.so.2.0.0)
|               ->40.68% (40,051,440B) 0xF1A7492: start_thread (pthread_create.c:333)
|                 ->40.68% (40,051,440B) 0xFF3F93D: clone (clone.S:97)

#11 Updated by Gabriel Wicke 8 months ago

Complete output from a ~2 hours massif run attached.

The top allocation groups are all the same, with similar ratios between them. Overall it looks like sync coroutines or threads are not collected / freed in a timely manner during continuous operation. They are freed on shutdown, which is why memcheck results were inconclusive.
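The pattern this massif output suggests, where per-entry child work is spawned during incremental sync but only reclaimed at shutdown, can be illustrated abstractly. All names below are hypothetical stand-ins, not RGW's actual classes; the point is only that an error path which backs out without draining spawned work leaks it across repeated sync attempts:

```python
class Stack:
    """Stand-in for a spawned child coroutine stack."""
    pass


class Manager:
    """Toy manager: spawned stacks stay registered until explicitly drained."""
    def __init__(self):
        self.live = []

    def spawn(self):
        self.live.append(Stack())

    def drain_all(self):
        self.live.clear()


def sync_shard_leaky(mgr, entries):
    """Backs out on error without draining, so mgr.live grows per attempt."""
    for _ in entries:
        mgr.spawn()
        return -2  # simulated "failure in sync, backing out"
    mgr.drain_all()
    return 0


def sync_shard_drained(mgr, entries):
    """Drains spawned stacks on every exit path, including the error path."""
    try:
        for _ in entries:
            mgr.spawn()
            return -2  # same simulated error
    finally:
        mgr.drain_all()
```

Under repeated failing sync attempts, the leaky variant accumulates one live stack per call while the drained variant returns to zero, matching the observation that memory is only reclaimed on shutdown.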

#12 Updated by Casey Bodley 8 months ago

testing https://github.com/ceph/ceph/pull/14714 as a potential fix

#13 Updated by Casey Bodley 7 months ago

  • Related to Bug #19861: multisite: memory leak on failed lease in RGWDataSyncShardCR added

#14 Updated by Nathan Cutler 7 months ago

  • Backport set to jewel, kraken

#15 Updated by Nathan Cutler 7 months ago

If this ends up being fixed by https://github.com/ceph/ceph/pull/14714 that is the same fix as for #19861 - in that case only one of the two tracker issues needs to have the backport field populated.

#16 Updated by Casey Bodley 7 months ago

Nathan Cutler wrote:

If this ends up being fixed by https://github.com/ceph/ceph/pull/14714 that is the same fix as for #19861 - in that case only one of the two tracker issues needs to have the backport field populated.

right, thanks. we haven't confirmed that https://github.com/ceph/ceph/pull/14714 addresses all leaks, so i'm not comfortable closing this one - that's why i added the separate backport ticket

#17 Updated by Yehuda Sadeh 6 months ago

  • Subject changed from heavy memory leak when multisite sync fail on 10.2.6 to rgw: heavy memory leak when multisite sync fail on 10.2.6

#18 Updated by Orit Wasserman 6 months ago

  • Status changed from New to Resolved
