Ceph - Bug #21761: ceph-osd consumes way too much memory during recovery
https://tracker.ceph.com/issues/21761

Comment by Wyllys Ingersoll (wyllys.ingersoll@keepertech.com), 2017-10-13:
Here are some heap stats from one of the OSDs that is currently consuming about 21 GB of RAM according to "top":
osd.31 tcmalloc heap stats:------------------------------------------------
MALLOC: 17991446352 (17158.0 MiB) Bytes in use by application
MALLOC: + 0 ( 0.0 MiB) Bytes in page heap freelist
MALLOC: + 305843008 ( 291.7 MiB) Bytes in central cache freelist
MALLOC: + 81920 ( 0.1 MiB) Bytes in transfer cache freelist
MALLOC: + 105874800 ( 101.0 MiB) Bytes in thread cache freelists
MALLOC: + 66924704 ( 63.8 MiB) Bytes in malloc metadata
MALLOC: ------------
MALLOC: = 18470170784 (17614.5 MiB) Actual memory used (physical + swap)
MALLOC: + 4306345984 ( 4106.9 MiB) Bytes released to OS (aka unmapped)
MALLOC: ------------
MALLOC: = 22776516768 (21721.4 MiB) Virtual address space used
MALLOC:
MALLOC: 682446 Spans in use
MALLOC: 791 Thread heaps in use
MALLOC: 8192 Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.
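For anyone chasing similar symptoms: stats like the above can be pulled from (and freelist memory returned from) a live OSD via the admin interface. A minimal sketch, using osd.31 from this report; substitute your own OSD id:

# Print current tcmalloc heap stats for one OSD
ceph tell osd.31 heap stats

# Return freelist pages to the OS via madvise(), as suggested at the
# bottom of the stats output; this does not free memory the daemon is
# actively using
ceph tell osd.31 heap release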
Comment by Sage Weil (sage@newdream.net), 2017-10-20:

Status changed from New to Need More Info.

We recently fixed a problem that could lead to a similar scenario. When this happens, do you have lots of PGs in the recovery_wait state? If so, can you look at 'ceph pg dump | grep recovery_wait' and see whether the LOG column is big (> 10000)? If it is, this is a priority inversion problem. We just fixed it in luminous (12.2.2, not out yet). A workaround is to set osd_max_backfills to a larger value to get recovery started on the OSDs that have recovery_wait PGs. Once those PGs recover and only backfill remains, the memory usage will drop.
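A sketch of that check; the LOG field position varies between Ceph releases, so confirm it against the header row before trusting the awk field number:

# Show the header row to locate the LOG field
ceph pg dump | head -n 5

# PGs stuck in recovery_wait together with their pg_log length
# (assumes LOG is the 8th field, as in Luminous-era pg dump output)
ceph pg dump | awk '/recovery_wait/ {print $1, $8}'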
Comment by Sage Weil (sage@newdream.net), 2017-10-20:

Related to Bug #21331: pg recovery priority inversion added.

Comment by Wyllys Ingersoll (wyllys.ingersoll@keepertech.com), 2017-10-20:
The problem has since sorted itself out (though it took over a week to get back to HEALTH_OK). I believe the recovery_wait number was in the hundreds, or maybe the 1000-2000 range, at the peak.
We ended up tuning a number of things down to make the memory issue manageable. This was all triggered by removing a very old cephfs snapshot that had many TB of changes added to the system after it was originally taken. These are the parameters we set; not sure whether they were all effective, since it's sometimes hard to see the impact, but this is where we ended up getting healthy. Clearly our osd_max_backfills value was too low; perhaps it would have gone faster with a larger value.
osd map cache size = 200
osd map max advance = 100
osd map share max epochs = 100
osd pg epoch persisted max stale = 100

osd_recovery_max_active = 4
osd_recovery_threads = 1
osd_recovery_max_single_start = 1
osd_recovery_op_priority = 2
osd_op_threads = 8

osd_max_backfills = 2
osd_backfill_scan_max = 16
osd_backfill_scan_min = 4

osd_snap_trim_sleep = 0.1
osd_pg_max_concurrent_snap_trims = 1
osd_max_trimming_pgs = 1

osd_max_pg_log_entries = 1000
osd_min_pg_log_entries = 256
osd_pg_log_trim_min = 200
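For what it's worth, most of these can be pushed to running OSDs without a restart. A hedged sketch; some options (the osd map cache settings, for instance) may only be read at daemon start, in which case they belong in ceph.conf and take effect on the next OSD restart:

# Apply some of the recovery/trim settings above to all OSDs at runtime
ceph tell osd.* injectargs '--osd-recovery-max-active 4 --osd-max-backfills 2'
ceph tell osd.* injectargs '--osd-snap-trim-sleep 0.1'

# Verify what a given OSD is actually running with (run on that OSD's host)
ceph daemon osd.31 config show | grep -E 'osd_max_backfills|osd_snap_trim_sleep'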
Comment by Sage Weil (sage@newdream.net), 2018-08-29:

Status changed from Need More Info to Can't reproduce.