<p><strong>Ceph - Bug #3883: osd: leaks memory (possibly triggered by scrubbing) on argonaut</strong> (<a href="https://tracker.ceph.com/issues/3883">https://tracker.ceph.com/issues/3883</a>)</p>
<hr /><p><strong>Updated by Sage Weil (sage@newdream.net), 2013-01-22 09:22</strong> (<a href="https://tracker.ceph.com/issues/3883?journal_id=16184">journal #16184</a>)</p>
<ul><li><strong>Priority</strong> changed from <i>Urgent</i> to <i>High</i></li></ul>
<hr /><p><strong>Updated by Sage Weil (sage@newdream.net), 2013-01-30 01:16</strong> (<a href="https://tracker.ceph.com/issues/3883?journal_id=16574">journal #16574</a>)</p>
<ul><li><strong>Subject</strong> changed from <i>osd: new memory leak in wip-f?</i> to <i>osd: leaks memory (possibly triggered by scrubbing)</i></li><li><strong>Description</strong> updated (<a title="View differences" href="/journals/16574/diff?detail_id=20360">diff</a>)</li><li><strong>Priority</strong> changed from <i>High</i> to <i>Urgent</i></li><li><strong>Source</strong> changed from <i>Development</i> to <i>Community (user)</i></li></ul>
<hr /><p><strong>Updated by Samuel Just (sjust@redhat.com), 2013-01-30 14:13</strong> (<a href="https://tracker.ceph.com/issues/3883?journal_id=16604">journal #16604</a>)</p>
<ul></ul><p>The burnupi57 cluster (wip-f) does not appear to be leaking after all; the osds seem to have leveled off at around 350MB of memory. I left the profiler running on osd.1 for two days, and its heap appears not to have grown at all in that time. The burnupi15 cluster running wip_bobtail_f (0.56 series) also does not appear to be leaking (still watching). Both are scrubbing regularly.</p>
<hr /><p><strong>Updated by Sylvain Munaut (s.munaut@whatever-company.com), 2013-01-31 05:18</strong> (<a href="https://tracker.ceph.com/issues/3883?journal_id=16611">journal #16611</a>)</p>
<ul><li><strong>File</strong> <a href="/attachments/download/660/osd_mem_noscrub.png">osd_mem_noscrub.png</a> <a class="icon-only icon-magnifier" title="View" href="/attachments/660/osd_mem_noscrub.png">View</a> added</li></ul><p>I disabled scrubbing using:</p>
<pre>
ceph osd tell \* injectargs '--osd-scrub-min-interval 1000000'
ceph osd tell \* injectargs '--osd-scrub-max-interval 10000000'
</pre>
<p>and the leak seems to be gone.</p>
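<p>For reference, the per-process memory series behind a graph like this can be collected with a tiny sampler run from cron; a sketch, where <code>$$</code> (this shell) merely stands in for the pid of the OSD being watched:</p>

```shell
# Append one "epoch pid rss_kb" line per run; a few days of samples can be
# plotted to get a per-OSD memory graph. Here $$ (this shell) stands in for
# an OSD pid such as $(cat /var/run/ceph/osd.0.pid).
pid=$$
rss_kb=$(awk '/^VmRSS:/ {print $2}' /proc/"$pid"/status)
echo "$(date +%s) $pid $rss_kb"
```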
<p>See the attached graph of OSD memory for the 12 osd processes over the last 3.5 days. Memory was rising every 24h. I made the change yesterday around 13:00 and the OSDs stopped growing; OSD memory even seems to go down slowly, in small steps.</p>
<hr /><p><strong>Updated by Wido den Hollander (wido@42on.com), 2013-02-08 06:16</strong> (<a href="https://tracker.ceph.com/issues/3883?journal_id=16969">journal #16969</a>)</p>
<ul></ul><p>I've also started to see this and will try to get some heap profiling done, then report back.</p>
<ul>
<li>ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)</li>
<li>kernel version 3.2.0-36-generic</li>
<li>OSD filesystem: btrfs</li>
</ul>
<p>In this case it's RBD running behind CloudStack, so regular KVM, no kernel RBD.</p>
<p>It was a 32GB machine with just two OSDs which started swapping (16GB) and eventually went OOM.</p>
<p>To be sure that I collect the right information:</p>
<p>I see osd.7 go wild and do:</p>
<pre>ceph osd tell 7 heap start_profiler
ceph osd tell 7 heap dump
ceph osd tell 7 heap stop_profiler
kill -SEGV &lt;osd.7's pid&gt;</pre>
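<p>Once a dump exists, it can be read offline with gperftools' pprof against the osd binary. A sketch; the <code>google-pprof</code> name and the dump path are assumptions for a Debian/Ubuntu install, where heap dumps are written to the osd's log directory:</p>

```text
$ google-pprof --text /usr/bin/ceph-osd /var/log/ceph/osd.7.profile.0001.heap
```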
<p>That would give me a heap file and core, correct?</p>
<hr /><p><strong>Updated by Josh Durgin, 2013-02-08 16:45</strong> (<a href="https://tracker.ceph.com/issues/3883?journal_id=16999">journal #16999</a>)</p>
<ul></ul><p>On wip-f, one osd grew to consume 70% of RAM. The heap profiler tells us:</p>
<pre>
(pprof) top20
Total: 2.5 MB
1.9 75.4% 75.4% 1.9 75.4% ceph::log::Log::create_entry
0.6 24.5% 99.9% 0.6 24.5% std::string::_Rep::_S_create
0.0 0.1% 100.0% 0.0 0.3% ceph_heap_profiler_handle_command
0.0 0.0% 100.0% 0.0 0.0% std::deque::_M_push_back_aux
0.0 0.0% 100.0% 0.0 0.2% LogClient::do_log
0.0 0.0% 100.0% 0.0 0.0% MMonCommand::print
0.0 0.0% 100.0% 0.0 0.0% MMonSubscribeAck::print
0.0 0.0% 100.0% 0.0 0.0% MOSDMap::print
0.0 0.0% 100.0% 0.1 4.1% MOSDPing::print
</pre>
<hr /><p><strong>Updated by Greg Farnum (gfarnum@redhat.com), 2013-02-08 16:48</strong> (<a href="https://tracker.ceph.com/issues/3883?journal_id=17000">journal #17000</a>)</p>
<ul></ul><p>Oh, how interesting... I wonder if this is syslog not having enough network bandwidth? Or, more generally, log messages not getting out of the OSD process and to their target due to throughput issues triggered by the additional activity from scrub?</p>
<hr /><p><strong>Updated by Greg Farnum (gfarnum@redhat.com), 2013-02-08 16:49</strong> (<a href="https://tracker.ceph.com/issues/3883?journal_id=17001">journal #17001</a>)</p>
<ul></ul><p>Something like that (or some kind of bug in the logging system that only gets hit with syslog, or when not logging) would explain why we've been unable to reproduce locally while others have been seeing it so consistently, FYI.</p>
<hr /><p><strong>Updated by Josh Durgin, 2013-02-08 16:55</strong> (<a href="https://tracker.ceph.com/issues/3883?journal_id=17002">journal #17002</a>)</p>
<ul></ul><p>A log dump shows nothing, so I'm guessing the log is corrupted such that it keeps logging to more and more memory without bound.</p>
<pre>
ubuntu@burnupi57:~$ sudo ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok log dump
ubuntu@burnupi57:~$
</pre>
<p>It's just using default log settings, so there's no syslog involved.</p>
<hr /><p><strong>Updated by Josh Durgin, 2013-02-08 16:58</strong> (<a href="https://tracker.ceph.com/issues/3883?journal_id=17004">journal #17004</a>)</p>
<ul></ul><pre>
(gdb) p *(g_ceph_context->_log)
$4 = {<Thread> = {_vptr.Thread = 0xb42910, thread_id = 140568769971968}, m_indirect_this = 0x3300010, m_subs = 0x330d090, m_lock = 1, m_queue_mutex = {__data = {
__lock = 0, __count = 0, __owner = 0, __nusers = 1, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}},
__size = '\000' <repeats 12 times>, "\001", '\000' <repeats 26 times>, __align = 0}, m_flush_mutex = {__data = {__lock = 0, __count = 0, __owner = 0,
__nusers = 4294967290, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}},
__size = '\000' <repeats 12 times>"\372, \377\377\377", '\000' <repeats 23 times>, __align = 0}, m_cond = {__data = {__lock = 0, __futex = 322509383,
__total_seq = 161254692, __wakeup_seq = 161254691, __woken_seq = 161254691, __mutex = 0x3310028, __nwaiters = 2, __broadcast_seq = 0},
__size = "\000\000\000\000G\032\071\023$\215\234\t\000\000\000\000#\215\234\t\000\000\000\000#\215\234\t\000\000\000\000(\000\061\003\000\000\000\000\002\000\000\000\000\000\000", __align = 1385167252638138368}, m_new = {m_len = 0, m_head = 0x0, m_tail = 0x0}, m_recent = {m_len = 1934, m_head = 0x8efea90, m_tail = 0x6719dd0},
m_log_file = {static npos = <optimized out>, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>},
_M_p = 0x330a858 "/var/log/ceph/ceph-osd.1.log"}}, m_fd = 6, m_syslog_log = -2, m_syslog_crash = -2, m_stderr_log = -1, m_stderr_crash = -1, m_stop = false,
m_max_new = 1000000, m_max_recent = 10000}
</pre>
<hr /><p><strong>Updated by Sage Weil (sage@newdream.net), 2013-02-08 17:19</strong> (<a href="https://tracker.ceph.com/issues/3883?journal_id=17009">journal #17009</a>)</p>
<ul></ul><p>The bit that looks fishy here is <code>m_flush_mutex.__nusers</code>: 4294967290 is -6 read as unsigned, which shouldn't happen for a healthy mutex. Can you see what that thread is doing in gdb?</p>
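<p>A gdb sketch for inspecting that (thread numbers and exact frame names will differ; the symbol names are taken from the dump above):</p>

```gdb
(gdb) info threads
(gdb) thread apply all bt    # look for the log flush thread's frames
(gdb) p g_ceph_context->_log->m_flush_mutex.__data.__nusers
```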
<p>Maybe it's not being initialized properly or something...</p>
<hr /><p><strong>Updated by Ian Colle (icolle@redhat.com), 2013-02-11 09:41</strong> (<a href="https://tracker.ceph.com/issues/3883?journal_id=17054">journal #17054</a>)</p>
<ul><li><strong>Assignee</strong> set to <i>Josh Durgin</i></li></ul>
<hr /><p><strong>Updated by Sage Weil (sage@newdream.net), 2013-02-13 11:21</strong> (<a href="https://tracker.ceph.com/issues/3883?journal_id=17211">journal #17211</a>)</p>
<ul><li><strong>Subject</strong> changed from <i>osd: leaks memory (possibly triggered by scrubbing)</i> to <i>osd: leaks memory (possibly triggered by scrubbing) on argonaut</i></li><li><strong>Status</strong> changed from <i>12</i> to <i>Won't Fix</i></li></ul>
<hr /><p><strong>Updated by Vladislav Gorbunov (vadikgo@gmail.com), 2013-05-02 16:38</strong> (<a href="https://tracker.ceph.com/issues/3883?journal_id=21383">journal #21383</a>)</p>
<ul></ul><p>Simple workaround: run this command from cron every 5 minutes:</p>
<pre>
for osd in $(cat /var/run/ceph/osd.*.pid | xargs -I {} ps -p {} -o pid,\%mem,args | grep -v PID | awk '{if ($2 > 50) print "memory leak on process: "$0}' | awk '{print $9}'); do ceph osd set noout && service ceph stop osd.$osd && sleep 30 && service ceph start osd.$osd && ceph osd unset noout; done
</pre>
<p>This script restarts an osd process when its memory usage exceeds 50%.</p>
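<p>The same idea can be factored into a small helper so the threshold test is explicit. A sketch: the <code>check_mem</code> name and the 50% threshold are illustrative, carried over from the script above; a caller would wrap the noout/stop/sleep/start/unset-noout dance around a "restart" result.</p>

```shell
# check_mem PID THRESHOLD: prints "restart" when PID's %mem exceeds
# THRESHOLD, "ok" otherwise.
check_mem() {
    mem=$(ps -p "$1" -o %mem= | tr -d ' ')
    # compare the integer part of %mem against the threshold
    if [ "${mem%.*}" -gt "$2" ] 2>/dev/null; then
        echo restart
    else
        echo ok
    fi
}
check_mem $$ 50   # a shell process uses far less than 50% of RAM
```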