Bug #1152

Mon getting killed by OOM killer

Added by Wido den Hollander almost 13 years ago. Updated almost 13 years ago.

Status:
Closed
Priority:
Normal
Category:
Monitor
% Done:
0%

Description

I've been seeing this for the last few weeks: my single mon keeps eating more and more memory until it reaches about 4 GB resident (RSS) and gets killed by the OOM killer.
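
A crude way to quantify the growth, assuming the daemon still shows up as cmon as in the dmesg output below, would be to log its resident size periodically (the log path is just an example):

root@monitor:~# while true; do date; ps -o pid,vsz,rss,cmd -C cmon; sleep 60; done >> /root/cmon-mem.log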

Somewhere in late April my cluster (40 OSDs) started to crash and it hasn't recovered since. I've been pretty persistent about not wiping it clean, so I'm still trying to start it up again and see what happens.

So my OSDs are bouncing up and down, and while this is going on the mon keeps eating more and more memory.

root@monitor:~# dmesg 
[4941536.986482] cmon invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
[4941536.986487] cmon cpuset=/ mems_allowed=0
[4941536.986490] Pid: 18311, comm: cmon Tainted: P           2.6.32-30-server #59-Ubuntu
[4941536.986492] Call Trace:
[4941536.986501]  [<ffffffff810b483d>] ? cpuset_print_task_mems_allowed+0x9d/0xb0
[4941536.986506]  [<ffffffff810f8bc4>] oom_kill_process+0xd4/0x2f0
[4941536.986510]  [<ffffffff810f9180>] ? select_bad_process+0xd0/0x110
[4941536.986513]  [<ffffffff810f9218>] __out_of_memory+0x58/0xc0
[4941536.986516]  [<ffffffff810f93ae>] out_of_memory+0x12e/0x1a0
[4941536.986521]  [<ffffffff8155e2ce>] ? _spin_lock+0xe/0x20
[4941536.986524]  [<ffffffff810fc511>] __alloc_pages_slowpath+0x571/0x590
[4941536.986528]  [<ffffffff810fc6a1>] __alloc_pages_nodemask+0x171/0x180
[4941536.986533]  [<ffffffff8112f777>] alloc_pages_current+0x87/0xd0
[4941536.986536]  [<ffffffff810f6387>] __page_cache_alloc+0x67/0x70
[4941536.986539]  [<ffffffff810fff39>] __do_page_cache_readahead+0xc9/0x210
[4941536.986543]  [<ffffffff811000a1>] ra_submit+0x21/0x30
[4941536.986546]  [<ffffffff810f7c5e>] filemap_fault+0x3fe/0x450
[4941536.986550]  [<ffffffff81114534>] __do_fault+0x54/0x500
[4941536.986554]  [<ffffffff81117ae8>] handle_mm_fault+0x1a8/0x3c0
[4941536.986558]  [<ffffffff811621c0>] ? mntput_no_expire+0x30/0x110
[4941536.986562]  [<ffffffff81560e45>] do_page_fault+0x125/0x3b0
[4941536.986565]  [<ffffffff8155e795>] page_fault+0x25/0x30
[4941536.986567] Mem-Info:
[4941536.986569] Node 0 DMA per-cpu:
[4941536.986571] CPU    0: hi:    0, btch:   1 usd:   0
[4941536.986574] CPU    1: hi:    0, btch:   1 usd:   0
[4941536.986576] CPU    2: hi:    0, btch:   1 usd:   0
[4941536.986578] CPU    3: hi:    0, btch:   1 usd:   0
[4941536.986580] Node 0 DMA32 per-cpu:
[4941536.986583] CPU    0: hi:  186, btch:  31 usd:  30
[4941536.986585] CPU    1: hi:  186, btch:  31 usd:   7
[4941536.986587] CPU    2: hi:  186, btch:  31 usd:  33
[4941536.986589] CPU    3: hi:  186, btch:  31 usd: 141
[4941536.986591] Node 0 Normal per-cpu:
[4941536.986594] CPU    0: hi:  186, btch:  31 usd:  42
[4941536.986596] CPU    1: hi:  186, btch:  31 usd:   0
[4941536.986598] CPU    2: hi:  186, btch:  31 usd:  30
[4941536.986601] CPU    3: hi:  186, btch:  31 usd: 124
[4941536.986605] active_anon:756721 inactive_anon:218519 isolated_anon:0
[4941536.986607]  active_file:184 inactive_file:244 isolated_file:0
[4941536.986608]  unevictable:0 dirty:0 writeback:3 unstable:0
[4941536.986609]  free:6710 slab_reclaimable:1166 slab_unreclaimable:2883
[4941536.986610]  mapped:234 shmem:2 pagetables:2893 bounce:0
[4941536.986612] Node 0 DMA free:15856kB min:28kB low:32kB high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15276kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[4941536.986621] lowmem_reserve[]: 0 3254 4012 4012
[4941536.986625] Node 0 DMA32 free:9472kB min:6560kB low:8200kB high:9840kB active_anon:2690984kB inactive_anon:538168kB active_file:456kB inactive_file:620kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3332768kB mlocked:0kB dirty:4kB writeback:0kB mapped:360kB shmem:0kB slab_reclaimable:1732kB slab_unreclaimable:1956kB kernel_stack:208kB pagetables:7396kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:374 all_unreclaimable? yes
[4941536.986635] lowmem_reserve[]: 0 0 757 757
[4941536.986639] Node 0 Normal free:1512kB min:1524kB low:1904kB high:2284kB active_anon:335900kB inactive_anon:335908kB active_file:280kB inactive_file:356kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:775680kB mlocked:0kB dirty:0kB writeback:12kB mapped:576kB shmem:8kB slab_reclaimable:2932kB slab_unreclaimable:9576kB kernel_stack:1552kB pagetables:4176kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1044 all_unreclaimable? no
[4941536.986650] lowmem_reserve[]: 0 0 0 0
[4941536.986653] Node 0 DMA: 2*4kB 1*8kB 2*16kB 2*32kB 2*64kB 2*128kB 0*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15856kB
[4941536.986664] Node 0 DMA32: 1386*4kB 0*8kB 6*16kB 6*32kB 5*64kB 3*128kB 1*256kB 1*512kB 0*1024kB 1*2048kB 0*4096kB = 9352kB
[4941536.986673] Node 0 Normal: 378*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1512kB
[4941536.986683] 3825 total pagecache pages
[4941536.986685] 3355 pages in swap cache
[4941536.986687] Swap cache stats: add 11662870, delete 11659515, find 1323001/2079551
[4941536.986689] Free swap  = 0kB
[4941536.986691] Total swap = 1052664kB
[4941536.999070] 1048560 pages RAM
[4941536.999073] 34555 pages reserved
[4941536.999075] 956 pages shared
[4941536.999078] 1006238 pages non-shared
[4941536.999082] Out of memory: kill process 18303 (cmon) score 6569 or a child
[4941537.011910] Killed process 18303 (cmon)

My mon is running with 'debug mon = 20', but I don't know if this gives enough information.
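
For reference, the setting would sit in the [mon] section of ceph.conf; assuming the default /etc/ceph/ceph.conf path, that part of the file looks roughly like this:

root@monitor:~# grep -A1 '\[mon\]' /etc/ceph/ceph.conf
[mon]
        debug mon = 20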

I tried the [[http://ceph.newdream.net/wiki/Memory_Profiling|memory profiling]] commands as described in the wiki, but they do not work against a monitor. The mon has been linked against tcmalloc for a few weeks now though, so in theory it should work, right?
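
If heap dumps from the mon can be produced (like the ones in mon_heap.tar.gz under Files), they could presumably be inspected with pprof from google-perftools; something along these lines, where the cmon path and the dump file names are just examples:

root@monitor:~# google-pprof --text /usr/bin/cmon mon.monitor.profile.0001.heap | head -20
root@monitor:~# google-pprof --text --base=mon.monitor.profile.0001.heap /usr/bin/cmon mon.monitor.profile.0005.heap | head -20

The first command shows the top allocation sites in a single dump, the second the growth between two dumps.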

I don't know how to track this down, but it seems like something serious to me.

Let me know what to try to hunt this one down.


Files

mon_heap.tar.gz (3.01 MB) - Heap dumps from monitor - Wido den Hollander, 06/10/2011 01:31 PM