Bug #46658

Ceph-OSD nautilus/octopus memory leak?

Added by Christophe Hauquiert 18 days ago. Updated 8 days ago.

Status:
Need More Info
Priority:
Normal
Assignee:
-
Target version:
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
rados
Pull request ID:
Crash signature:

Description

Hi everyone,

I'm operating a Ceph cluster (Octopus 15.2.4 7447c15c6ff58d7fce91843b705a268a1917325c) that has been upgraded year after year from Luminous through Mimic to Nautilus, and since the upgrade to Nautilus (and now Octopus) I have been facing what appears to be a memory leak.

My issue: hour after hour, the memory used by the ceph-osd daemon grows until the daemon crashes due to OOM. It starts at ~2GB of RAM and crashes once it has consumed ~16GB.

Technical details:
- The cluster is ~2 years old and was upgraded from Luminous through Mimic and Nautilus
- The whole cluster (OSDs, monitors, MGR, MDS) runs Ceph Octopus 15.2.4 7447c15c6ff58d7fce91843b705a268a1917325c
- All servers run up-to-date Debian 10 with kernel 4.19.0-9-amd64
- It is a small cluster with only 4 OSDs, all on SSDs
- Load is ~100 IOPS and 2MB/s
- I export all RADOS images every night, and the memory used by the OSDs grows very quickly during this operation
- The issue appeared right after the upgrade from Mimic to Nautilus.

What I checked:
- osd_memory_target = 23546088652 (2GB), set per OSD class (ceph config set osd/class:ssd osd_memory_target 23546088652); I verified it with "ceph config get osd.X osd_memory_target"
- "ceph daemon osd.X dump_mempools" => total memory used is ~1-2GB, consistent with the previous setting
- just before the OOM crash, memory used by ceph-osd is > 16GB, but "ceph daemon osd.X dump_mempools" still shows memory usage at ~2GB (a sketch of this cross-check follows below)
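
For reference, a minimal shell sketch of that cross-check (osd.0 is used as an example id; jq is an assumed extra tool, and the ps line assumes a single ceph-osd process per host):

    # configured memory target for this OSD, in bytes
    ceph config get osd.0 osd_memory_target

    # total bytes tracked by the mempools (what dump_mempools reports)
    ceph daemon osd.0 dump_mempools | jq '.mempool.total.bytes'

    # resident set size of the actual ceph-osd process, in kB
    ps -o rss= -p "$(pidof ceph-osd)"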

I've attached a graph showing free memory on my 4 servers; I restart the ceph-osd daemons twice a night and consumption now stays between 2GB and 7GB.

Logs from when it crashed:

Jul 20 00:00:00 [SERVER NAME] ceph-osd[3996874]: 2020-07-20T00:00:00.624+0000 7f6289a79700 -1 Fail to open '/proc/2600629/cmdline' error = (2) No such file or directory
Jul 20 00:00:00 [SERVER NAME] ceph-osd[3996874]: 2020-07-20T00:00:00.628+0000 7f6289a79700 -1 received signal: Hangup from <unknown> (PID: 2600629) UID: 0
Jul 20 00:00:00 [SERVER NAME] ceph-osd[3996874]: 2020-07-20T00:00:00.684+0000 7f6289a79700 -1 received signal: Hangup from pkill -1 -x ceph-mon|ceph-mgr|ceph-mds|ceph-osd|ceph-fuse|radosgw|rbd-mirror (PID: 2600630) UID: 0
Jul 20 00:06:49 [SERVER NAME] systemd[1]: : Main process exited, code=killed, status=6/ABRT
Jul 20 00:06:49 [SERVER NAME] systemd[1]: : Failed with result 'signal'.
Jul 20 00:07:00 [SERVER NAME] systemd[1]: : Service RestartSec=10s expired, scheduling restart.
Jul 20 00:07:00 [SERVER NAME] systemd[1]: : Scheduled restart job, restart counter is at 1.
Jul 20 00:07:00 [SERVER NAME] systemd[1]: Stopped Ceph object storage daemon osd.0.
Jul 20 00:07:00 [SERVER NAME] systemd[1]: Starting Ceph object storage daemon osd.0...
Jul 20 00:07:00 [SERVER NAME] systemd[1]: Started Ceph object storage daemon osd.0.
Jul 20 00:07:21 [SERVER NAME] ceph-osd[2618153]: 2020-07-20T00:07:21.170+0000 7fad1273ae00 -1 osd.0 9901 log_to_monitors {default=true}
Jul 20 00:07:21 [SERVER NAME] ceph-osd[2618153]: 2020-07-20T00:07:21.206+0000 7fad0bc4b700 -1 osd.0 9901 set_numa_affinity unable to identify public interface 'enp3s0' numa node: (0) Success

My questions:
- How can I confirm a memory leak? Is "ceph daemon osd.X dump_mempools" supposed to show the total memory used by the ceph-osd daemon? (one way to at least confirm the OOM kills is sketched just below)
- Is there something wrong in my configuration? (I only set osd_memory_target to 2GB)
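
For what it's worth, a minimal sketch of confirming the OOM kills from the kernel log on an OSD host (assumes journalctl/dmesg access; not taken from this thread):

    # kernel messages from the OOM killer
    journalctl -k | grep -iE 'out of memory|oom-killer'

    # same thing from the kernel ring buffer if journald is not in use
    dmesg -T | grep -iE 'out of memory|oom-killer'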

I'll try to reproduce this issue with a from-scratch install on a test deployment.

Thanks for your help :)

ceph-oom.png (60.2 KB) Christophe Hauquiert, 07/21/2020 01:17 PM

History

#1 Updated by Igor Fedotov 18 days ago

@Christophe - you might be interested in the following thread: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/TPIFMPQ6YHEK4GYH5LA6NWGRFXVW44MB/

Also wondering if you can do memory profiling as per: https://docs.ceph.com/docs/octopus/rados/troubleshooting/memory-profiling/
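
For reference, the heap-profiling workflow from that page looks roughly like this (osd.0 is used as an example id; the dump file name and paths are defaults and may differ on this cluster):

    # start the tcmalloc heap profiler on a running OSD
    ceph tell osd.0 heap start_profiler

    # ...reproduce the memory growth, then dump and inspect the heap
    ceph tell osd.0 heap dump
    ceph tell osd.0 heap stats

    # analyse the dump (the file lands in the OSD's log directory)
    google-pprof --text /usr/bin/ceph-osd /var/log/ceph/osd.0.profile.0001.heap

    # stop profiling when done
    ceph tell osd.0 heap stop_profiler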

#2 Updated by Christophe Hauquiert 15 days ago

Igor Fedotov wrote:

@Christophe - you might be interested in the following thread: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/TPIFMPQ6YHEK4GYH5LA6NWGRFXVW44MB/

Also wondering if you can do memory profiling as per: https://docs.ceph.com/docs/octopus/rados/troubleshooting/memory-profiling/

Hi @Igor,

Thank you for your quick feedback (and sorry, I was very busy these last few days). I read the thread and indeed we are facing the same issue, but it seems it is still unresolved...
I will try to enable memory profiling and come back with the results.

#3 Updated by Neha Ojha 9 days ago

  • Status changed from New to Need More Info

#4 Updated by Adam Kupczyk 9 days ago

@Christophe:
Can you share dump_mempools?
Regarding "osd_memory_target = 23546088652": please note that 23546088652 is ~2.3e10, not 2.3e9.
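
A quick shell check of the arithmetic (the last command assumes a 2 GiB target is what was intended; it mirrors the set command from the description):

    # 2 GiB expressed in bytes
    echo $(( 2 * 1024 * 1024 * 1024 ))            # 2147483648

    # the value that was actually set, in GiB (integer division)
    echo $(( 23546088652 / 1024 / 1024 / 1024 ))  # 21, i.e. ~21.9 GiB

    # corrected target for the ssd device class
    ceph config set osd/class:ssd osd_memory_target 2147483648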

#5 Updated by Christophe Hauquiert 9 days ago

Adam Kupczyk wrote:

@Christophe:
Can you share dump_mempools?
Regarding "osd_memory_target = 23546088652": please note that 23546088652 is ~2.3e10, not 2.3e9.

oO you're right, I'll fix the value.

Right now my ceph-osd daemon is consuming ~3.8GB:

ceph daemon osd.3 dump_mempools
{
    "mempool": {
        "by_pool": {
            "bloom_filter": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_alloc": {
                "items": 2679100,
                "bytes": 81838352
            },
            "bluestore_cache_data": {
                "items": 16145,
                "bytes": 505143296
            },
            "bluestore_cache_onode": {
                "items": 492019,
                "bytes": 314892160
            },
            "bluestore_cache_other": {
                "items": 49707795,
                "bytes": 760594849
            },
            "bluestore_fsck": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_txc": {
                "items": 2,
                "bytes": 1488
            },
            "bluestore_writing_deferred": {
                "items": 5,
                "bytes": 22314
            },
            "bluestore_writing": {
                "items": 12,
                "bytes": 49152
            },
            "bluefs": {
                "items": 855,
                "bytes": 26008
            },
            "buffer_anon": {
                "items": 31997,
                "bytes": 10522303
            },
            "buffer_meta": {
                "items": 45914,
                "bytes": 4040432
            },
            "osd": {
                "items": 142,
                "bytes": 1835776
            },
            "osd_mapbl": {
                "items": 21,
                "bytes": 132967
            },
            "osd_pglog": {
                "items": 528122,
                "bytes": 180678992
            },
            "osdmap": {
                "items": 1856,
                "bytes": 139664
            },
            "osdmap_mapping": {
                "items": 0,
                "bytes": 0
            },
            "pgmap": {
                "items": 0,
                "bytes": 0
            },
            "mds_co": {
                "items": 0,
                "bytes": 0
            },
            "unittest_1": {
                "items": 0,
                "bytes": 0
            },
            "unittest_2": {
                "items": 0,
                "bytes": 0
            }
        },
        "total": {
            "items": 53503985,
            "bytes": 1859917753
        }
    }
}

#6 Updated by Christophe Hauquiert 9 days ago

Adam Kupczyk wrote:

@Christophe:
Can you share dump_mempools?
Regarding "osd_memory_target = 23546088652": please note that 23546088652 is ~2.3e10, not 2.3e9.

I'm not sure that is related to my issue, because I was previously running Ceph with the default configuration (4GB memory target) and it still grew to 16GB and crashed.
I fixed the value and will watch the memory over the next few hours.

#7 Updated by Christophe Hauquiert 8 days ago

Christophe Hauquiert wrote:

Adam Kupczyk wrote:

@Christophe:
Can you share dump_mempools?
Regarding "osd_memory_target = 23546088652": please note that 23546088652 is ~2.3e10, not 2.3e9.

I'm not sure that is related to my issue, because I was previously running Ceph with the default configuration (4GB memory target) and it still grew to 16GB and crashed.
I fixed the value and will watch the memory over the next few hours.

Memory is still increasing after I fixed the memory target: 5GB today (last OSD restart was 15h ago).
Tomorrow I will try to enable memory profiling as you requested.
