Bug #46658 (closed)

Ceph-OSD nautilus/octopus memory leak?

Added by Christophe Hauquiert almost 4 years ago. Updated over 3 years ago.

Status: Rejected
Priority: Normal
Assignee: -
Target version:
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite: rados
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi everyone,

I'm operating a Ceph cluster (Octopus 15.2.4 7447c15c6ff58d7fce91843b705a268a1917325c) that was upgraded year after year from Luminous through Mimic and Nautilus, and since the upgrade to Nautilus (and now Octopus) I'm facing what appears to be a memory leak.

My issue: hour after hour, the memory used by the ceph-osd daemon grows until the process crashes due to OOM. It starts at ~2GB of RAM and crashes once it has consumed ~16GB.

Technical details:
- The cluster is ~2 years old and was upgraded from Luminous through Mimic and Nautilus
- It is running Ceph Octopus 15.2.4 7447c15c6ff58d7fce91843b705a268a1917325c (including OSDs, monitors, MGRs and MDSs)
- All servers run an up-to-date Debian 10 with kernel 4.19.0-9-amd64
- It is a small cluster with only 4 OSDs, all on SSD disks
- Load is roughly ~100 IOPS and 2MB/s
- I export all RADOS images every night, and the memory used by the OSDs grows very quickly during this operation (a rough sketch of the job is shown just below this list)
- The issue appeared right after the upgrade from Mimic to Nautilus.
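For reference, the nightly export is essentially a loop like the one below, assuming the export is done with rbd export; the pool name and destination path are placeholders, not my actual ones:

# rough sketch of the nightly backup job; "rbd" pool and /backup path are placeholders
for img in $(rbd ls rbd); do
    rbd export rbd/"$img" /backup/"$img".img
done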

What I checked:
- osd_memory_target = 23546088652 (2GB) is set per OSD class (ceph config set osd/class:ssd osd_memory_target 23546088652); I verified it with "ceph config get osd.X osd_memory_target"
- ceph daemon osd.X dump_mempools => total memory used is ~1-2GB, consistent with the setting above
- just before the OOM crash the memory used by ceph-osd is > 16GB, but "ceph daemon osd.X dump_mempools" still shows memory usage at ~2GB (see the comparison sketch right after this list)
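The comparison is done roughly like this (osd.0 as an example; assumes jq is installed, and the exact JSON path in the dump_mempools output may vary slightly between releases):

# mempool accounting reported by the OSD itself (bytes)
ceph daemon osd.0 dump_mempools | jq '.mempool.total.bytes'
# actual resident memory of the ceph-osd processes (RSS in KiB)
ps -o pid=,rss=,comm= -C ceph-osd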

I'm attaching a graph that shows the free memory on my 4 servers; I now restart the ceph-osd daemons twice a night, and consumption stays between 2GB and 7GB.
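For completeness, the twice-a-night restart is nothing more than a root cron entry along these lines (the schedule and the OSD id are examples, not my exact values):

# workaround: restart osd.0 at 00:30 and 03:30 (times and id are placeholders)
30 0,3 * * * systemctl restart ceph-osd@0.service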

Logs from when it crashed:

Jul 20 00:00:00 [SERVER NAME] ceph-osd[3996874]: 2020-07-20T00:00:00.624+0000 7f6289a79700 -1 Fail to open '/proc/2600629/cmdline' error = (2) No such file or directory
Jul 20 00:00:00 [SERVER NAME] ceph-osd[3996874]: 2020-07-20T00:00:00.628+0000 7f6289a79700 -1 received signal: Hangup from <unknown> (PID: 2600629) UID: 0
Jul 20 00:00:00 [SERVER NAME] ceph-osd[3996874]: 2020-07-20T00:00:00.684+0000 7f6289a79700 -1 received signal: Hangup from pkill -1 -x ceph-mon|ceph-mgr|ceph-mds|ceph-osd|ceph-fuse|radosgw|rbd-mirror (PID: 2600630) UID: 0
Jul 20 00:06:49 [SERVER NAME] systemd[1]: ceph-osd@0.service: Main process exited, code=killed, status=6/ABRT
Jul 20 00:06:49 [SERVER NAME] systemd[1]: ceph-osd@0.service: Failed with result 'signal'.
Jul 20 00:07:00 [SERVER NAME] systemd[1]: ceph-osd@0.service: Service RestartSec=10s expired, scheduling restart.
Jul 20 00:07:00 [SERVER NAME] systemd[1]: ceph-osd@0.service: Scheduled restart job, restart counter is at 1.
Jul 20 00:07:00 [SERVER NAME] systemd[1]: Stopped Ceph object storage daemon osd.0.
Jul 20 00:07:00 [SERVER NAME] systemd[1]: Starting Ceph object storage daemon osd.0...
Jul 20 00:07:00 [SERVER NAME] systemd[1]: Started Ceph object storage daemon osd.0.
Jul 20 00:07:21 [SERVER NAME] ceph-osd[2618153]: 2020-07-20T00:07:21.170+0000 7fad1273ae00 -1 osd.0 9901 log_to_monitors {default=true}
Jul 20 00:07:21 [SERVER NAME] ceph-osd[2618153]: 2020-07-20T00:07:21.206+0000 7fad0bc4b700 -1 osd.0 9901 set_numa_affinity unable to identify public interface 'enp3s0' numa node: (0) Success

My questions:
- How can I confirm a memory leak? Is "ceph daemon osd.X dump_mempools" supposed to show the total memory used by the ceph-osd daemon?
- Is there something wrong with my configuration? (I only set osd_memory_target to 2GB)

I'll try to reproduce this issue with a from-scratch install on a test deployment.

Thanks for your help :)


Files

ceph-oom.png (60.2 KB), Christophe Hauquiert, 07/21/2020 01:17 PM