Support #22224: memory leak (closed)

Added by yair mackenzi over 6 years ago. Updated over 6 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Tags: -
Reviewed: -
Affected Versions: -
Component(RADOS): -
Pull request ID: -

Description

Hello,
We have a fresh Luminous cluster, 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc), installed using ceph-ansible.
The cluster consists of 6 nodes (Intel server board S2600WTTR), each with 64 GB of RAM and an Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz (32 cores).
Each server has 16 * 1.6 TB Dell SSD drives (SSDSC2BB016T7R), for a total of 96 OSDs, plus 3 mons.
The main usage is RBDs for our OpenStack (Ocata) environment.
We are facing an issue where OSDs go down with what looks like a memory leak:

Nov 21 08:18:04 ecprdbcph13-opens systemd[1]: : Main process exited, code=killed, status=11/SEGV
Nov 21 08:18:04 ecprdbcph13-opens systemd[1]: : Unit entered failed state.
Nov 21 08:18:04 ecprdbcph13-opens systemd[1]: : Failed with result 'signal'.
Nov 21 08:18:24 ecprdbcph13-opens systemd[1]: : Service hold-off time over, scheduling restart.
Nov 21 08:19:24 ecprdbcph13-opens systemd[1]: : Main process exited, code=killed, status=6/ABRT
Nov 21 08:19:24 ecprdbcph13-opens systemd[1]: : Unit entered failed state.
Nov 21 08:19:24 ecprdbcph13-opens systemd[1]: : Failed with result 'signal'

Nov 22 08:50:32 ecprdbcph10-opens ceph-osd[2324222]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Nov 22 08:50:32 ecprdbcph10-opens ceph-osd[2324222]: 0> 2017-11-22 08:50:32.659438 7f9f338e7700 -1 *** Caught signal (Aborted) **
Nov 22 08:50:32 ecprdbcph10-opens ceph-osd[2324222]: in thread 7f9f338e7700 thread_name:osd_srv_heartbt
Nov 22 08:50:32 ecprdbcph10-opens ceph-osd[2324222]: ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)
Nov 22 08:50:32 ecprdbcph10-opens ceph-osd[2324222]: 1: (()+0xa5a634) [0x55e993d69634]
Nov 22 08:50:32 ecprdbcph10-opens ceph-osd[2324222]: 2: (()+0x11390) [0x7f9f54447390]
Nov 22 08:50:32 ecprdbcph10-opens ceph-osd[2324222]: 3: (gsignal()+0x38) [0x7f9f533e2428]
Nov 22 08:50:32 ecprdbcph10-opens ceph-osd[2324222]: 4: (abort()+0x16a) [0x7f9f533e402a]
Nov 22 08:50:32 ecprdbcph10-opens ceph-osd[2324222]: 5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7f9f53d2584d]
Nov 22 08:50:32 ecprdbcph10-opens ceph-osd[2324222]: 6: (()+0x8d6b6) [0x7f9f53d236b6]
Nov 22 08:50:32 ecprdbcph10-opens ceph-osd[2324222]: 7: (()+0x8d701) [0x7f9f53d23701]
Nov 22 08:50:32 ecprdbcph10-opens ceph-osd[2324222]: 8: (()+0x8d919) [0x7f9f53d23919]
Nov 22 08:50:32 ecprdbcph10-opens ceph-osd[2324222]: 9: (()+0x3e2711) [0x55e9936f1711]
Nov 22 08:50:32 ecprdbcph10-opens ceph-osd[2324222]: 10: (ceph::buffer::list::append(char const*, unsigned int)+0x27c) [0x55e993d72a8c]
Nov 22 08:50:32 ecprdbcph10-opens ceph-osd[2324222]: 11: (MOSDPing::encode_payload(unsigned long)+0x3d) [0x55e9937f968d]
Nov 22 08:50:32 ecprdbcph10-opens ceph-osd[2324222]: 12: (Message::encode(unsigned long, int)+0x29) [0x55e993de4c99]
Nov 22 08:50:32 ecprdbcph10-opens ceph-osd[2324222]: 13: (AsyncConnection::prepare_send_message(unsigned long, Message*, ceph::buffer::list&)+0x30e) [0x55e99409549e]
Nov 22 08:50:32 ecprdbcph10-opens ceph-osd[2324222]: 14: (AsyncConnection::send_message(Message*)+0x4aa) [0x55e99409a95a]
Nov 22 08:50:32 ecprdbcph10-opens ceph-osd[2324222]: 15: (OSD::heartbeat()+0x863) [0x55e9937d1373]
Nov 22 08:50:32 ecprdbcph10-opens ceph-osd[2324222]: 16: (OSD::heartbeat_entry()+0x36d) [0x55e9937d222d]
Nov 22 08:50:32 ecprdbcph10-opens ceph-osd[2324222]: 17: (OSD::T_Heartbeat::entry()+0xd) [0x55e9938341cd]
Nov 22 08:50:32 ecprdbcph10-opens ceph-osd[2324222]: 18: (()+0x76ba) [0x7f9f5443d6ba]
Nov 22 08:50:32 ecprdbcph10-opens ceph-osd[2324222]: 19: (clone()+0x6d) [0x7f9f534b43dd]
Nov 22 08:50:32 ecprdbcph10-opens ceph-osd[2324222]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Nov 22 08:50:32 ecprdbcph10-opens kernel: [1213706.042733] Core dump to |/usr/share/apport/apport 2324222 6 0 2324222 pipe failed
Nov 22 08:50:32 ecprdbcph10-opens systemd[1]: : Main process exited, code=killed, status=6/ABRT
Nov 22 08:50:32 ecprdbcph10-opens systemd[1]: : Unit entered failed state.
Nov 22 08:50:32 ecprdbcph10-opens systemd[1]: : Failed with result 'signal'.
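
For reference, a minimal sketch of the kind of checks that can be run on one of the OSD nodes to watch per-OSD memory growth before a crash (osd.12 is only an example id, and `heap stats` assumes the OSDs are built with tcmalloc):

# resident memory of every ceph-osd process on this node
for pid in $(pidof ceph-osd); do
    echo "pid $pid: $(grep VmRSS /proc/$pid/status)"
done

# per-OSD memory pool accounting via the admin socket
ceph daemon osd.12 dump_mempools

# tcmalloc heap statistics for the same OSD
ceph tell osd.12 heap stats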

Also, once this is happening we cannot even run Ceph monitor commands:

root@ecprdbcph10-opens:/var/log/ceph# ceph osd tree
Traceback (most recent call last):
File "/usr/bin/ceph", line 125, in <module>
import rados
ImportError: libceph-common.so.0: cannot map zero-fill pages
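
That ImportError ("cannot map zero-fill pages") is mmap failing while the dynamic linker loads libceph-common.so, which suggests the node itself is out of memory at that point, so even the python rados binding cannot be loaded. A minimal sketch of what can be checked on the affected node at that moment (plain Linux tools, nothing Ceph-specific is assumed):

# free memory and overcommit state while the CLI is failing
free -m
grep -E 'MemFree|MemAvailable|Committed_AS|CommitLimit' /proc/meminfo

# any OOM-killer activity around the time the OSDs died
dmesg -T | grep -iE 'out of memory|oom' | tail

# resident and virtual size of the surviving OSD processes
ps -C ceph-osd -o pid,rss,vsz,cmd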

ceph-post-file: 7de983c4-3e95-4515-b359-9ac7e565fdd0

Thanks
