Bug #37788

closed

ceph osd process run out of memory

Added by chandler bing over 5 years ago. Updated over 5 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hello everyone,
We are testing Ceph 13.2.2 with the RBD service on our servers, and we found that OSD processes restart at runtime during a 4k sequential read test.
Three servers form the storage pool, and another server is used to test the RBD service. We created 8 images on the test server and use fio to read and write them.

The configuration of each server is as follows:
OS: CentOS 7.5
Drives: 35 x 4 TB HDD
NVMe: 1.6 TB NVMe SSD for the WAL and DB
Memory: 256 GB
We checked the log of the crashing OSD process; it seems the process could not get the memory it requested, which led to the crash. The attachment contains the detailed stack info.
We used google-perftools to check the heap usage of each OSD and found that the heap freelist consumed almost 5~6 GB per OSD. We tried the command "ceph osd tell osd.* heap release", but it did not help: the memory was still held by the OSD processes.
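To put the freelist figure in context, here is a rough back-of-the-envelope calculation in Python. It assumes one OSD per HDD (35 OSDs per server), which the report does not state explicitly; the other numbers are taken from the report.

```python
# Figures from the report. One OSD per HDD is an assumption, not stated in the report.
OSDS_PER_SERVER = 35             # assumed: one OSD for each of the 35 HDDs
SERVER_RAM_GB = 256              # memory per server
FREELIST_GB_PER_OSD = (5, 6)     # reported freelist range per OSD


def freelist_total_gb(osds: int, per_osd_gb: int) -> int:
    """Memory held in tcmalloc freelists across all OSDs on one server."""
    return osds * per_osd_gb


low, high = (freelist_total_gb(OSDS_PER_SERVER, g) for g in FREELIST_GB_PER_OSD)
print(f"freelist held per server: {low}-{high} GB of {SERVER_RAM_GB} GB")
print(f"share of RAM: {low / SERVER_RAM_GB:.0%}-{high / SERVER_RAM_GB:.0%}")
```

Under that assumption, freelist memory alone would account for roughly two thirds to four fifths of each server's RAM, before counting memory the OSDs are actively using.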
The OS on our servers is CentOS 7.5, whose page size is 64 KB by default. We guessed that the huge memory consumption was related to hugepages, so we rebuilt the kernel image with the page size set to 4 KB. With the new kernel, total memory consumption is 23 GB during the 4k sequential read test, and the OSD processes no longer restart.
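For anyone comparing setups, the base page size of the running kernel can be checked without rebooting. This is a minimal sketch using only the Python standard library on Linux; the function name `page_size_kib` is made up for illustration.

```python
import os


def page_size_kib() -> int:
    """Return the kernel's base page size in KiB (Linux/Unix only)."""
    return os.sysconf("SC_PAGE_SIZE") // 1024


size = page_size_kib()
print(f"kernel base page size: {size} KiB")
if size > 4:
    # The report saw the high memory use with 64 KB pages, not with 4 KB pages.
    print("page size is larger than 4 KiB; the behaviour described here may apply")
```

This reports the base page size only; it does not say anything about whether transparent hugepages are enabled.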
I am wondering why the OSD consumes so much memory when the page size is 64 KB. Is it a tcmalloc bug or something else? Is there a config option in Ceph or tcmalloc that can limit the memory usage of each OSD?

Files

call stack.png (62.4 KB), chandler bing, 01/04/2019 08:29 AM
heap usage in osd.png (138 KB), chandler bing, 01/04/2019 08:29 AM
Actions #1

Updated by chandler bing over 5 years ago

Sorry about the layout of the post; I am not sure why the text became a picture.

Actions #2

Updated by Greg Farnum over 5 years ago

  • Project changed from Ceph to RADOS
Actions #3

Updated by Greg Farnum over 5 years ago

  • Status changed from New to Closed

Unfortunately there are a number of known issues with tcmalloc and hugepages. I don't think I've seen it this bad before, but there are comments on the subject at, e.g., https://github.com/gperftools/gperftools/issues/535#issuecomment-362883736. :(
