Bug #37388

open

rgw: memory leak with multisite sync

Added by Dieter Roels over 5 years ago. Updated over 5 years ago.

Status: In Progress
Priority: Normal
Assignee:
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Ever since we started using ceph with multisite sync (around the 12.2.2 release, I think), we have noticed that the rgw memory footprint keeps slowly growing. This resulted in an OOM kill every few days or weeks. We tested with more memory in the VMs, but even with 32GB of memory the OOMs occurred, so we settled for VMs with 8GB of memory and had OOMs about once a week. Clients did not really notice because of the load balancers.

Recently we tested the mimic release and noticed that the leak is substantially worse in mimic. Our current test environment is running 13.2.2 and has no objects, so there are no client connections other than the multisite sync. The rgws get OOM kills about once a day on 8GB VMs. We tested with 32GB VMs and they show the same memory growth, but they last for a few days before the OOM.

So, my question is, how can we test this memory leak? I did run it with valgrind once, but it throws lots of errors and seems not to be compatible with jemalloc. And it seems rgw does not keep memory statistics like the other daemons?
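Until a heap profiler can be attached, one low-overhead way to at least quantify the growth is to sample the resident set size of the rgw process over time. The following is only a minimal sketch, assuming Linux and the `VmRSS` field of `/proc/<pid>/status`; the function names, the sampling interval, and the least-squares fit are illustrative choices, not something ceph provides.

```python
import re
import time
from pathlib import Path


def parse_vmrss_kib(status_text):
    """Return the VmRSS value (in KiB) from the text of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("no VmRSS line found")
    return int(match.group(1))


def growth_kib_per_hour(samples):
    """Least-squares slope of (seconds, KiB) samples, scaled to KiB per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_r = sum(r for _, r in samples) / n
    num = sum((t - mean_t) * (r - mean_r) for t, r in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("all samples share the same timestamp")
    return num / den * 3600.0


def sample_rss(pid, interval_s=60.0, count=60):
    """Collect `count` (monotonic seconds, RSS KiB) samples for a running pid."""
    samples = []
    for _ in range(count):
        text = Path(f"/proc/{pid}/status").read_text()
        samples.append((time.monotonic(), parse_vmrss_kib(text)))
        time.sleep(interval_s)
    return samples


# Hypothetical usage, with the pid of the radosgw process filled in by hand:
#   samples = sample_rss(12345, interval_s=60.0, count=120)
#   print(growth_kib_per_hour(samples))
```

A steady positive slope on an idle gateway would support the leak hypothesis and gives a number to compare between releases, but it does not say where the memory goes; that still needs a heap profile.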

Probably useful info: we run civetweb with ssl on rhel


Related issues 1 (1 open, 0 closed)

Related to rgw - Bug #23375: Memory leak in RGW when libcurl is configured with --with-nss and performing https requests to keystone (In Progress, Mark Kogan, 03/15/2018)

Actions #1

Updated by Casey Bodley over 5 years ago

  • Assignee set to Marcus Watts

This sounds like the issues we saw with libcurl+nss in rhel. Marcus, do you remember the details around that?

Actions #2

Updated by Casey Bodley over 5 years ago

  • Assignee changed from Marcus Watts to Mark Kogan

Oops, I see that it was Mark who worked on this in http://tracker.ceph.com/issues/23375 and https://github.com/ceph/ceph/pull/20924.

Actions #3

Updated by Casey Bodley over 5 years ago

  • Related to Bug #23375: Memory leak in RGW when libcurl is configured with --with-nss and performing https requests to keystone added

Actions #4

Updated by Dieter Roels over 5 years ago

Might be related, but afaik we do not use keystone. Our current test environment on mimic has maybe 2 simple users, and absolutely no activity except the multisite sync. But it makes sense that it has something to do with ssl, since that is the only thing we do that is not out-of-the-box ceph 13.2.2.

Actions #5

Updated by Matt Benjamin over 5 years ago

  • Status changed from New to In Progress

@dieter, in previous instances of memory growth issues, it has been helpful to get a valgrind massif trace for at least a short period of operation. Your point about jemalloc is valid, but it should be possible to build without jemalloc in recent L, M or master. Maybe that is an option?

Matt
