Bug #17108

closed

CephContext memory leaks after global_init_daemonize()

Added by Casey Bodley over 7 years ago. Updated almost 3 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%
Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We've been seeing valgrind failures in radosgw from all of our recent teuthology runs. An example of the valgrind output points to a std::string in the md_config_t coming from CephContext: http://qa-proxy.ceph.com/teuthology/teuthology-2016-08-20_17:05:04-rgw-master---basic-smithi/376190/remote/smithi046/log/valgrind/client.0.log.gz

I've narrowed it down to a trivial test case that calls global_init(), common_init_finish(), global_init_daemonize(), and g_ceph_context->put(): https://gist.github.com/cbodley/4551b29c50718c230683a6c1d65b326a
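For reference, the shape of that test is roughly the sketch below. This is a paraphrase of the setup described above, not the gist's exact code; the global_init()/global_init_daemonize() parameter lists have changed across Ceph releases, so treat them as an assumption:

// cephcontext_test.cc - rough sketch of the reproducer described above.
// NOTE: the global_init()/global_init_daemonize() parameters are an
// assumption; they differ between Ceph releases and may not match the gist.
#include <vector>
#include "common/common_init.h"
#include "global/global_init.h"
#include "global/global_context.h"

int main(int argc, const char **argv)
{
  std::vector<const char*> args(argv + 1, argv + argc);

  // Parse -c/-f and friends, and create the global CephContext.
  global_init(nullptr, args, CEPH_ENTITY_TYPE_CLIENT,
              CODE_ENVIRONMENT_DAEMON, 0);
  common_init_finish(g_ceph_context);

  // fork() into the background unless -f was given (daemonize=false).
  if (g_conf->daemonize)
    global_init_daemonize(g_ceph_context);

  // Drop the last reference; ~CephContext() runs here, and this is where
  // the leaks and the hang show up under valgrind.
  g_ceph_context->put();
  return 0;
}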

When the test is run with the -f flag (which sets daemonize=false), no leaks are detected:

$ valgrind --tool=memcheck --leak-check=full bin/cephcontext_test -c ceph.conf -f
==18331== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Without -f, valgrind complains about the CephContext leaks, but the test doesn't terminate as I'd expect:

$ valgrind --tool=memcheck --leak-check=full bin/cephcontext_test -c ceph.conf
...
==18335== ERROR SUMMARY: 105 errors from 105 contexts (suppressed: 0 from 0)

$ ps ax | grep valgrind
18339 ?        Ssl    0:00 valgrind --tool=memcheck --leak-check=full bin/cephcontext_test -c ceph.conf
18354 pts/1    S+     0:00 grep --color=auto valgrind

Killing the process prints the following message, followed by all of the same leaks:

$ kill 18339
==18339==
==18339== Process terminating with default action of signal 15 (SIGTERM)
==18339==    at 0x9B109E8: pthread_cond_destroy@@GLIBC_2.3.2 (pthread_cond_destroy.c:77)
==18339==    by 0x66DC23: Cond::~Cond() (Cond.h:45)
==18339==    by 0x66E22D: CephContextServiceThread::~CephContextServiceThread() (ceph_context.cc:90)
==18339==    by 0x66E279: CephContextServiceThread::~CephContextServiceThread() (ceph_context.cc:90)
==18339==    by 0x66CF5D: CephContext::join_service_thread() (ceph_context.cc:652)
==18339==    by 0x66C1A1: CephContext::~CephContext() (ceph_context.cc:522)
==18339==    by 0x66CC59: CephContext::put() (ceph_context.cc:594)
==18339==    by 0x64ACC6: main (test_main.cc:25)
...
==18339== ERROR SUMMARY: 104 errors from 104 contexts (suppressed: 0 from 0)

So we see the CephContext destructor being called, but it hangs on pthread_cond_destroy(). Looking to helgrind for help:

$ valgrind --tool=helgrind bin/cephcontext_test -c ceph.conf
==18362== ---Thread-Announcement------------------------------------------
==18362==
==18362== Thread #1 is the program's root thread
==18362==
==18362== ----------------------------------------------------------------
==18362==
==18362== Thread #1: pthread_cond_destroy: destruction of condition variable being waited upon
==18362==    at 0x98FC915: pthread_cond_destroy_WRK (hg_intercepts.c:1586)
==18362==    by 0x98FFB93: pthread_cond_destroy@* (hg_intercepts.c:1604)
==18362==    by 0x66DC23: Cond::~Cond() (Cond.h:45)
==18362==    by 0x66E22D: CephContextServiceThread::~CephContextServiceThread() (ceph_context.cc:90)
==18362==    by 0x66E279: CephContextServiceThread::~CephContextServiceThread() (ceph_context.cc:90)
==18362==    by 0x66CF5D: CephContext::join_service_thread() (ceph_context.cc:652)
==18362==    by 0x66C1A1: CephContext::~CephContext() (ceph_context.cc:522)
==18362==    by 0x66CC59: CephContext::put() (ceph_context.cc:594)
==18362==    by 0x64ACC6: main (test_main.cc:25)

At the point when CephContext's destructor fires, only two threads remain: the main thread and the log thread, and the log thread is waiting on its own condition variable. So it appears that the conflicting waiter is actually the CephContextServiceThread from the parent process: global_init_daemonize() fork()s, and the child's copy of the service thread's condition variable presumably still records a waiter that only ever existed in the parent.
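If that reading is right, the mechanism can be reproduced without any Ceph code at all. Below is a standalone illustration (my own sketch, not from this ticket): a thread is parked on a condition variable, the process fork()s, and the child destroys the inherited condition variable. With the older glibc condvar implementation in the backtrace above, pthread_cond_destroy() waits for the recorded waiter count to reach zero, so the child never gets past the destroy; newer glibc versions may not hang. Build with g++ -pthread.

// fork_condvar_hang.cc - standalone illustration of the suspected mechanism
// (not Ceph code).  A thread in the parent is parked on a condition variable;
// fork() copies that condvar, waiter count included, into the child; the
// child then tries to destroy it.
#include <pthread.h>
#include <unistd.h>
#include <sys/wait.h>
#include <cstdio>

static pthread_mutex_t mtx  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

static void *waiter(void *)
{
  pthread_mutex_lock(&mtx);
  pthread_cond_wait(&cond, &mtx);   // parked forever; nobody ever signals
  pthread_mutex_unlock(&mtx);
  return nullptr;
}

int main()
{
  pthread_t tid;
  pthread_create(&tid, nullptr, waiter, nullptr);
  sleep(1);                         // crude: let the waiter reach the condvar

  if (fork() == 0) {
    // Child: the waiter thread does not exist here, but the condition
    // variable's memory still says "one waiter".
    fprintf(stderr, "child: destroying inherited condvar...\n");
    pthread_cond_destroy(&cond);    // hangs on affected glibc versions
    fprintf(stderr, "child: destroy returned\n");  // never printed there
    _exit(0);
  }

  wait(nullptr);                    // parent sits here while the child hangs
  return 0;
}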

#1 - Updated by Samuel Just over 7 years ago

  • Priority changed from High to Normal

#2 - Updated by Sage Weil almost 3 years ago

  • Status changed from New to Closed