Backport #14799
closed
hammer: CentOS 7 tcmalloc::ThreadCache valgrind error libboost_thread-mt.so.1.53
Added by Loïc Dachary about 8 years ago.
Updated over 7 years ago.
Description
https://github.com/ceph/ceph/pull/10750
libboost_thread-mt.so.1.53 uses tcmalloc::ThreadCache which triggers a valgrind error. It should probably be suppressed.
http://pulpito.ceph.com/loic-2016-02-16_22:00:52-rados-hammer-backports---basic-multi/12565/
http://qa-proxy.ceph.com/teuthology/loic-2016-02-16_22:00:52-rados-hammer-backports---basic-multi/12565/remote/smithi011/log/valgrind/osd.0.log.gz
<error>
<unique>0x1</unique>
<tid>1</tid>
<kind>SyscallParam</kind>
<what>Syscall param msync(start) points to uninitialised byte(s)</what>
<stack>
<frame>
<ip>0x609A8F0</ip>
<obj>/usr/lib64/libpthread-2.17.so</obj>
<fn>__msync_nocancel</fn>
</frame>
<frame>
<ip>0x78B7F63</ip>
<obj>/usr/lib64/libunwind.so.8.0.1</obj>
</frame>
<frame>
<ip>0x78BAEAE</ip>
<obj>/usr/lib64/libunwind.so.8.0.1</obj>
</frame>
<frame>
<ip>0x78BC181</ip>
<obj>/usr/lib64/libunwind.so.8.0.1</obj>
</frame>
<frame>
<ip>0x78BC518</ip>
<obj>/usr/lib64/libunwind.so.8.0.1</obj>
</frame>
<frame>
<ip>0x78B8900</ip>
<obj>/usr/lib64/libunwind.so.8.0.1</obj>
<fn>_ULx86_64_step</fn>
</frame>
<frame>
<ip>0x58E88CA</ip>
<obj>/usr/lib64/libtcmalloc.so.4.2.6</obj>
</frame>
<frame>
<ip>0x58E90BD</ip>
<obj>/usr/lib64/libtcmalloc.so.4.2.6</obj>
<fn>GetStackTrace(void**, int, int)</fn>
</frame>
<frame>
<ip>0x58DA313</ip>
<obj>/usr/lib64/libtcmalloc.so.4.2.6</obj>
<fn>tcmalloc::PageHeap::GrowHeap(unsigned long)</fn>
</frame>
<frame>
<ip>0x58DA632</ip>
<obj>/usr/lib64/libtcmalloc.so.4.2.6</obj>
<fn>tcmalloc::PageHeap::New(unsigned long)</fn>
</frame>
<frame>
<ip>0x58D8F63</ip>
<obj>/usr/lib64/libtcmalloc.so.4.2.6</obj>
<fn>tcmalloc::CentralFreeList::Populate()</fn>
</frame>
<frame>
<ip>0x58D9147</ip>
<obj>/usr/lib64/libtcmalloc.so.4.2.6</obj>
<fn>tcmalloc::CentralFreeList::FetchFromOneSpansSafe(int, void**, void**)</fn>
</frame>
<frame>
<ip>0x58D91DC</ip>
<obj>/usr/lib64/libtcmalloc.so.4.2.6</obj>
<fn>tcmalloc::CentralFreeList::RemoveRange(void**, void**, int)</fn>
</frame>
<frame>
<ip>0x58DC234</ip>
<obj>/usr/lib64/libtcmalloc.so.4.2.6</obj>
<fn>tcmalloc::ThreadCache::FetchFromCentralCache(unsigned long, unsigned long)</fn>
</frame>
<frame>
<ip>0x58EC7AF</ip>
<obj>/usr/lib64/libtcmalloc.so.4.2.6</obj>
<fn>operator new(unsigned long)</fn>
</frame>
<frame>
<ip>0x66C0A4D</ip>
<obj>/usr/lib64/libboost_thread-mt.so.1.53.0</obj>
<fn>boost::exception_ptr boost::exception_detail::get_static_exception_object<boost::exception_detail::bad_alloc_>()</fn>
</frame>
<frame>
<ip>0x66BC829</ip>
<obj>/usr/lib64/libboost_thread-mt.so.1.53.0</obj>
</frame>
<frame>
<ip>0x400F3A2</ip>
<obj>/usr/lib64/ld-2.17.so</obj>
<fn>_dl_init</fn>
</frame>
<frame>
<ip>0x4001469</ip>
<obj>/usr/lib64/ld-2.17.so</obj>
</frame>
<frame>
<ip>0x3</ip>
</frame>
<frame>
<ip>0xFFF000CC2</ip>
</frame>
<frame>
<ip>0xFFF000CCB</ip>
</frame>
<frame>
<ip>0xFFF000CCE</ip>
</frame>
<frame>
<ip>0xFFF000CD1</ip>
</frame>
</stack>
<auxwhat>Address 0xfff000000 is on thread 1's stack</auxwhat>
</error>
- Project changed from Ceph to sepia
- Subject changed from hammer: saw valgrind issues SyscallParam (CentoOS 7) to hammer: CentOS 7 package notcmalloc is actually compiled with tcmalloc
- Description updated (diff)
- Project changed from sepia to Ceph
- Subject changed from hammer: CentOS 7 package notcmalloc is actually compiled with tcmalloc to hammer: CentOS 7 tcmalloc::ThreadCache valgrind error libboost_thread-mt.so.1.53
- Description updated (diff)
- Status changed from New to 12
- Release set to hammer
- Status changed from 12 to New
- Release deleted (hammer)
- Status changed from New to 12
- Release set to hammer
My theory is that this error happens because something changed in the CentOS 7.2 packages. We already know that tcmalloc creates false positives in valgrind, and since it is used by the boost library it's not surprising that we see a valgrind issue. We should check why we did not have that before (but CentOS 7.2 was added recently, so it may not be easy), whether it also happens on master / jewel runs, whether there is already a suppression somewhere in these branches that deals with the issue, or whether a package upgrade fixes it system wide for CentOS 7.2.
- Priority changed from Normal to Urgent
- Assignee set to Loïc Dachary
The problem here is that it didn't use the notcmalloc package. tcmalloc and valgrind do not mix.
- Status changed from 12 to Can't reproduce
teuthology pulled packages from the right URL, but apparently the package was built incorrectly because it still linked against tcmalloc.
The gitbuilder.ceph.com build is gone, though, so I can't verify.
Mark this "Can't reproduce" until we see it again?
- Status changed from Can't reproduce to 12
- Status changed from 12 to Fix Under Review
- Status changed from Fix Under Review to Resolved
- Related to Bug #15117: hammer: CentOS 7 tcmalloc::ThreadCache valgrind error added
- Status changed from Resolved to Fix Under Review
This happens when the osd is dynamically linked against tcmalloc and libboost_thread, and libboost_thread is dynamically linked against libc. My guess is that libtcmalloc was loaded before libc, so the `malloc()` from libc is overridden by the one from libtcmalloc.
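One way to test this guess is to ask the glibc dynamic loader which library each `malloc` reference is bound to. The sketch below assumes the glibc `LD_DEBUG=bindings` output format; the helper name `malloc_bindings` and the output path `/tmp/ld.bind` are made up for illustration.

```shell
# Reads LD_DEBUG=bindings output on stdin and prints "<referencing file> -> <providing file>"
# for every binding of the symbol `malloc'. Hypothetical helper, not part of Ceph.
malloc_bindings() {
    grep -F "symbol \`malloc'" | awk '{ print $4 " -> " $7 }'
}

# Possible usage (assumes glibc and an installed ceph-osd):
#   LD_DEBUG=bindings LD_DEBUG_OUTPUT=/tmp/ld.bind ceph-osd --version >/dev/null 2>&1
#   cat /tmp/ld.bind.* | malloc_bindings | sort -u
```

If the right-hand side shows libtcmalloc for libboost_thread, the allocator is being interposed at load time rather than linked into boost itself.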
@Kefu Chai, it's happening on the notcmalloc jobs, so the osd should not be linked against tcmalloc.
The osd is not linked against tcmalloc, but it uses a library that is, and there really is nothing the compilation process can do to avoid that because there does not exist a version of libboost_thread that is not linked against tcmalloc. I guess it should be reported as a compilation bug to the packager, with a suggestion to compile with no dependency on tcmalloc.
[ubuntu@smithi011 ~]$ ldd /usr/lib64/libboost_thread-mt.so.1.53.0
linux-vdso.so.1 => (0x00007ffcc139b000)
libboost_system-mt.so.1.53.0 => /lib64/libboost_system-mt.so.1.53.0 (0x00007f51d6118000)
librt.so.1 => /lib64/librt.so.1 (0x00007f51d5f10000)
libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f51d5c07000)
libm.so.6 => /lib64/libm.so.6 (0x00007f51d5905000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f51d56ef000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f51d54d2000)
libc.so.6 => /lib64/libc.so.6 (0x00007f51d5111000)
/lib64/ld-linux-x86-64.so.2 (0x00007f51d6541000)
[ubuntu@smithi011 ~]$ ldd /usr/lib64/libboost_system-mt.so.1.53.0
linux-vdso.so.1 => (0x00007ffe764fb000)
librt.so.1 => /lib64/librt.so.1 (0x00007f5d5bed8000)
libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f5d5bbd0000)
libm.so.6 => /lib64/libm.so.6 (0x00007f5d5b8cd000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f5d5b6b7000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f5d5b49b000)
libc.so.6 => /lib64/libc.so.6 (0x00007f5d5b0d9000)
/lib64/ld-linux-x86-64.so.2 (0x00007f5d5c2f2000)
[ubuntu@smithi011 ~]$ rpm -qR boost-thread
/sbin/ldconfig
/sbin/ldconfig
boost-system(x86-64) = 1.53.0-25.el7
libboost_system-mt.so.1.53.0()(64bit)
libc.so.6()(64bit)
libc.so.6(GLIBC_2.2.5)(64bit)
libc.so.6(GLIBC_2.4)(64bit)
libgcc_s.so.1()(64bit)
libgcc_s.so.1(GCC_3.0)(64bit)
libm.so.6()(64bit)
libpthread.so.0()(64bit)
libpthread.so.0(GLIBC_2.2.5)(64bit)
libpthread.so.0(GLIBC_2.3.2)(64bit)
librt.so.1()(64bit)
librt.so.1(GLIBC_2.2.5)(64bit)
libstdc++.so.6()(64bit)
libstdc++.so.6(CXXABI_1.3)(64bit)
libstdc++.so.6(GLIBCXX_3.4)(64bit)
libstdc++.so.6(GLIBCXX_3.4.9)(64bit)
rpmlib(CompressedFileNames) <= 3.0.4-1
rpmlib(FileDigests) <= 4.6.0-1
rpmlib(PayloadFilesHavePrefix) <= 4.0-1
rtld(GNU_HASH)
rpmlib(PayloadIsXz) <= 5.2-1
badone: smithfarm, loicd: would LD_DEBUG=libs be the answer to find who loads libtcmalloc?
# lsb_release -a
LSB Version: :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: CentOS
Description: CentOS Linux release 7.2.1511 (Core)
Release: 7.2.1511
Codename: Core
# rpm -q ceph
ceph-0.94.7-149.g08277b7.x86_64
Pulled from http://gitbuilder.ceph.com/ceph-rpm-centos7-x86_64-notcmalloc/sha1/08277b7bc7c0e533c3fd56a0040dc0ddc74637d6/
# gdb $(which ceph-osd)
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/bin/ceph-osd...Reading symbols from /usr/lib/debug/usr/bin/ceph-osd.debug...done.
done.
(gdb) b main
r
Breakpoint 1 at 0x6471e0: file ceph_osd.cc, line 95.
(gdb) r
Starting program: /usr/bin/ceph-osd
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Breakpoint 1, main (argc=1, argv=0x7fffffffe118) at ceph_osd.cc:95
95 {
Missing separate debuginfos, use: debuginfo-install boost-random-1.53.0-25.el7.x86_64 boost-system-1.53.0-25.el7.x86_64 boost-thread-1.53.0-25.el7.x86_64 bzip2-libs-1.0.6-13.el7.x86_64 glibc-2.17-106.el7_2.8.x86_64 gperftools-libs-2.4-7.el7.x86_64 leveldb-1.12.0-11.el7.x86_64 libaio-0.3.109-13.el7.x86_64 libgcc-4.8.5-4.el7.x86_64 libstdc++-4.8.5-4.el7.x86_64 libunwind-1.1-5.el7_2.2.x86_64 lttng-ust-2.4.1-1.el7.x86_64 nspr-4.11.0-1.el7_2.x86_64 nss-3.21.0-9.el7_2.x86_64 nss-util-3.21.0-2.2.el7_2.x86_64 snappy-1.1.0-3.el7.x86_64 userspace-rcu-0.7.16-1.el7.x86_64 zlib-1.2.7-15.el7.x86_64
(gdb) info shared
From To Syms Read Shared Object Library
0x00007ffff7ddbae0 0x00007ffff7df627a Yes (*) /lib64/ld-linux-x86-64.so.2
0x00007ffff7bd95a0 0x00007ffff7bd977d Yes (*) /lib64/libaio.so.1
0x00007ffff79c5170 0x00007ffff79d16f0 Yes (*) /lib64/libz.so.1
0x00007ffff77b4760 0x00007ffff77c05f0 Yes (*) /lib64/libbz2.so.1
0x00007ffff7573310 0x00007ffff75a2b34 Yes (*) /lib64/libleveldb.so.1
0x00007ffff735ad50 0x00007ffff735c704 Yes (*) /lib64/libsnappy.so.1
0x00007ffff70fba40 0x00007ffff711bb3c Yes (*) /lib64/libtcmalloc.so.4 <---------******
0x00007ffff6dd8670 0x00007ffff6ea9fd4 Yes (*) /lib64/libnss3.so
0x00007ffff6b8ed10 0x00007ffff6baec70 Yes (*) /lib64/libnspr4.so
0x00007ffff696b8a0 0x00007ffff6976514 Yes (*) /lib64/libpthread.so.0
0x00007ffff6762ed0 0x00007ffff67639d0 Yes (*) /lib64/libdl.so.2
0x00007ffff65567b0 0x00007ffff655c994 Yes (*) /lib64/libboost_thread-mt.so.1.53.0
0x00007ffff63482d0 0x00007ffff6348e44 Yes (*) /lib64/libboost_system-mt.so.1.53.0
0x00007ffff61451d0 0x00007ffff61459a0 Yes (*) /lib64/libboost_random.so.1.53.0
0x00007ffff5f3e2c0 0x00007ffff5f410bc Yes (*) /lib64/librt.so.1
0x00007ffff5c8f510 0x00007ffff5cf659a Yes (*) /lib64/libstdc++.so.6
0x00007ffff59374b0 0x00007ffff59a19e8 Yes (*) /lib64/libm.so.6
0x00007ffff571eaf0 0x00007ffff572e298 Yes (*) /lib64/libgcc_s.so.1
0x00007ffff53793e0 0x00007ffff54bd670 Yes (*) /lib64/libc.so.6
0x00007ffff51415a0 0x00007ffff51482be Yes (*) /lib64/libunwind.so.8
0x00007ffff4f1fe60 0x00007ffff4f2e6e8 Yes (*) /lib64/libnssutil3.so
0x00007ffff4d10510 0x00007ffff4d11b38 Yes (*) /lib64/libplc4.so
0x00007ffff4b0c090 0x00007ffff4b0d028 Yes (*) /lib64/libplds4.so
0x00007ffff48f2750 0x00007ffff48f8494 Yes (*) /lib64/liblttng-ust-tracepoint.so.0
0x00007ffff46eb040 0x00007ffff46edb7c Yes (*) /lib64/liburcu-bp.so.1
0x00007ffff44e3540 0x00007ffff44e63ac Yes (*) /lib64/liburcu-cds.so.1
0x00007ffff42dfb50 0x00007ffff42e022c Yes (*) /lib64/liburcu-common.so.1
(*): Shared library is missing debugging information.
So we know it is loaded by the time we enter main().
# LD_DEBUG=all LD_DEBUG_OUTPUT=/tmp/ld.out $(which ceph-osd) &>/dev/null
# grep tcmalloc /tmp/ld.out.10359|head -1
10359: file=libtcmalloc.so.4 [0]; needed by /usr/bin/ceph-osd [0]
So that says that libtcmalloc.so.4 is linked into ceph-osd and, indeed, it is.
# ldd /usr/bin/ceph-osd|grep malloc
libtcmalloc.so.4 => /lib64/libtcmalloc.so.4 (0x00007fc52ab61000)
So AFAICT, the build system is linking libtcmalloc despite notcmalloc being specified.
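The manual `ldd` check above could be turned into a small guard for notcmalloc builds. This is only a sketch: the function names `links_tcmalloc` and `check_notcmalloc` are hypothetical, and it relies on `ldd` listing indirect dependencies as well as direct ones.

```shell
# Reads `ldd` output on stdin; exit status 0 means a tcmalloc library is listed.
links_tcmalloc() {
    grep -q 'libtcmalloc'
}

# $1: path to a binary. Fails if the binary pulls in tcmalloc, directly or transitively.
check_notcmalloc() {
    if ldd "$1" | links_tcmalloc; then
        echo "FAIL: $1 is linked against tcmalloc" >&2
        return 1
    fi
    echo "OK: $1 is not linked against tcmalloc"
}

# Possible usage on a test node:
#   check_notcmalloc /usr/bin/ceph-osd
```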
With the help of dmick we've established rpmbuild is not honouring "notcmalloc" and Dan came up with a fix.
I'm in the process of testing a few things and will update once that is done.
Yes, after far too long an investigation, I think the problem is that ceph.spec.in has no interface for passing in the --without-tcmalloc setting.
build-ceph-rpm.sh tries to accomplish this by setting CEPH_EXTRA_CONFIGURE_ARGS in the environment, but ceph.spec.in currently doesn't pay attention to this. Adding it to the invocation of configure does the job, although I'm still interested in the history of how this flag was ever passed.
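As a rough illustration of the mechanism described here, the sketch below shows an environment variable being appended to a configure command line. It is not the actual ceph.spec.in text; `configure_cmdline` and the `--prefix=/usr` argument are stand-ins, and only `CEPH_EXTRA_CONFIGURE_ARGS` and `--without-tcmalloc` come from this thread.

```shell
# Hypothetical stand-in for the configure invocation in the spec's build step.
# It prints the command line instead of running it.
configure_cmdline() {
    # Left unquoted on purpose so that several space-separated flags
    # are split into separate arguments.
    echo ./configure --prefix=/usr ${CEPH_EXTRA_CONFIGURE_ARGS:-}
}

# Without the variable nothing extra is passed, which matches the broken behaviour.
CEPH_EXTRA_CONFIGURE_ARGS="" configure_cmdline
# With the variable set by the build script, the flag reaches configure.
CEPH_EXTRA_CONFIGURE_ARGS="--without-tcmalloc" configure_cmdline
```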
See 51abff11688f0201b8f4076ac515e4515929d4cb. Never got backported to hammer.
In the gitbuilder build of the commit posted above:
=== configuring in src/rocksdb (/srv/autobuild-ceph/gitbuilder.git/build/rpmbuild/BUILD/ceph-0.94.7/src/rocksdb)
configure: running /bin/sh ./configure --disable-option-checking '--prefix=/usr' '--build=x86_64-redhat-linux-gnu' '--host=x86_64-redhat-linux-gnu' '--program-prefix=' '--disable-dependency-tracking' '--exec-prefix=/usr' '--bindir=/usr/bin' '--sbindir=/usr/sbin' '--datadir=/usr/share' '--includedir=/usr/include' '--libdir=/usr/lib64' '--libexecdir=/usr/libexec' '--sharedstatedir=/var/lib' '--mandir=/usr/share/man' '--infodir=/usr/share/info' 'CPPFLAGS= -I/usr/lib/jvm/java/include -I/usr/lib/jvm/java/include/linux' '--localstatedir=/var' '--sysconfdir=/etc' '--docdir=/usr/share/doc/ceph' '--with-nss' '--with-rest-bench' '--with-debug' '--enable-cephfs-java' '--with-librocksdb-static=check' '--without-cryptopp' '--without-tcmalloc' '--with-radosgw' 'CFLAGS=-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic' 'CXXFLAGS=-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic' 'build_alias=x86_64-redhat-linux-gnu' 'host_alias=x86_64-redhat-linux-gnu' 'LDFLAGS=-Wl,-z,relro ' 'PKG_CONFIG_PATH=:/usr/lib64/pkgconfig:/usr/share/pkgconfig' --cache-file=/dev/null --srcdir=.
...
checking for malloc in -ltcmalloc... yes
...
see http://gitbuilder.sepia.ceph.com/gitbuilder-ceph-rpm-centos7-amd64-notcmalloc/log.cgi?log=27378d54730a9b844348491cc78df3194f77b65b
We need to disable tcmalloc on the rocksdb side as well, since its configure picks up whatever it finds on the system.
https://github.com/ceph/rocksdb/pull/11 is posted to address this.
After merging it, we need a commit that picks up the change in the rocksdb submodule, plus https://github.com/ceph/ceph/commit/51abff11688f0201b8f4076ac515e4515929d4cb
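To catch a nested configure silently picking up tcmalloc again, a build log could be scanned for the detection line quoted above. A minimal sketch, assuming the log is plain text; `log_found_tcmalloc` and `build.log` are hypothetical names.

```shell
# Reads a build log on stdin; exit status 0 means some configure step
# reported that it found a symbol in -ltcmalloc (or -ltcmalloc_minimal).
log_found_tcmalloc() {
    grep -Eq 'checking for .* in -ltcmalloc(_minimal)?\.\.\. yes'
}

# Possible usage after a notcmalloc build:
#   if log_found_tcmalloc < build.log; then
#       echo "tcmalloc was detected by a configure step" >&2
#   fi
```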
- Tracker changed from Bug to Backport
- Description updated (diff)
- Status changed from Fix Under Review to In Progress
- Status changed from In Progress to Resolved
- Target version set to v0.94.8
Loic, if we include this in 0.94.8, doesn't that mean we need to change the SHA1 in #15895 and restart QE validation?
@Nathan Weinberg you're correct, that deserves a rados run to verify that it does not introduce a problem. Given the nature of the change I don't believe it requires more than that.