Backport #14799

closed

hammer: CentOS 7 tcmalloc::ThreadCache valgrind error libboost_thread-mt.so.1.53

Added by Loïc Dachary about 8 years ago. Updated over 7 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Target version:
Release: hammer
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

https://github.com/ceph/ceph/pull/10750

libboost_thread-mt.so.1.53 uses tcmalloc::ThreadCache which triggers a valgrind error. It should probably be suppressed.

http://pulpito.ceph.com/loic-2016-02-16_22:00:52-rados-hammer-backports---basic-multi/12565/

http://qa-proxy.ceph.com/teuthology/loic-2016-02-16_22:00:52-rados-hammer-backports---basic-multi/12565/remote/smithi011/log/valgrind/osd.0.log.gz

<error>
  <unique>0x1</unique>
  <tid>1</tid>
  <kind>SyscallParam</kind>
  <what>Syscall param msync(start) points to uninitialised byte(s)</what>
  <stack>
    <frame>
      <ip>0x609A8F0</ip>
      <obj>/usr/lib64/libpthread-2.17.so</obj>
      <fn>__msync_nocancel</fn>
    </frame>
    <frame>
      <ip>0x78B7F63</ip>
      <obj>/usr/lib64/libunwind.so.8.0.1</obj>
    </frame>
    <frame>
      <ip>0x78BAEAE</ip>
      <obj>/usr/lib64/libunwind.so.8.0.1</obj>
    </frame>
    <frame>
      <ip>0x78BC181</ip>
      <obj>/usr/lib64/libunwind.so.8.0.1</obj>
    </frame>
    <frame>
      <ip>0x78BC518</ip>
      <obj>/usr/lib64/libunwind.so.8.0.1</obj>
    </frame>
    <frame>
      <ip>0x78B8900</ip>
      <obj>/usr/lib64/libunwind.so.8.0.1</obj>
      <fn>_ULx86_64_step</fn>
    </frame>
    <frame>
      <ip>0x58E88CA</ip>
      <obj>/usr/lib64/libtcmalloc.so.4.2.6</obj>
    </frame>
    <frame>
      <ip>0x58E90BD</ip>
      <obj>/usr/lib64/libtcmalloc.so.4.2.6</obj>
      <fn>GetStackTrace(void**, int, int)</fn>
    </frame>
    <frame>
      <ip>0x58DA313</ip>
      <obj>/usr/lib64/libtcmalloc.so.4.2.6</obj>
      <fn>tcmalloc::PageHeap::GrowHeap(unsigned long)</fn>
    </frame>
    <frame>
      <ip>0x58DA632</ip>
      <obj>/usr/lib64/libtcmalloc.so.4.2.6</obj>
      <fn>tcmalloc::PageHeap::New(unsigned long)</fn>
    </frame>
    <frame>
      <ip>0x58D8F63</ip>
      <obj>/usr/lib64/libtcmalloc.so.4.2.6</obj>
      <fn>tcmalloc::CentralFreeList::Populate()</fn>
    </frame>
    <frame>
      <ip>0x58D9147</ip>
      <obj>/usr/lib64/libtcmalloc.so.4.2.6</obj>
      <fn>tcmalloc::CentralFreeList::FetchFromOneSpansSafe(int, void**, void**)</fn>
    </frame>
    <frame>
      <ip>0x58D91DC</ip>
      <obj>/usr/lib64/libtcmalloc.so.4.2.6</obj>
      <fn>tcmalloc::CentralFreeList::RemoveRange(void**, void**, int)</fn>
    </frame>
    <frame>
      <ip>0x58DC234</ip>
      <obj>/usr/lib64/libtcmalloc.so.4.2.6</obj>
      <fn>tcmalloc::ThreadCache::FetchFromCentralCache(unsigned long, unsigned long)</fn>
    </frame>
    <frame>
      <ip>0x58EC7AF</ip>
      <obj>/usr/lib64/libtcmalloc.so.4.2.6</obj>
      <fn>operator new(unsigned long)</fn>
    </frame>
    <frame>
      <ip>0x66C0A4D</ip>
      <obj>/usr/lib64/libboost_thread-mt.so.1.53.0</obj>
      <fn>boost::exception_ptr boost::exception_detail::get_static_exception_object&lt;boost::exception_detail::bad_alloc_&gt;()</fn>
    </frame>
    <frame>
      <ip>0x66BC829</ip>
      <obj>/usr/lib64/libboost_thread-mt.so.1.53.0</obj>
    </frame>
    <frame>
      <ip>0x400F3A2</ip>
      <obj>/usr/lib64/ld-2.17.so</obj>
      <fn>_dl_init</fn>
    </frame>
    <frame>
      <ip>0x4001469</ip>
      <obj>/usr/lib64/ld-2.17.so</obj>
    </frame>
    <frame>
      <ip>0x3</ip>
    </frame>
    <frame>
      <ip>0xFFF000CC2</ip>
    </frame>
    <frame>
      <ip>0xFFF000CCB</ip>
    </frame>
    <frame>
      <ip>0xFFF000CCE</ip>
    </frame>
    <frame>
      <ip>0xFFF000CD1</ip>
    </frame>
  </stack>
  <auxwhat>Address 0xfff000000 is on thread 1's stack</auxwhat>
</error>
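
If the trace above were to be suppressed, as the description suggests, a suppression entry could look roughly like the sketch below. The entry name and the file path are hypothetical, and the frame patterns are an assumption derived only from the stack shown above; it is not the suppression (if any) that was merged.

```shell
# Write a hypothetical valgrind suppression for the msync(start) report above.
# $1: destination file for the suppression entry.
write_tcmalloc_supp() {
    cat > "$1" <<'EOF'
{
   tcmalloc_msync_uninit_hypothetical
   Memcheck:Param
   msync(start)
   fun:__msync_nocancel
   obj:*/libunwind.so.*
   ...
   obj:*/libtcmalloc.so.*
}
EOF
}
```

The file would then be passed to valgrind with `--suppressions=<file>`. Later comments in this thread argue that the real problem is the build linking tcmalloc at all, in which case a suppression would only hide the symptom.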

Related issues 1 (0 open, 1 closed)

Related to Ceph - Bug #15117: hammer: CentOS 7 tcmalloc::ThreadCache valgrind error (Duplicate, 03/14/2016)

Actions #2

Updated by Loïc Dachary about 8 years ago

  • Project changed from Ceph to sepia
  • Subject changed from hammer: saw valgrind issues SyscallParam (CentoOS 7) to hammer: CentOS 7 package notcmalloc is actually compiled with tcmalloc
  • Description updated (diff)
Actions #3

Updated by Loïc Dachary about 8 years ago

  • Project changed from sepia to Ceph
  • Subject changed from hammer: CentOS 7 package notcmalloc is actually compiled with tcmalloc to hammer: CentOS 7 tcmalloc::ThreadCache valgrind error libboost_thread-mt.so.1.53
  • Description updated (diff)
Actions #4

Updated by Loïc Dachary about 8 years ago

  • Status changed from New to 12
  • Release set to hammer
Actions #6

Updated by Loïc Dachary about 8 years ago

  • Status changed from New to 12
  • Release set to hammer
Actions #7

Updated by Loïc Dachary about 8 years ago

My theory is that this error appeared because something changed in the CentOS 7.2 packages. We already know that tcmalloc produces false positives in valgrind, and since the boost library uses it, a valgrind issue is not surprising. We should check why we did not see this before (although CentOS 7.2 was added recently, so that may not be easy), whether it also happens on master / jewel runs, whether those branches already carry a suppression that deals with the issue, and whether a package upgrade fixes it system-wide for CentOS 7.2.

Actions #10

Updated by Samuel Just about 8 years ago

  • Priority changed from Normal to Urgent
Actions #11

Updated by Loïc Dachary about 8 years ago

  • Assignee set to Loïc Dachary
Actions #12

Updated by Sage Weil about 8 years ago

The problem here is that it didn't use the notcmalloc package. tcmalloc and valgrind do not mix.

Actions #13

Updated by Sage Weil about 8 years ago

  • Status changed from 12 to Can't reproduce

teuthology pulled packages from the right URL, but apparently the build was incorrect, because it still linked against tcmalloc.

The gitbuilder.ceph.com build is gone, though, so I can't verify.

Mark this "Can't reproduce" until we see it again?

Actions #14

Updated by Loïc Dachary about 8 years ago

  • Status changed from Can't reproduce to 12

@Sage Weil I think the problem is that libboost itself is linked against tcmalloc. http://pulpito.ceph.com/loic-2016-03-07_21:24:13-fs-hammer-backports---basic-multi/45877/ from a few days ago has the details.

Actions #15

Updated by Loïc Dachary about 8 years ago

  • Status changed from 12 to Fix Under Review
Actions #16

Updated by Sage Weil about 8 years ago

  • Status changed from Fix Under Review to Resolved
Actions #17

Updated by Nathan Cutler almost 8 years ago

  • Related to Bug #15117: hammer: CentOS 7 tcmalloc::ThreadCache valgrind error added
Actions #18

Updated by Loïc Dachary almost 8 years ago

  • Status changed from Resolved to Fix Under Review
Actions #19

Updated by Kefu Chai almost 8 years ago

This happens when the osd is dynamically linked against tcmalloc and libboost_thread, and libboost_thread is dynamically linked against libc. My guess is that libtcmalloc was loaded before libc, so the `malloc()` from libc is overridden by the one from libtcmalloc.
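
A quick way to test that guess is to look at where libtcmalloc sits relative to libc in a binary's dependency list. The helper below is a minimal sketch: the function name is made up for illustration, and it treats the order of `ldd` output as a rough proxy for the dynamic linker's symbol lookup order, which is an assumption rather than a guarantee.

```shell
# Reads `ldd` output on stdin. Exits 0 if libtcmalloc is listed and appears
# before libc (or libc is not listed at all), exits 1 otherwise.
tcmalloc_before_libc() {
    awk '
        /libtcmalloc/ && !t { t = NR }   # first line mentioning libtcmalloc
        /libc\.so/    && !c { c = NR }   # first line mentioning libc.so
        END { exit !(t && (!c || t < c)) }
    '
}
```

For example: `ldd /usr/bin/ceph-osd | tcmalloc_before_libc && echo "malloc likely resolves to tcmalloc"`.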

Actions #20

Updated by Nathan Cutler almost 8 years ago

@Kefu Chai, it's happening on the notcmalloc jobs, so the osd should not be linked against tcmalloc.

Actions #21

Updated by Loïc Dachary almost 8 years ago

The osd is not linked against tcmalloc, but it uses a library that is. There is really nothing the compilation process can do to avoid that, because no version of libboost_thread exists that is not linked against tcmalloc. I guess it should be reported to the packager as a compilation bug, with a suggestion to compile without any dependency on tcmalloc.

Actions #22

Updated by Kefu Chai over 7 years ago

[ubuntu@smithi011 ~]$ ldd /usr/lib64/libboost_thread-mt.so.1.53.0
    linux-vdso.so.1 =>  (0x00007ffcc139b000)
    libboost_system-mt.so.1.53.0 => /lib64/libboost_system-mt.so.1.53.0 (0x00007f51d6118000)
    librt.so.1 => /lib64/librt.so.1 (0x00007f51d5f10000)
    libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f51d5c07000)
    libm.so.6 => /lib64/libm.so.6 (0x00007f51d5905000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f51d56ef000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f51d54d2000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f51d5111000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f51d6541000)
[ubuntu@smithi011 ~]$ ldd /usr/lib64/libboost_system-mt.so.1.53.0
    linux-vdso.so.1 =>  (0x00007ffe764fb000)
    librt.so.1 => /lib64/librt.so.1 (0x00007f5d5bed8000)
    libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f5d5bbd0000)
    libm.so.6 => /lib64/libm.so.6 (0x00007f5d5b8cd000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f5d5b6b7000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f5d5b49b000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f5d5b0d9000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f5d5c2f2000)
[ubuntu@smithi011 ~]$ rpm -qR boost-thread
/sbin/ldconfig
/sbin/ldconfig
boost-system(x86-64) = 1.53.0-25.el7
libboost_system-mt.so.1.53.0()(64bit)
libc.so.6()(64bit)
libc.so.6(GLIBC_2.2.5)(64bit)
libc.so.6(GLIBC_2.4)(64bit)
libgcc_s.so.1()(64bit)
libgcc_s.so.1(GCC_3.0)(64bit)
libm.so.6()(64bit)
libpthread.so.0()(64bit)
libpthread.so.0(GLIBC_2.2.5)(64bit)
libpthread.so.0(GLIBC_2.3.2)(64bit)
librt.so.1()(64bit)
librt.so.1(GLIBC_2.2.5)(64bit)
libstdc++.so.6()(64bit)
libstdc++.so.6(CXXABI_1.3)(64bit)
libstdc++.so.6(GLIBCXX_3.4)(64bit)
libstdc++.so.6(GLIBCXX_3.4.9)(64bit)
rpmlib(CompressedFileNames) <= 3.0.4-1
rpmlib(FileDigests) <= 4.6.0-1
rpmlib(PayloadFilesHavePrefix) <= 4.0-1
rtld(GNU_HASH)
rpmlib(PayloadIsXz) <= 5.2-1
Actions #23

Updated by Nathan Cutler over 7 years ago

badone: smithfarm, loicd: would LD_DEBUG=libs be the answer to find who loads libtcmalloc?
Actions #24

Updated by Brad Hubbard over 7 years ago

# lsb_release -a
LSB Version:    :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: CentOS
Description:    CentOS Linux release 7.2.1511 (Core) 
Release:        7.2.1511
Codename:       Core

# rpm -q ceph
ceph-0.94.7-149.g08277b7.x86_64

Pulled from http://gitbuilder.ceph.com/ceph-rpm-centos7-x86_64-notcmalloc/sha1/08277b7bc7c0e533c3fd56a0040dc0ddc74637d6/

# gdb $(which ceph-osd)
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying" 
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/bin/ceph-osd...Reading symbols from /usr/lib/debug/usr/bin/ceph-osd.debug...done.
done.
(gdb) b main
r
Breakpoint 1 at 0x6471e0: file ceph_osd.cc, line 95.
(gdb) r
Starting program: /usr/bin/ceph-osd 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Breakpoint 1, main (argc=1, argv=0x7fffffffe118) at ceph_osd.cc:95
95      {
Missing separate debuginfos, use: debuginfo-install boost-random-1.53.0-25.el7.x86_64 boost-system-1.53.0-25.el7.x86_64 boost-thread-1.53.0-25.el7.x86_64 bzip2-libs-1.0.6-13.el7.x86_64 glibc-2.17-106.el7_2.8.x86_64 gperftools-libs-2.4-7.el7.x86_64 leveldb-1.12.0-11.el7.x86_64 libaio-0.3.109-13.el7.x86_64 libgcc-4.8.5-4.el7.x86_64 libstdc++-4.8.5-4.el7.x86_64 libunwind-1.1-5.el7_2.2.x86_64 lttng-ust-2.4.1-1.el7.x86_64 nspr-4.11.0-1.el7_2.x86_64 nss-3.21.0-9.el7_2.x86_64 nss-util-3.21.0-2.2.el7_2.x86_64 snappy-1.1.0-3.el7.x86_64 userspace-rcu-0.7.16-1.el7.x86_64 zlib-1.2.7-15.el7.x86_64
(gdb) info shared
From                To                  Syms Read   Shared Object Library
0x00007ffff7ddbae0  0x00007ffff7df627a  Yes (*)     /lib64/ld-linux-x86-64.so.2
0x00007ffff7bd95a0  0x00007ffff7bd977d  Yes (*)     /lib64/libaio.so.1
0x00007ffff79c5170  0x00007ffff79d16f0  Yes (*)     /lib64/libz.so.1
0x00007ffff77b4760  0x00007ffff77c05f0  Yes (*)     /lib64/libbz2.so.1
0x00007ffff7573310  0x00007ffff75a2b34  Yes (*)     /lib64/libleveldb.so.1
0x00007ffff735ad50  0x00007ffff735c704  Yes (*)     /lib64/libsnappy.so.1
0x00007ffff70fba40  0x00007ffff711bb3c  Yes (*)     /lib64/libtcmalloc.so.4 <---------******
0x00007ffff6dd8670  0x00007ffff6ea9fd4  Yes (*)     /lib64/libnss3.so
0x00007ffff6b8ed10  0x00007ffff6baec70  Yes (*)     /lib64/libnspr4.so
0x00007ffff696b8a0  0x00007ffff6976514  Yes (*)     /lib64/libpthread.so.0
0x00007ffff6762ed0  0x00007ffff67639d0  Yes (*)     /lib64/libdl.so.2
0x00007ffff65567b0  0x00007ffff655c994  Yes (*)     /lib64/libboost_thread-mt.so.1.53.0
0x00007ffff63482d0  0x00007ffff6348e44  Yes (*)     /lib64/libboost_system-mt.so.1.53.0
0x00007ffff61451d0  0x00007ffff61459a0  Yes (*)     /lib64/libboost_random.so.1.53.0
0x00007ffff5f3e2c0  0x00007ffff5f410bc  Yes (*)     /lib64/librt.so.1
0x00007ffff5c8f510  0x00007ffff5cf659a  Yes (*)     /lib64/libstdc++.so.6
0x00007ffff59374b0  0x00007ffff59a19e8  Yes (*)     /lib64/libm.so.6
0x00007ffff571eaf0  0x00007ffff572e298  Yes (*)     /lib64/libgcc_s.so.1
0x00007ffff53793e0  0x00007ffff54bd670  Yes (*)     /lib64/libc.so.6
0x00007ffff51415a0  0x00007ffff51482be  Yes (*)     /lib64/libunwind.so.8
0x00007ffff4f1fe60  0x00007ffff4f2e6e8  Yes (*)     /lib64/libnssutil3.so
0x00007ffff4d10510  0x00007ffff4d11b38  Yes (*)     /lib64/libplc4.so
0x00007ffff4b0c090  0x00007ffff4b0d028  Yes (*)     /lib64/libplds4.so
0x00007ffff48f2750  0x00007ffff48f8494  Yes (*)     /lib64/liblttng-ust-tracepoint.so.0
0x00007ffff46eb040  0x00007ffff46edb7c  Yes (*)     /lib64/liburcu-bp.so.1
0x00007ffff44e3540  0x00007ffff44e63ac  Yes (*)     /lib64/liburcu-cds.so.1
0x00007ffff42dfb50  0x00007ffff42e022c  Yes (*)     /lib64/liburcu-common.so.1
(*): Shared library is missing debugging information.

So we know it is loaded by the time we enter main().

# LD_DEBUG=all LD_DEBUG_OUTPUT=/tmp/ld.out $(which ceph-osd) &>/dev/null
# grep tcmalloc /tmp/ld.out.10359|head -1
     10359:     file=libtcmalloc.so.4 [0];  needed by /usr/bin/ceph-osd [0]

So that says that libtcmalloc.so.4 is linked into ceph-osd and, indeed, it is.

# ldd /usr/bin/ceph-osd|grep malloc
        libtcmalloc.so.4 => /lib64/libtcmalloc.so.4 (0x00007fc52ab61000)

So AFAICT, the build system is linking libtcmalloc despite notcmalloc being specified.
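
The `grep` step above can be wrapped in a small helper that prints only the object that requested a library. This is a sketch: the function name is hypothetical, and it assumes the `file=... needed by ...` line format shown in the LD_DEBUG output above.

```shell
# Reads LD_DEBUG output on stdin and prints each distinct object that
# requested a library whose name matches $1 (for example libtcmalloc).
who_needs() {
    grep "file=.*$1" | sed -n 's/.*needed by \([^ ]*\).*/\1/p' | sort -u
}
```

For example: `cat /tmp/ld.out.* | who_needs libtcmalloc`. If only `/usr/bin/ceph-osd` is printed, the dependency comes from the binary itself rather than from one of its libraries.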
Actions #25

Updated by Brad Hubbard over 7 years ago

With the help of dmick we've established rpmbuild is not honouring "notcmalloc" and Dan came up with a fix.

I'm in the process of testing a few things and will update once that is done.

Actions #26

Updated by Dan Mick over 7 years ago

Yes, after far too long an investigation, I think the problem is that ceph.spec.in has no interface for passing in the --without-tcmalloc setting.

build-ceph-rpm.sh tries to accomplish this by setting CEPH_EXTRA_CONFIGURE_ARGS in the environment, but ceph.spec.in currently doesn't pay attention to this. Adding it to the invocation of configure does the job, although I'm still interested in the history of how this flag was ever passed.
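
The mechanism described here, appending an environment variable to the configure invocation, can be illustrated with the sketch below. It is not the actual ceph.spec.in change: `configure` is replaced by a stand-in function that echoes its arguments, `run_configure` is a hypothetical name, and the fixed flags are only examples.

```shell
# Stand-in for ./configure so the sketch runs anywhere: it echoes its arguments.
configure() { echo "configure $*"; }

# Hypothetical build step: append caller-supplied flags to the configure call.
run_configure() {
    # The variable is deliberately unquoted so it can carry several flags.
    configure --prefix=/usr --with-nss $CEPH_EXTRA_CONFIGURE_ARGS
}
```

With `CEPH_EXTRA_CONFIGURE_ARGS=--without-tcmalloc` in the environment, the flag reaches configure. If the variable is unset, the invocation is unchanged.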

Actions #27

Updated by Dan Mick over 7 years ago

See 51abff11688f0201b8f4076ac515e4515929d4cb. Never got backported to hammer.

Actions #28

Updated by Brad Hubbard over 7 years ago

https://github.com/ceph/ceph/commit/51abff11688f0201b8f4076ac515e4515929d4cb

I ran a gitbuilder with 51abff11688f0201b8f4076ac515e4515929d4cb but it still seems to link to libtcmalloc, still investigating.

Actions #29

Updated by Kefu Chai over 7 years ago

In the gitbuilder build of the commit posted above:

=== configuring in src/rocksdb (/srv/autobuild-ceph/gitbuilder.git/build/rpmbuild/BUILD/ceph-0.94.7/src/rocksdb)
configure: running /bin/sh ./configure --disable-option-checking '--prefix=/usr' '--build=x86_64-redhat-linux-gnu' '--host=x86_64-redhat-linux-gnu' '--program-prefix=' '--disable-dependency-tracking' '--exec-prefix=/usr' '--bindir=/usr/bin' '--sbindir=/usr/sbin' '--datadir=/usr/share' '--includedir=/usr/include' '--libdir=/usr/lib64' '--libexecdir=/usr/libexec' '--sharedstatedir=/var/lib' '--mandir=/usr/share/man' '--infodir=/usr/share/info' 'CPPFLAGS= -I/usr/lib/jvm/java/include -I/usr/lib/jvm/java/include/linux' '--localstatedir=/var' '--sysconfdir=/etc' '--docdir=/usr/share/doc/ceph' '--with-nss' '--with-rest-bench' '--with-debug' '--enable-cephfs-java' '--with-librocksdb-static=check' '--without-cryptopp' '--without-tcmalloc' '--with-radosgw' 'CFLAGS=-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic' 'CXXFLAGS=-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic' 'build_alias=x86_64-redhat-linux-gnu' 'host_alias=x86_64-redhat-linux-gnu' 'LDFLAGS=-Wl,-z,relro ' 'PKG_CONFIG_PATH=:/usr/lib64/pkgconfig:/usr/share/pkgconfig' --cache-file=/dev/null --srcdir=.
...
checking for malloc in -ltcmalloc... yes
...

see http://gitbuilder.sepia.ceph.com/gitbuilder-ceph-rpm-centos7-amd64-notcmalloc/log.cgi?log=27378d54730a9b844348491cc78df3194f77b65b

We also need to disable tcmalloc on the rocksdb side, as it picks up whatever it finds on the system.

https://github.com/ceph/rocksdb/pull/11 is posted to address this.

After merging it, we need a commit that picks up the change in the rocksdb submodule and https://github.com/ceph/ceph/commit/51abff11688f0201b8f4076ac515e4515929d4cb
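
To catch this kind of regression in a build log, one could scan for the configure check quoted above. The function name is hypothetical, and the sketch assumes the autoconf message keeps the exact wording shown in this log.

```shell
# Reads a build log on stdin. Exits 0 if any configure run detected tcmalloc.
log_found_tcmalloc() {
    grep -q 'checking for malloc in -ltcmalloc\.\.\. yes'
}
```

For example, with a hypothetical `build.log`: `log_found_tcmalloc < build.log && echo "a sub-configure still picked up tcmalloc"`.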

Actions #31

Updated by Loïc Dachary over 7 years ago

  • Tracker changed from Bug to Backport
Actions #32

Updated by Loïc Dachary over 7 years ago

  • Description updated (diff)
Actions #33

Updated by Loïc Dachary over 7 years ago

  • Status changed from Fix Under Review to In Progress
Actions #34

Updated by Loïc Dachary over 7 years ago

  • Status changed from In Progress to Resolved
  • Target version set to v0.94.8
Actions #35

Updated by Nathan Cutler over 7 years ago

Loic, if we include this in 0.94.8, doesn't that mean we need to change the SHA1 in #15895 and restart QE validation?

Actions #36

Updated by Loïc Dachary over 7 years ago

@Nathan Cutler you're correct, that deserves a rados run to verify that it does not introduce a problem. Given the nature of the change, I don't believe it requires more than that.
