Bug #23653

tcmalloc Attempt to free invalid pointer 0x55de11f2a540 in rocksdb::LRUCache::~LRUCache during mkfs->_open_db

Added by Yuri Weinstein over 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: Immediate
Assignee:
Target version: -
Start date: 04/11/2018
Due date:
% Done: 0%
Source:
Tags:
Backport: mimic
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: smoke
Pull request ID:

Description

This is on rhel7.5 qst run

Run: http://pulpito.ceph.com/teuthology-2018-04-10_20:02:32-smoke-master-testing-basic-smithi/
Jobs: '2381353', '2381344', '2381333', '2381351', '2381334', '2381336', '2381331', '2381352', '2381355', '2381340', '2381335', '2381339', '2381337', '2381354', '2381350', '2381345', '2381348', '2381332', '2381347', '2381341', '2381338', '2381329', '2381342', '2381349', '2381346'

Logs:

2018-04-10T20:58:59.441 INFO:teuthology.orchestra.run.smithi011.stderr:src/tcmalloc.cc:284] Attempt to free invalid pointer 0x55de11f2a540
2018-04-10T20:58:59.442 INFO:teuthology.orchestra.run.smithi011.stderr:*** Caught signal (Aborted) **
2018-04-10T20:58:59.442 INFO:teuthology.orchestra.run.smithi011.stderr: in thread 7f2c8e12f0c0 thread_name:ceph-osd
2018-04-10T20:58:59.444 INFO:teuthology.orchestra.run.smithi011.stderr: ceph version 13.0.2-918-gbd0c68e (bd0c68e085a84d0c972925d2992ef4fb5a2d6e5f) mimic (dev)
2018-04-10T20:58:59.444 INFO:teuthology.orchestra.run.smithi011.stderr: 1: (()+0x8e84d0) [0x55de0f7324d0]
2018-04-10T20:58:59.444 INFO:teuthology.orchestra.run.smithi011.stderr: 2: (()+0xf680) [0x7f2c829c2680]
2018-04-10T20:58:59.444 INFO:teuthology.orchestra.run.smithi011.stderr: 3: (gsignal()+0x37) [0x7f2c819e3207]
2018-04-10T20:58:59.444 INFO:teuthology.orchestra.run.smithi011.stderr: 4: (abort()+0x148) [0x7f2c819e48f8]
2018-04-10T20:58:59.445 INFO:teuthology.orchestra.run.smithi011.stderr: 5: (tcmalloc::Log(tcmalloc::LogMode, char const*, int, tcmalloc::LogItem, tcmalloc::LogItem, tcmalloc::LogItem, tcmalloc::LogItem)+0x1e6) [0x7f2c840288d6]
2018-04-10T20:58:59.445 INFO:teuthology.orchestra.run.smithi011.stderr: 6: (()+0x174b4) [0x7f2c8401d4b4]
2018-04-10T20:58:59.445 INFO:teuthology.orchestra.run.smithi011.stderr: 7: (rocksdb::LRUCache::~LRUCache()+0x65) [0x55de0f75c4a5]
2018-04-10T20:58:59.445 INFO:teuthology.orchestra.run.smithi011.stderr: 8: (std::_Sp_counted_ptr<rocksdb::BlockBasedTableFactory*, (__gnu_cxx::_Lock_policy)2>::_M_dispose()+0x25a) [0x55de0f83f69a]
2018-04-10T20:58:59.445 INFO:teuthology.orchestra.run.smithi011.stderr: 9: (std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()+0x46) [0x55de0f241356]
2018-04-10T20:58:59.445 INFO:teuthology.orchestra.run.smithi011.stderr: 10: (rocksdb::ColumnFamilyOptions::~ColumnFamilyOptions()+0x1e) [0x55de0f68fd6e]
2018-04-10T20:58:59.445 INFO:teuthology.orchestra.run.smithi011.stderr: 11: (RocksDBStore::init(std::string)+0x75) [0x55de0f686865]
2018-04-10T20:58:59.445 INFO:teuthology.orchestra.run.smithi011.stderr: 12: (BlueStore::_open_db(bool, bool)+0xe48) [0x55de0f61c2e8]
2018-04-10T20:58:59.445 INFO:teuthology.orchestra.run.smithi011.stderr: 13: (BlueStore::mkfs()+0x699) [0x55de0f649fa9]
2018-04-10T20:58:59.446 INFO:teuthology.orchestra.run.smithi011.stderr: 14: (OSD::mkfs(CephContext*, ObjectStore*, std::string const&, uuid_d, int)+0x177) [0x55de0f219447]
2018-04-10T20:58:59.446 INFO:teuthology.orchestra.run.smithi011.stderr: 15: (main()+0x2adc) [0x55de0f0ed40c]
2018-04-10T20:58:59.446 INFO:teuthology.orchestra.run.smithi011.stderr: 16: (__libc_start_main()+0xf5) [0x7f2c819cf3d5]
2018-04-10T20:58:59.454 INFO:teuthology.orchestra.run.smithi011.stderr: 17: (()+0x3830d0) [0x55de0f1cd0d0]
2018-04-10T20:58:59.454 INFO:teuthology.orchestra.run.smithi011.stderr:2018-04-10 20:58:59.442 7f2c8e12f0c0 -1 *** Caught signal (Aborted) **
2018-04-10T20:58:59.454 INFO:teuthology.orchestra.run.smithi011.stderr: in thread 7f2c8e12f0c0 thread_name:ceph-osd
2018-04-10T20:58:59.454 INFO:teuthology.orchestra.run.smithi011.stderr:
2018-04-10T20:58:59.454 INFO:teuthology.orchestra.run.smithi011.stderr: ceph version 13.0.2-918-gbd0c68e (bd0c68e085a84d0c972925d2992ef4fb5a2d6e5f) mimic (dev)
2018-04-10T20:58:59.454 INFO:teuthology.orchestra.run.smithi011.stderr: 1: (()+0x8e84d0) [0x55de0f7324d0]
2018-04-10T20:58:59.455 INFO:teuthology.orchestra.run.smithi011.stderr: 2: (()+0xf680) [0x7f2c829c2680]
2018-04-10T20:58:59.455 INFO:teuthology.orchestra.run.smithi011.stderr: 3: (gsignal()+0x37) [0x7f2c819e3207]
2018-04-10T20:58:59.455 INFO:teuthology.orchestra.run.smithi011.stderr: 4: (abort()+0x148) [0x7f2c819e48f8]
2018-04-10T20:58:59.455 INFO:teuthology.orchestra.run.smithi011.stderr: 5: (tcmalloc::Log(tcmalloc::LogMode, char const*, int, tcmalloc::LogItem, tcmalloc::LogItem, tcmalloc::LogItem, tcmalloc::LogItem)+0x1e6) [0x7f2c840288d6]
2018-04-10T20:58:59.455 INFO:teuthology.orchestra.run.smithi011.stderr: 6: (()+0x174b4) [0x7f2c8401d4b4]
2018-04-10T20:58:59.456 INFO:teuthology.orchestra.run.smithi011.stderr: 7: (rocksdb::LRUCache::~LRUCache()+0x65) [0x55de0f75c4a5]
2018-04-10T20:58:59.456 INFO:teuthology.orchestra.run.smithi011.stderr: 8: (std::_Sp_counted_ptr<rocksdb::BlockBasedTableFactory*, (__gnu_cxx::_Lock_policy)2>::_M_dispose()+0x25a) [0x55de0f83f69a]
2018-04-10T20:58:59.456 INFO:teuthology.orchestra.run.smithi011.stderr: 9: (std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()+0x46) [0x55de0f241356]
2018-04-10T20:58:59.456 INFO:teuthology.orchestra.run.smithi011.stderr: 10: (rocksdb::ColumnFamilyOptions::~ColumnFamilyOptions()+0x1e) [0x55de0f68fd6e]
2018-04-10T20:58:59.456 INFO:teuthology.orchestra.run.smithi011.stderr: 11: (RocksDBStore::init(std::string)+0x75) [0x55de0f686865]
2018-04-10T20:58:59.456 INFO:teuthology.orchestra.run.smithi011.stderr: 12: (BlueStore::_open_db(bool, bool)+0xe48) [0x55de0f61c2e8]
2018-04-10T20:58:59.456 INFO:teuthology.orchestra.run.smithi011.stderr: 13: (BlueStore::mkfs()+0x699) [0x55de0f649fa9]
2018-04-10T20:58:59.456 INFO:teuthology.orchestra.run.smithi011.stderr: 14: (OSD::mkfs(CephContext*, ObjectStore*, std::string const&, uuid_d, int)+0x177) [0x55de0f219447]
2018-04-10T20:58:59.456 INFO:teuthology.orchestra.run.smithi011.stderr: 15: (main()+0x2adc) [0x55de0f0ed40c]
2018-04-10T20:58:59.456 INFO:teuthology.orchestra.run.smithi011.stderr: 16: (__libc_start_main()+0xf5) [0x7f2c819cf3d5]
2018-04-10T20:58:59.457 INFO:teuthology.orchestra.run.smithi011.stderr: 17: (()+0x3830d0) [0x55de0f1cd0d0]
2018-04-10T20:58:59.457 INFO:teuthology.orchestra.run.smithi011.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.


Related issues

Related to Ceph - Bug #21422: crash in rocksdb LRUCache destructor with tcmalloc v4.2.6 / gperf-tools v2.5.93 Resolved 09/18/2017
Related to RADOS - Bug #35969: "symbol lookup error: ceph-osd: undefined symbol: _ZdaPvm" on centos 7.4 Resolved 09/13/2018
Copied to bluestore - Backport #24154: mimic: tcmalloc Attempt to free invalid pointer 0x55de11f2a540 in rocksdb::LRUCache::~LRUCache during mkfs->_open_db Resolved

History

#1 Updated by Sage Weil over 1 year ago

  • Project changed from Ceph to bluestore
  • Subject changed from "Caught signal" in smoke on rhel7.5 to tcmalloc Attempt to free invalid pointer 0x55de11f2a540 in rocksdb::LRUCache::~LRUCache during mkfs->_open_db

#2 Updated by Sage Weil over 1 year ago

This looks to me like a build issue with tcmalloc... specifically, building on centos and running on rhel. Running on rhel with a notcmalloc build is fine.

Also, notably, simply running ceph-mon with no arguments, which exits out of main() before doing almost anything at all, results in the tcmalloc message and a segfault. This suggests that something is going wrong in a static singleton definition. My guess is a rocksdb singleton that (incorrectly) does something with tcmalloc and fails due to an incompatible ABI?

#3 Updated by Sage Weil over 1 year ago

On the lab centos deployment:

[sage@smithi099 ~]$ lsb_release -a
LSB Version:    :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: CentOS
Description:    CentOS Linux release 7.4.1708 (Core) 
Release:        7.4.1708
Codename:       Core
[sage@smithi099 ~]$ rpm -qf /lib64/libtcmalloc.so.4
gperftools-libs-2.4-8.el7.x86_64

vs rhel
[sage@smithi095 ~]$ lsb_release -a
LSB Version:    :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: RedHatEnterpriseServer
Description:    Red Hat Enterprise Linux Server release 7.4 (Maipo)
Release:        7.4
Codename:       Maipo
[sage@smithi095 ~]$ rpm -qf /lib64/libtcmalloc.so.4
gperftools-libs-2.4-8.el7.x86_64

#4 Updated by Sage Weil over 1 year ago

except the job runs on rhel 7.5,

[sage@smithi116 ~]$ lsb_release -a
LSB Version:    :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: RedHatEnterpriseServer
Description:    Red Hat Enterprise Linux Server release 7.5 (Maipo)
Release:        7.5
Codename:       Maipo
[sage@smithi116 ~]$ rpm -qf /lib64/libtcmalloc.so.4
gperftools-libs-2.6.1-1.el7.x86_64

but there is no centos 7.5 image.

#5 Updated by Sage Weil over 1 year ago

  • Priority changed from Urgent to Immediate

/a/sage-2018-04-18_19:08:00-rados-wip-sage-testing-2018-04-18-1210-distro-basic-smithi/2413082

#6 Updated by Kefu Chai over 1 year ago

  • Status changed from New to Duplicate

tcmalloc 2.6.1 is buggy. We probably need a run-time dependency check in debian/control or ceph.spec that stops ceph from running with "2.5 < tcmalloc.version < 2.6.2".

Or, better, apply the patch in https://bugzilla.redhat.com/show_bug.cgi?id=1494309 to the gperftools-libs-2.6.1-1.el7.x86_64 shipped with RHEL 7.5.
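For illustration only, the version range quoted above can be written as a small predicate. This is a minimal sketch, not code from ceph or gperftools: `TcmallocVersion` and `in_affected_range` are hypothetical names, and the three integers are assumed to have been obtained elsewhere (for example from the package metadata).

```cpp
#include <cassert>
#include <tuple>

// Hypothetical value type for an upstream gperftools version such as 2.6.1.
struct TcmallocVersion {
    int major_v;
    int minor_v;
    int patch_v;
};

// Returns true when the version lies strictly between 2.5 and 2.6.2,
// i.e. the range "2.5 < tcmalloc.version < 2.6.2" quoted in the comment.
bool in_affected_range(const TcmallocVersion& v) {
    const auto key = std::make_tuple(v.major_v, v.minor_v, v.patch_v);
    const auto lower = std::make_tuple(2, 5, 0);  // exclusive lower bound
    const auto upper = std::make_tuple(2, 6, 2);  // exclusive upper bound
    assert(lower < upper);  // sanity check on the hard-coded bounds
    return lower < key && key < upper;
}
```

Calling `in_affected_range(TcmallocVersion{2, 6, 1})` returns true, while 2.4.0 and 2.6.2 fall outside the range. The comparison only looks at upstream version numbers, so it cannot tell a patched distribution build of 2.6.1 from an unpatched one.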

#7 Updated by Kefu Chai over 1 year ago

I just filed https://bugzilla.redhat.com/show_bug.cgi?id=1569391 to track this issue downstream.

#8 Updated by Kefu Chai over 1 year ago

  • Related to Bug #21422: crash in rocksdb LRUCache destructor with tcmalloc v4.2.6 / gperf-tools v2.5.93 added

#9 Updated by Kefu Chai over 1 year ago

  • Status changed from Duplicate to Verified

Changing the status to Verified because, unlike #21422, this issue is more of a run-time dependency problem.

#10 Updated by Kefu Chai over 1 year ago

I was thinking about statically linking against tcmalloc, but it seems to be a dead end.

See https://sourceware.org/bugzilla/show_bug.cgi?id=20432. The glibc bug was fixed in 2.25, but RHEL/centos 7.4 comes with glibc v2.17, so we cannot safely link tcmalloc statically on RHEL/centos.

Currently ceph pulls in gperftools-libs by depending on libtcmalloc.so.4. We could "Requires" gperftools-libs explicitly, but rpm's spec does not allow something like:

Requires: gperftools-libs != 2.6.1-1

We would want that exclusion because versions <= 2.5 and >= 2.6.1-5 do not have this issue.

#11 Updated by Kefu Chai over 1 year ago

Or we can note this down as a known issue on RHEL 7.5 with gperftools-libs 2.6.1-1.

#12 Updated by Josh Durgin over 1 year ago

  • Assignee set to Kefu Chai

Discussed on IRC: it appears we can work around this by replacing the single aligned_alloc() call in rocksdb with posix_memalign(), which we already use in bufferlist.
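The shape of that substitution can be sketched as follows. This is a minimal illustration under assumptions, not the actual rocksdb or bufferlist code: `allocate_aligned`, `release_aligned` and `is_aligned` are hypothetical helper names, and the sketch assumes a POSIX system where posix_memalign() is declared in `<stdlib.h>`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>  // posix_memalign() and free() on POSIX systems

// Hypothetical helper: true when p is a multiple of alignment.
bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Hypothetical stand-in for a call site that used aligned_alloc().
// posix_memalign() requires alignment to be a power of two and a
// multiple of sizeof(void*); it returns non-zero on failure.
void* allocate_aligned(std::size_t alignment, std::size_t size) {
    void* p = nullptr;
    const int rc = posix_memalign(&p, alignment, size);
    if (rc != 0) {
        return nullptr;  // invalid alignment or out of memory
    }
    assert(is_aligned(p, alignment));
    return p;
}

// Memory from posix_memalign() is released with plain free().
void release_aligned(void* p) {
    free(p);
}
```

A caller would pair `allocate_aligned(64, 1024)` with `release_aligned()` on the returned pointer. Unlike aligned_alloc(), posix_memalign() reports failure through its return code, so the wrapper converts that into a null pointer to keep the call site unchanged.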

#13 Updated by Kefu Chai over 1 year ago

  • Status changed from Verified to Need Review

#14 Updated by Kefu Chai over 1 year ago

  • Status changed from Need Review to Resolved

#15 Updated by Kefu Chai over 1 year ago

  • Status changed from Resolved to Verified

We are now using centos 7.5 for building the rpm, so we should drop this change in cmake.

Getting requirements for /tmp/install-deps.11183/ceph.spec
 --> 1:java-1.8.0-openjdk-devel-1.8.0.171-7.b10.el7.x86_64
 --> sharutils-4.13.3-8.el7.x86_64
 --> Already installed : checkpolicy-2.5-4.el7.x86_64
 --> selinux-policy-devel-3.13.1-192.el7_5.3.noarch
 --> Already installed : bc-1.06.95-13.el7.x86_64
 --> gperf-3.0.4-8.el7.x86_64
 --> Already installed : cmake-2.8.12.2-2.el7.x86_64
 --> cryptsetup-1.7.4-4.el7.x86_64
 --> fuse-devel-2.9.2-10.el7.x86_64
 --> devtoolset-7-gcc-c++-7.2.1-1.el7.sc1.x86_64
 --> Already installed : gdbm-1.10-8.el7.x86_64
 --> gperftools-devel-2.6.1-1.el7.x86_64

CMake Error at cmake/modules/BuildRocksDB.cmake:64 (message):
  Incompatible tcmalloc v2.6.1 and rocksdb v5.13.0, please install
  gperf-tools 2.5 (not 2.5.93) or >= 2.6.2, or switch to another allocator
  using 'cmake -DALLOCATOR=libc'.
Call Stack (most recent call first):
  cmake/modules/BuildRocksDB.cmake:94 (check_aligned_alloc)
  src/CMakeLists.txt:860 (build_rocksdb)

#16 Updated by Kefu Chai over 1 year ago

https://github.com/ceph/ceph/pull/22046 drops the check for tcmalloc.

https://github.com/facebook/rocksdb/pull/3862 is posted to address the issue on the rocksdb side.

#17 Updated by Kefu Chai over 1 year ago

  • Status changed from Verified to Need Review

#18 Updated by Kefu Chai over 1 year ago

  • Status changed from Need Review to Resolved

#19 Updated by Kefu Chai over 1 year ago

  • Copied to Backport #24154: mimic: tcmalloc Attempt to free invalid pointer 0x55de11f2a540 in rocksdb::LRUCache::~LRUCache during mkfs->_open_db added

#20 Updated by Kefu Chai about 1 year ago

  • Backport set to mimic

#21 Updated by Kefu Chai 11 months ago

  • Related to Bug #35969: "symbol lookup error: ceph-osd: undefined symbol: _ZdaPvm" on centos 7.4 added
