Bug #23653
closed
tcmalloc Attempt to free invalid pointer 0x55de11f2a540 in rocksdb::LRUCache::~LRUCache during mkfs->_open_db
Description
This is on rhel7.5 qst run
Run: http://pulpito.ceph.com/teuthology-2018-04-10_20:02:32-smoke-master-testing-basic-smithi/
Jobs: '2381353', '2381344', '2381333', '2381351', '2381334', '2381336', '2381331', '2381352', '2381355', '2381340', '2381335', '2381339', '2381337', '2381354', '2381350', '2381345', '2381348', '2381332', '2381347', '2381341', '2381338', '2381329', '2381342', '2381349', '2381346'
Logs:
2018-04-10T20:58:59.441 INFO:teuthology.orchestra.run.smithi011.stderr:src/tcmalloc.cc:284] Attempt to free invalid pointer 0x55de11f2a540
2018-04-10T20:58:59.442 INFO:teuthology.orchestra.run.smithi011.stderr:*** Caught signal (Aborted) **
2018-04-10T20:58:59.442 INFO:teuthology.orchestra.run.smithi011.stderr: in thread 7f2c8e12f0c0 thread_name:ceph-osd
2018-04-10T20:58:59.444 INFO:teuthology.orchestra.run.smithi011.stderr: ceph version 13.0.2-918-gbd0c68e (bd0c68e085a84d0c972925d2992ef4fb5a2d6e5f) mimic (dev)
2018-04-10T20:58:59.444 INFO:teuthology.orchestra.run.smithi011.stderr: 1: (()+0x8e84d0) [0x55de0f7324d0]
2018-04-10T20:58:59.444 INFO:teuthology.orchestra.run.smithi011.stderr: 2: (()+0xf680) [0x7f2c829c2680]
2018-04-10T20:58:59.444 INFO:teuthology.orchestra.run.smithi011.stderr: 3: (gsignal()+0x37) [0x7f2c819e3207]
2018-04-10T20:58:59.444 INFO:teuthology.orchestra.run.smithi011.stderr: 4: (abort()+0x148) [0x7f2c819e48f8]
2018-04-10T20:58:59.445 INFO:teuthology.orchestra.run.smithi011.stderr: 5: (tcmalloc::Log(tcmalloc::LogMode, char const*, int, tcmalloc::LogItem, tcmalloc::LogItem, tcmalloc::LogItem, tcmalloc::LogItem)+0x1e6) [0x7f2c840288d6]
2018-04-10T20:58:59.445 INFO:teuthology.orchestra.run.smithi011.stderr: 6: (()+0x174b4) [0x7f2c8401d4b4]
2018-04-10T20:58:59.445 INFO:teuthology.orchestra.run.smithi011.stderr: 7: (rocksdb::LRUCache::~LRUCache()+0x65) [0x55de0f75c4a5]
2018-04-10T20:58:59.445 INFO:teuthology.orchestra.run.smithi011.stderr: 8: (std::_Sp_counted_ptr<rocksdb::BlockBasedTableFactory*, (__gnu_cxx::_Lock_policy)2>::_M_dispose()+0x25a) [0x55de0f83f69a]
2018-04-10T20:58:59.445 INFO:teuthology.orchestra.run.smithi011.stderr: 9: (std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()+0x46) [0x55de0f241356]
2018-04-10T20:58:59.445 INFO:teuthology.orchestra.run.smithi011.stderr: 10: (rocksdb::ColumnFamilyOptions::~ColumnFamilyOptions()+0x1e) [0x55de0f68fd6e]
2018-04-10T20:58:59.445 INFO:teuthology.orchestra.run.smithi011.stderr: 11: (RocksDBStore::init(std::string)+0x75) [0x55de0f686865]
2018-04-10T20:58:59.445 INFO:teuthology.orchestra.run.smithi011.stderr: 12: (BlueStore::_open_db(bool, bool)+0xe48) [0x55de0f61c2e8]
2018-04-10T20:58:59.445 INFO:teuthology.orchestra.run.smithi011.stderr: 13: (BlueStore::mkfs()+0x699) [0x55de0f649fa9]
2018-04-10T20:58:59.446 INFO:teuthology.orchestra.run.smithi011.stderr: 14: (OSD::mkfs(CephContext*, ObjectStore*, std::string const&, uuid_d, int)+0x177) [0x55de0f219447]
2018-04-10T20:58:59.446 INFO:teuthology.orchestra.run.smithi011.stderr: 15: (main()+0x2adc) [0x55de0f0ed40c]
2018-04-10T20:58:59.446 INFO:teuthology.orchestra.run.smithi011.stderr: 16: (__libc_start_main()+0xf5) [0x7f2c819cf3d5]
2018-04-10T20:58:59.454 INFO:teuthology.orchestra.run.smithi011.stderr: 17: (()+0x3830d0) [0x55de0f1cd0d0]
2018-04-10T20:58:59.454 INFO:teuthology.orchestra.run.smithi011.stderr:2018-04-10 20:58:59.442 7f2c8e12f0c0 -1 *** Caught signal (Aborted) **
2018-04-10T20:58:59.454 INFO:teuthology.orchestra.run.smithi011.stderr: in thread 7f2c8e12f0c0 thread_name:ceph-osd
2018-04-10T20:58:59.454 INFO:teuthology.orchestra.run.smithi011.stderr:
2018-04-10T20:58:59.454 INFO:teuthology.orchestra.run.smithi011.stderr: ceph version 13.0.2-918-gbd0c68e (bd0c68e085a84d0c972925d2992ef4fb5a2d6e5f) mimic (dev)
2018-04-10T20:58:59.454 INFO:teuthology.orchestra.run.smithi011.stderr: 1: (()+0x8e84d0) [0x55de0f7324d0]
2018-04-10T20:58:59.455 INFO:teuthology.orchestra.run.smithi011.stderr: 2: (()+0xf680) [0x7f2c829c2680]
2018-04-10T20:58:59.455 INFO:teuthology.orchestra.run.smithi011.stderr: 3: (gsignal()+0x37) [0x7f2c819e3207]
2018-04-10T20:58:59.455 INFO:teuthology.orchestra.run.smithi011.stderr: 4: (abort()+0x148) [0x7f2c819e48f8]
2018-04-10T20:58:59.455 INFO:teuthology.orchestra.run.smithi011.stderr: 5: (tcmalloc::Log(tcmalloc::LogMode, char const*, int, tcmalloc::LogItem, tcmalloc::LogItem, tcmalloc::LogItem, tcmalloc::LogItem)+0x1e6) [0x7f2c840288d6]
2018-04-10T20:58:59.455 INFO:teuthology.orchestra.run.smithi011.stderr: 6: (()+0x174b4) [0x7f2c8401d4b4]
2018-04-10T20:58:59.456 INFO:teuthology.orchestra.run.smithi011.stderr: 7: (rocksdb::LRUCache::~LRUCache()+0x65) [0x55de0f75c4a5]
2018-04-10T20:58:59.456 INFO:teuthology.orchestra.run.smithi011.stderr: 8: (std::_Sp_counted_ptr<rocksdb::BlockBasedTableFactory*, (__gnu_cxx::_Lock_policy)2>::_M_dispose()+0x25a) [0x55de0f83f69a]
2018-04-10T20:58:59.456 INFO:teuthology.orchestra.run.smithi011.stderr: 9: (std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()+0x46) [0x55de0f241356]
2018-04-10T20:58:59.456 INFO:teuthology.orchestra.run.smithi011.stderr: 10: (rocksdb::ColumnFamilyOptions::~ColumnFamilyOptions()+0x1e) [0x55de0f68fd6e]
2018-04-10T20:58:59.456 INFO:teuthology.orchestra.run.smithi011.stderr: 11: (RocksDBStore::init(std::string)+0x75) [0x55de0f686865]
2018-04-10T20:58:59.456 INFO:teuthology.orchestra.run.smithi011.stderr: 12: (BlueStore::_open_db(bool, bool)+0xe48) [0x55de0f61c2e8]
2018-04-10T20:58:59.456 INFO:teuthology.orchestra.run.smithi011.stderr: 13: (BlueStore::mkfs()+0x699) [0x55de0f649fa9]
2018-04-10T20:58:59.456 INFO:teuthology.orchestra.run.smithi011.stderr: 14: (OSD::mkfs(CephContext*, ObjectStore*, std::string const&, uuid_d, int)+0x177) [0x55de0f219447]
2018-04-10T20:58:59.456 INFO:teuthology.orchestra.run.smithi011.stderr: 15: (main()+0x2adc) [0x55de0f0ed40c]
2018-04-10T20:58:59.456 INFO:teuthology.orchestra.run.smithi011.stderr: 16: (__libc_start_main()+0xf5) [0x7f2c819cf3d5]
2018-04-10T20:58:59.457 INFO:teuthology.orchestra.run.smithi011.stderr: 17: (()+0x3830d0) [0x55de0f1cd0d0]
2018-04-10T20:58:59.457 INFO:teuthology.orchestra.run.smithi011.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Updated by Sage Weil about 6 years ago
- Project changed from Ceph to bluestore
- Subject changed from "Caught signal" in smoke on rhel7.5 to tcmalloc Attempt to free invalid pointer 0x55de11f2a540 in rocksdb::LRUCache::~LRUCache during mkfs->_open_db
Updated by Sage Weil about 6 years ago
This looks to me like a build issue with tcmalloc... specifically, building in centos and running in rhel. Running on rhel with a notcmalloc build is fine.
Also, notably, simply running ceph-mon with no arguments, which exits out of main() before doing almost anything at all, results in the tcmalloc message and segfault. This suggests that something is going wrong in a static singleton definition. My guess is a rocksdb singleton that (incorrectly) does something with tcmalloc and fails due to an incompatible ABI?
Updated by Sage Weil about 6 years ago
on lab centos deploy,
[sage@smithi099 ~]$ lsb_release -a
LSB Version: :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: CentOS
Description: CentOS Linux release 7.4.1708 (Core)
Release: 7.4.1708
Codename: Core
[sage@smithi099 ~]$ rpm -qf /lib64/libtcmalloc.so.4
gperftools-libs-2.4-8.el7.x86_64
vs rhel
[sage@smithi095 ~]$ lsb_release -a
LSB Version: :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: RedHatEnterpriseServer
Description: Red Hat Enterprise Linux Server release 7.4 (Maipo)
Release: 7.4
Codename: Maipo
[sage@smithi095 ~]$ rpm -qf /lib64/libtcmalloc.so.4
gperftools-libs-2.4-8.el7.x86_64
Updated by Sage Weil about 6 years ago
except the job runs on rhel 7.5,
[sage@smithi116 ~]$ lsb_release -a
LSB Version: :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: RedHatEnterpriseServer
Description: Red Hat Enterprise Linux Server release 7.5 (Maipo)
Release: 7.5
Codename: Maipo
[sage@smithi116 ~]$ rpm -qf /lib64/libtcmalloc.so.4
gperftools-libs-2.6.1-1.el7.x86_64
but there is no centos 7.5 image.
Updated by Sage Weil about 6 years ago
- Priority changed from Urgent to Immediate
/a/sage-2018-04-18_19:08:00-rados-wip-sage-testing-2018-04-18-1210-distro-basic-smithi/2413082
Updated by Kefu Chai about 6 years ago
- Status changed from New to Duplicate
tcmalloc 2.6.1 is buggy. we probably need a runtime dependency check in debian/control or ceph.spec that disallows running ceph with "2.5 < tcmalloc.version < 2.6.2".
or, better yet, apply the patch in https://bugzilla.redhat.com/show_bug.cgi?id=1494309 to the gperftools-libs-2.6.1-1.el7.x86_64 shipped with RHEL 7.5
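The version window quoted above can be sketched as a small predicate. This is only an illustration: `parse_version` and `is_affected_tcmalloc` are hypothetical names, not Ceph or gperftools functions, and the version string is assumed to have already been obtained from somewhere (for example the package manager).

```cpp
#include <cassert>
#include <cstdio>
#include <tuple>

// Hypothetical helper: parse "major.minor[.patch]" into a comparable tuple.
// Components that are missing from the string are left at 0.
inline std::tuple<int, int, int> parse_version(const char* s) {
  int major = 0, minor = 0, patch = 0;
  std::sscanf(s, "%d.%d.%d", &major, &minor, &patch);
  return std::make_tuple(major, minor, patch);
}

// Hypothetical predicate for the window named in the comment above:
// strictly newer than 2.5 and strictly older than 2.6.2.
inline bool is_affected_tcmalloc(const char* version) {
  const auto v = parse_version(version);
  return v > std::make_tuple(2, 5, 0) && v < std::make_tuple(2, 6, 2);
}
```

With this sketch, "2.6.1" and "2.5.93" fall inside the window, while "2.4", "2.5" and "2.6.2" do not. A comparison on the upstream version alone cannot tell a patched distribution package apart from an unpatched one of the same version.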
Updated by Kefu Chai about 6 years ago
i just filed https://bugzilla.redhat.com/show_bug.cgi?id=1569391 to track this issue at downstream.
Updated by Kefu Chai about 6 years ago
- Related to Bug #21422: crash in rocksdb LRUCache destructor with tcmalloc v4.2.6 / gperf-tools v2.5.93 added
Updated by Kefu Chai about 6 years ago
- Status changed from Duplicate to 12
changed the status to verified, because, unlike #21422, this issue is more of a run-time dependency problem.
Updated by Kefu Chai almost 6 years ago
i was thinking about statically linking against tcmalloc, but seems it's a dead-end.
see https://sourceware.org/bugzilla/show_bug.cgi?id=20432. and the glibc bug was fixed in 2.25, but RHEL/centos 7.4 comes with glibc v2.17. so we cannot link tcmalloc statically on RHEL/centos safely.
currently ceph pulls in gperftools-libs by depending on libtcmalloc.so.4. we could "Requires" gperftools-libs explicitly, but rpm's spec does not allow something like
Requires: gperftools-libs != 2.6.1-1
which is what we would need, because <= 2.5 and >= 2.6.1-5 do not have this issue.
Updated by Kefu Chai almost 6 years ago
or we can note this down as a known issue on RHEL7.5 with gperftools-libs 2.6.1-1.
Updated by Josh Durgin almost 6 years ago
- Assignee set to Kefu Chai
Discussed on irc, it appears we can work around this by replacing the single aligned_alloc() call in rocksdb with posix_memalign(), which we already use in bufferlist.
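A minimal sketch of the kind of substitution described above, assuming a POSIX system. `alloc_aligned` and `is_aligned` are hypothetical helper names, and this is not the actual rocksdb or bufferlist code. The idea is that memory obtained through posix_memalign() is released with plain free(), so allocation and deallocation go through the same allocator entry points.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>  // posix_memalign, free (POSIX)

// Hypothetical helper: allocate `size` bytes aligned to `alignment`.
// posix_memalign requires `alignment` to be a power of two and a multiple
// of sizeof(void*); it returns a non-zero error code on failure.
inline void* alloc_aligned(std::size_t alignment, std::size_t size) {
  void* p = nullptr;
  if (posix_memalign(&p, alignment, size) != 0) {
    return nullptr;
  }
  return p;  // caller releases the block with free()
}

// Hypothetical helper: check that a pointer honours the requested alignment.
inline bool is_aligned(const void* p, std::size_t alignment) {
  return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

A caller would use `void* buf = alloc_aligned(4096, 100);`, check the result for nullptr, and later call `free(buf)`.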
Updated by Kefu Chai almost 6 years ago
- Status changed from 12 to Fix Under Review
Updated by Kefu Chai almost 6 years ago
- Status changed from Fix Under Review to Resolved
Updated by Kefu Chai almost 6 years ago
- Status changed from Resolved to 12
we are now using centos 7.5 for building rpm. so we should drop this change in cmake.
Getting requirements for /tmp/install-deps.11183/ceph.spec
 --> 1:java-1.8.0-openjdk-devel-1.8.0.171-7.b10.el7.x86_64
 --> sharutils-4.13.3-8.el7.x86_64
 --> Already installed : checkpolicy-2.5-4.el7.x86_64
 --> selinux-policy-devel-3.13.1-192.el7_5.3.noarch
 --> Already installed : bc-1.06.95-13.el7.x86_64
 --> gperf-3.0.4-8.el7.x86_64
 --> Already installed : cmake-2.8.12.2-2.el7.x86_64
 --> cryptsetup-1.7.4-4.el7.x86_64
 --> fuse-devel-2.9.2-10.el7.x86_64
 --> devtoolset-7-gcc-c++-7.2.1-1.el7.sc1.x86_64
 --> Already installed : gdbm-1.10-8.el7.x86_64
 --> gperftools-devel-2.6.1-1.el7.x86_64
CMake Error at cmake/modules/BuildRocksDB.cmake:64 (message):
  Incompatible tcmalloc v2.6.1 and rocksdb v5.13.0, please install gperf-tools 2.5 (not 2.5.93) or >= 2.6.2, or switch to another allocator using 'cmake -DALLOCATOR=libc'.
Call Stack (most recent call first):
  cmake/modules/BuildRocksDB.cmake:94 (check_aligned_alloc)
  src/CMakeLists.txt:860 (build_rocksdb)
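The body of check_aligned_alloc is not shown in this thread. As a rough illustration of what an allocator probe of this kind might exercise, the sketch below allocates with aligned_alloc() and releases with free(). `aligned_alloc_roundtrip` is a hypothetical name, and the assumption (not confirmed here) is that on an affected tcmalloc build the free() call is where the "Attempt to free invalid pointer" abort would surface.

```cpp
#include <cassert>
#include <cstddef>
#include <stdlib.h>  // aligned_alloc (C11), free

// Hypothetical probe: allocate with aligned_alloc and release with free.
// With a consistent allocator this returns true; with a mismatched pair
// the process may abort inside free() instead of returning.
inline bool aligned_alloc_roundtrip(std::size_t alignment, std::size_t size) {
  // C11 asks for `size` to be a multiple of `alignment`.
  void* p = aligned_alloc(alignment, size);
  if (p == nullptr) {
    return false;
  }
  free(p);
  return true;
}
```

Calling `aligned_alloc_roundtrip(4096, 4096)` in a binary linked against the allocator under test is enough to run the probe.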
Updated by Kefu Chai almost 6 years ago
https://github.com/ceph/ceph/pull/22046 to drop the check for tcmalloc
https://github.com/facebook/rocksdb/pull/3862 is posted to address the issue on rocksdb side.
Updated by Kefu Chai almost 6 years ago
- Status changed from 12 to Fix Under Review
Updated by Kefu Chai almost 6 years ago
- Status changed from Fix Under Review to Resolved
Updated by Kefu Chai almost 6 years ago
- Copied to Backport #24154: mimic: tcmalloc Attempt to free invalid pointer 0x55de11f2a540 in rocksdb::LRUCache::~LRUCache during mkfs->_open_db added
Updated by Kefu Chai over 5 years ago
- Related to Bug #35969: "symbol lookup error: ceph-osd: undefined symbol: _ZdaPvm" on centos 7.4 added