Bug #21422
closed
crash in rocksdb LRUCache destructor with tcmalloc v4.2.6 / gperf-tools v2.5.93
Description
Did a git pull this morning (HEAD at 30b16ac8142ec87942d852992b8ad6672437ccf6), added a few of my patches on top and then tried to start vstart:
$ ../src/stop.sh ; ../src/vstart.sh --mon_num 1 --osd_num 1 --mds_num 1 -n
2017-09-18 13:49:52.036071 7ffff7f98540 -1 WARNING: all dangerous and experimental features are enabled.
2017-09-18 13:49:52.036173 7ffff7f98540 -1 WARNING: all dangerous and experimental features are enabled.
2017-09-18 13:49:52.038735 7ffff7f98540 -1 WARNING: all dangerous and experimental features are enabled.
2017-09-18 13:49:52.062077 7ffff7f98540 -1 WARNING: all dangerous and experimental features are enabled.
2017-09-18 13:49:52.062254 7ffff7f98540 -1 WARNING: all dangerous and experimental features are enabled.
2017-09-18 13:49:52.064798 7ffff7f98540 -1 WARNING: all dangerous and experimental features are enabled.
mon_num:1
=== mon.a ===
Stopping Ceph mon.a on tleilax...done
'rm' '-f' 'core*'
hostname tleilax
ip 192.168.1.3
port 6789
'/home/jlayton/git/ceph/build/bin/ceph-authtool' '--create-keyring' '--gen-key' '--name=mon.' '/home/jlayton/git/ceph/build/keyring' '--cap' 'mon' 'allow *'
creating /home/jlayton/git/ceph/build/keyring
'/home/jlayton/git/ceph/build/bin/ceph-authtool' '--gen-key' '--name=client.admin' '--set-uid=0' '--cap' 'mon' 'allow *' '--cap' 'osd' 'allow *' '--cap' 'mds' 'allow *' '--cap' 'mgr' 'allow *' '/home/jlayton/git/ceph/build/keyring'
'/home/jlayton/git/ceph/build/bin/ceph-authtool' '--gen-key' '--name=client.rgw' '--cap' 'mon' 'allow rw' '--cap' 'osd' 'allow rwx' '--cap' 'mgr' 'allow rw' '/home/jlayton/git/ceph/build/keyring'
'/home/jlayton/git/ceph/build/bin/monmaptool' '--create' '--clobber' '--add' 'a' '192.168.1.3:6789' '--print' '/tmp/ceph_monmap.21609'
/home/jlayton/git/ceph/build/bin/monmaptool: monmap file /tmp/ceph_monmap.21609
/home/jlayton/git/ceph/build/bin/monmaptool: generated fsid 2b8f3b27-ed48-4f7d-aa9b-206e2bc34030
epoch 0
fsid 2b8f3b27-ed48-4f7d-aa9b-206e2bc34030
last_changed 2017-09-18 13:49:52.651602
created 2017-09-18 13:49:52.651602
0: 192.168.1.3:6789/0 mon.a
/home/jlayton/git/ceph/build/bin/monmaptool: writing epoch 0 to /tmp/ceph_monmap.21609 (1 monitors)
'rm' '-rf' '--' '/home/jlayton/git/ceph/build/dev/mon.a'
'mkdir' '-p' '/home/jlayton/git/ceph/build/dev/mon.a'
'/home/jlayton/git/ceph/build/bin/ceph-mon' '--mkfs' '-c' '/home/jlayton/git/ceph/build/ceph.conf' '-i' 'a' '--monmap=/tmp/ceph_monmap.21609' '--keyring=/home/jlayton/git/ceph/build/keyring'
src/tcmalloc.cc:284] Attempt to free invalid pointer 0x55555f072540
*** Caught signal (Aborted) **
 in thread 7ffff7fbb240 thread_name:ceph-mon
 ceph version 12.1.2-2065-g9ae8bd590896 (9ae8bd5908960f93de2928d3a5c5299fcca9b4d4) mimic (dev)
 1: (()+0x8daf48) [0x555555e2ef48]
 2: (()+0x123b0) [0x7ffff63453b0]
 3: (gsignal()+0xcb) [0x7ffff462369b]
 4: (abort()+0x1b0) [0x7ffff46254a0]
 5: (tcmalloc::Log(tcmalloc::LogMode, char const*, int, tcmalloc::LogItem, tcmalloc::LogItem, tcmalloc::LogItem, tcmalloc::LogItem)+0x229) [0x7ffff58c3a29]
 6: (()+0x163d9) [0x7ffff58b83d9]
 7: (rocksdb::LRUCache::~LRUCache()+0x65) [0x555555e96825]
 8: (std::_Sp_counted_ptr<rocksdb::BlockBasedTableFactory*, (__gnu_cxx::_Lock_policy)2>::_M_dispose()+0x25a) [0x555555ef64ba]
 9: (std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()+0x46) [0x5555558aa496]
 10: (rocksdb::ColumnFamilyOptions::~ColumnFamilyOptions()+0x1e) [0x5555559bb3de]
 11: (RocksDBStore::init(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)+0x71) [0x5555559b2c41]
 12: (MonitorDBStore::_open(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)+0x262) [0x5555558a7e82]
 13: (MonitorDBStore::create_and_open(std::ostream&)+0xb3) [0x5555558a83f3]
 14: (main()+0x1162) [0x5555557d52c2]
 15: (__libc_start_main()+0xea) [0x7ffff460d50a]
 16: (_start()+0x2a) [0x5555558a559a]
2017-09-18 13:49:52.699756 7ffff7fbb240 -1 *** Caught signal (Aborted) **
 in thread 7ffff7fbb240 thread_name:ceph-mon
 ceph version 12.1.2-2065-g9ae8bd590896 (9ae8bd5908960f93de2928d3a5c5299fcca9b4d4) mimic (dev)
 1: (()+0x8daf48) [0x555555e2ef48]
 2: (()+0x123b0) [0x7ffff63453b0]
 3: (gsignal()+0xcb) [0x7ffff462369b]
 4: (abort()+0x1b0) [0x7ffff46254a0]
 5: (tcmalloc::Log(tcmalloc::LogMode, char const*, int, tcmalloc::LogItem, tcmalloc::LogItem, tcmalloc::LogItem, tcmalloc::LogItem)+0x229) [0x7ffff58c3a29]
 6: (()+0x163d9) [0x7ffff58b83d9]
 7: (rocksdb::LRUCache::~LRUCache()+0x65) [0x555555e96825]
 8: (std::_Sp_counted_ptr<rocksdb::BlockBasedTableFactory*, (__gnu_cxx::_Lock_policy)2>::_M_dispose()+0x25a) [0x555555ef64ba]
 9: (std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()+0x46) [0x5555558aa496]
 10: (rocksdb::ColumnFamilyOptions::~ColumnFamilyOptions()+0x1e) [0x5555559bb3de]
 11: (RocksDBStore::init(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)+0x71) [0x5555559b2c41]
 12: (MonitorDBStore::_open(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)+0x262) [0x5555558a7e82]
 13: (MonitorDBStore::create_and_open(std::ostream&)+0xb3) [0x5555558a83f3]
 14: (main()+0x1162) [0x5555557d52c2]
 15: (__libc_start_main()+0xea) [0x7ffff460d50a]
 16: (_start()+0x2a) [0x5555558a559a]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
0> 2017-09-18 13:49:52.699756 7ffff7fbb240 -1 *** Caught signal (Aborted) **
 in thread 7ffff7fbb240 thread_name:ceph-mon
 ceph version 12.1.2-2065-g9ae8bd590896 (9ae8bd5908960f93de2928d3a5c5299fcca9b4d4) mimic (dev)
 1: (()+0x8daf48) [0x555555e2ef48]
 2: (()+0x123b0) [0x7ffff63453b0]
 3: (gsignal()+0xcb) [0x7ffff462369b]
 4: (abort()+0x1b0) [0x7ffff46254a0]
 5: (tcmalloc::Log(tcmalloc::LogMode, char const*, int, tcmalloc::LogItem, tcmalloc::LogItem, tcmalloc::LogItem, tcmalloc::LogItem)+0x229) [0x7ffff58c3a29]
 6: (()+0x163d9) [0x7ffff58b83d9]
 7: (rocksdb::LRUCache::~LRUCache()+0x65) [0x555555e96825]
 8: (std::_Sp_counted_ptr<rocksdb::BlockBasedTableFactory*, (__gnu_cxx::_Lock_policy)2>::_M_dispose()+0x25a) [0x555555ef64ba]
 9: (std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()+0x46) [0x5555558aa496]
 10: (rocksdb::ColumnFamilyOptions::~ColumnFamilyOptions()+0x1e) [0x5555559bb3de]
 11: (RocksDBStore::init(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)+0x71) [0x5555559b2c41]
 12: (MonitorDBStore::_open(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)+0x262) [0x5555558a7e82]
 13: (MonitorDBStore::create_and_open(std::ostream&)+0xb3) [0x5555558a83f3]
 14: (main()+0x1162) [0x5555557d52c2]
 15: (__libc_start_main()+0xea) [0x7ffff460d50a]
 16: (_start()+0x2a) [0x5555558a559a]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
../src/vstart.sh: line 374: 21736 Aborted (core dumped) "$@"
I'm building and running this on a relatively up-to-date Fedora 26 (f26) system. Seems like a double free or maybe a corrupt pointer in there? I'm rebuilding the actual master branch now to verify that it's nothing I broke, but most of my patches would not affect this.
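For readers unfamiliar with the frames in the trace (~ColumnFamilyOptions, then _M_release, then _M_dispose for the table factory, then ~LRUCache), the following is a minimal C++ sketch of how dropping the last reference to an options object can cascade down to a delete[] inside a cache destructor. Every type and function name here (Shard, Cache, TableFactory, Options, open_and_discard_options, destruction_log) is a hypothetical stand-in, not the RocksDB or Ceph code, and the sketch does not reproduce the crash itself.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Records the order in which the hypothetical destructors run.
std::vector<std::string>& destruction_log() {
  static std::vector<std::string> log;
  return log;
}

// Hypothetical stand-in for a cache shard with a non-trivial destructor.
struct Shard {
  ~Shard() {}
};

// Hypothetical stand-in for a sharded cache that owns a raw array.
struct Cache {
  explicit Cache(std::size_t n) : shards_(new Shard[n]) {}
  Cache(const Cache&) = delete;
  Cache& operator=(const Cache&) = delete;
  ~Cache() {
    destruction_log().push_back("Cache");
    delete[] shards_;  // the allocator sees the free request here
  }
  Shard* shards_;
};

// Hypothetical table factory holding the only reference to the cache.
struct TableFactory {
  std::shared_ptr<Cache> block_cache = std::make_shared<Cache>(16);
  ~TableFactory() { destruction_log().push_back("TableFactory"); }
};

// Hypothetical options object holding the only reference to the factory.
struct Options {
  std::shared_ptr<TableFactory> table_factory = std::make_shared<TableFactory>();
  ~Options() { destruction_log().push_back("Options"); }
};

// A local Options goes out of scope on return, which releases the factory,
// which in turn releases the cache and runs delete[] on the shard array.
void open_and_discard_options() {
  Options opts;
}
```

Calling open_and_discard_options() and then inspecting destruction_log() shows the outermost object being torn down first and the cache last, which matches the nesting order of the frames in the trace.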
Updated by Jeff Layton over 6 years ago
(gdb) bt
#0 0x00007ffff634525b in raise () from /lib64/libpthread.so.0
#1 0x0000555555e2ee52 in reraise_fatal (signum=6) at /home/jlayton/git/ceph/src/global/signal_handler.cc:74
#2 handle_fatal_signal (signum=6) at /home/jlayton/git/ceph/src/global/signal_handler.cc:138
#3 <signal handler called>
#4 0x00007ffff462369b in raise () from /lib64/libc.so.6
#5 0x00007ffff46254a0 in abort () from /lib64/libc.so.6
#6 0x00007ffff58c3a29 in tcmalloc::Log(tcmalloc::LogMode, char const*, int, tcmalloc::LogItem, tcmalloc::LogItem, tcmalloc::LogItem, tcmalloc::LogItem) () from /lib64/libtcmalloc.so.4
#7 0x00007ffff58b83d9 in (anonymous namespace)::InvalidFree(void*) () from /lib64/libtcmalloc.so.4
#8 0x0000555555e966a5 in rocksdb::LRUCache::~LRUCache (this=0x55555f1d5660, __in_chrg=<optimized out>) at /home/jlayton/git/ceph/src/rocksdb/cache/lru_cache.cc:476
#9 0x0000555555ef633a in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x55555f1d5650) at /usr/include/c++/7/bits/shared_ptr_base.h:154
#10 std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x55555ee91108, __in_chrg=<optimized out>) at /usr/include/c++/7/bits/shared_ptr_base.h:682
#11 std::__shared_ptr<rocksdb::Cache, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x55555ee91100, __in_chrg=<optimized out>) at /usr/include/c++/7/bits/shared_ptr_base.h:1121
#12 std::shared_ptr<rocksdb::Cache>::~shared_ptr (this=0x55555ee91100, __in_chrg=<optimized out>) at /usr/include/c++/7/bits/shared_ptr.h:93
#13 rocksdb::BlockBasedTableOptions::~BlockBasedTableOptions (this=0x55555ee910e8, __in_chrg=<optimized out>) at /home/jlayton/git/ceph/src/rocksdb/include/rocksdb/table.h:52
#14 rocksdb::BlockBasedTableFactory::~BlockBasedTableFactory (this=0x55555ee910e0, __in_chrg=<optimized out>) at /home/jlayton/git/ceph/src/rocksdb/table/block_based_table_factory.h:34
#15 rocksdb::BlockBasedTableFactory::~BlockBasedTableFactory (this=0x55555ee910e0, __in_chrg=<optimized out>) at /home/jlayton/git/ceph/src/rocksdb/table/block_based_table_factory.h:34
#16 std::_Sp_counted_ptr<rocksdb::BlockBasedTableFactory*, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=<optimized out>) at /usr/include/c++/7/bits/shared_ptr_base.h:376
#17 0x00005555558aa4a6 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x55555f1f0900) at /usr/include/c++/7/bits/shared_ptr_base.h:154
#18 0x00005555559bb3fe in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x7ffffffface0, __in_chrg=<optimized out>) at /usr/include/c++/7/bits/shared_ptr_base.h:682
#19 std::__shared_ptr<rocksdb::TableFactory, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7fffffffacd8, __in_chrg=<optimized out>) at /usr/include/c++/7/bits/shared_ptr_base.h:1121
#20 std::shared_ptr<rocksdb::TableFactory>::~shared_ptr (this=0x7fffffffacd8, __in_chrg=<optimized out>) at /usr/include/c++/7/bits/shared_ptr.h:93
#21 rocksdb::ColumnFamilyOptions::~ColumnFamilyOptions (this=0x7fffffffab18, __in_chrg=<optimized out>) at /home/jlayton/git/ceph/src/rocksdb/include/rocksdb/options.h:81
#22 0x00005555559b2c61 in rocksdb::Options::~Options (this=0x7fffffffa920, __in_chrg=<optimized out>) at /home/jlayton/git/ceph/src/rocksdb/include/rocksdb/options.h:905
#23 RocksDBStore::init (this=0x55555f15e000, _options_str=...) at /home/jlayton/git/ceph/src/kv/RocksDBStore.cc:219
#24 0x00005555558a7e92 in MonitorDBStore::_open (this=<optimized out>, kv_type=...) at /home/jlayton/git/ceph/src/mon/MonitorDBStore.h:624
#25 0x00005555558a8403 in MonitorDBStore::create_and_open (this=0x7fffffffb910, out=...) at /home/jlayton/git/ceph/src/mon/MonitorDBStore.h:658
#26 0x00005555557d52d2 in main (argc=<optimized out>, argv=0x7fffffffd568) at /home/jlayton/git/ceph/src/ceph_mon.cc:421
(gdb) f 8
#8 0x0000555555e966a5 in rocksdb::LRUCache::~LRUCache (this=0x55555f1d5660, __in_chrg=<optimized out>) at /home/jlayton/git/ceph/src/rocksdb/cache/lru_cache.cc:476
476 LRUCache::~LRUCache() { delete[] shards_; }
(gdb) p shards_
$1 = (rocksdb::LRUCacheShard *) 0x55555f072580
(gdb) p this
$2 = (rocksdb::LRUCache * const) 0x55555f1d5660
(gdb) p *this
$3 = {<rocksdb::ShardedCache> = {<rocksdb::Cache> = {_vptr.Cache = 0x55555641f528 <vtable for rocksdb::LRUCache+16>}, num_shard_bits_ = 4, capacity_mutex_ = {mu_ = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}}, capacity_ = 8388608, strict_capacity_limit_ = false, last_id_ = {<std::__atomic_base<unsigned long>> = { static _S_alignment = 8, _M_i = 1}, <No data fields>}}, shards_ = 0x55555f072580, num_shards_ = 16}
That's about as far as I can dig as I don't really know this piece of code at all. I can reproduce this at will though, and am happy to test fixes if anyone has one.
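One detail that can be read straight off the output: the pointer tcmalloc rejects (0x55555f072540) is not the value gdb prints for shards_ (0x55555f072580). The small C++ sketch below just does that arithmetic with the numbers from this report. The helper names (offset_bytes, is_aligned, shard_count) are made up for illustration. The interpretation in the comments, that delete[] hands the allocator an address 64 bytes below the array because of a prefix in front of cache-line-aligned elements, is an assumption and is not confirmed anywhere in this thread.

```cpp
#include <cassert>
#include <cstdint>

// Addresses copied from the report.
constexpr std::uint64_t kFreedPointer = 0x55555f072540ULL;   // tcmalloc: "Attempt to free invalid pointer"
constexpr std::uint64_t kShardsPointer = 0x55555f072580ULL;  // gdb: p shards_

// Hypothetical helper: how far below the array the freed address sits.
constexpr std::uint64_t offset_bytes(std::uint64_t freed, std::uint64_t array) {
  return array - freed;
}

// Hypothetical helper: true if the address is a multiple of the alignment.
constexpr bool is_aligned(std::uint64_t address, std::uint64_t alignment) {
  return address % alignment == 0;
}

// Hypothetical helper: num_shard_bits_ = 4 and num_shards_ = 16 in the gdb
// dump are consistent with a power-of-two shard count.
constexpr unsigned shard_count(unsigned shard_bits) {
  return 1u << shard_bits;
}

// The freed address is exactly 64 bytes below shards_, and both addresses are
// 64-byte aligned. Assumption: this gap is a size prefix padded to the
// alignment of the array elements, so the freed address may be the true start
// of the allocation rather than a corrupted value.
static_assert(offset_bytes(kFreedPointer, kShardsPointer) == 64, "gap is 0x40 bytes");
static_assert(is_aligned(kShardsPointer, 64), "shards_ is cache-line aligned");
static_assert(is_aligned(kFreedPointer, 64), "freed address is cache-line aligned");
static_assert(shard_count(4) == 16, "matches num_shards_ in the dump");
```

If that reading holds, the address passed to free would be a plausible allocation start, which would point at the allocator's bookkeeping rather than at a stray pointer in the caller. That is only a hypothesis drawn from the two addresses.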
Updated by Jeff Layton over 6 years ago
I was previously running on top of 65df66fe52a6ffac086881d4b7beb9aebcb1a3b2, so this seems like a regression since then.
Updated by Jeff Layton over 6 years ago
This looks like rocksdb got updated over the weekend?
commit 75922203337deec1ffd9eaf8993bfbfb356967b9
Merge: 8c992762a936 652bc5e83288
Author: Kefu Chai <tchaikov@gmail.com>
Date:   Sat Sep 16 01:23:56 2017 +0800

    Merge pull request #17388 from tchaikov/wip-rocksdb

    rocksdb: sync with upstream

    Reviewed-by: Mark Nelson <mnelson@redhat.com>
    Reviewed-by: Sage Weil <sage@redhat.com>
Updated by Jeff Layton over 6 years ago
I took a quick look at the rocksdb commits that went in between those two releases, and there is quite a bit of churn around the handling of the shards_ array. My naive guess is that something in there broke it.
Updated by Brad Hubbard over 6 years ago
I can confirm this is definitely caused by 75922203337deec1ffd9eaf8993bfbfb356967b9. Running a vstart cluster at 8c992762a9363cd39374e47541e25440900eb1ea (the commit before) works fine.
Updated by Kefu Chai over 6 years ago
- Subject changed from crash in LRUCache destructor to crash in rocksdb LRUCache destructor with tcmalloc v4.2.6 / gperf-tools v2.5.93
- Category set to build
- Status changed from New to Fix Under Review
- Assignee set to Kefu Chai
Updated by Kefu Chai over 6 years ago
tcmalloc is provided by gperf-tools. If you are using the buggy gperf-tools 2.5.93, please install the latest gperf-tools:
$ ./configure --prefix=$HOME/local # under gperftools
$ make install
$ GPERF_ROOT=$HOME/local cmake .. # under ceph/build
Updated by Kefu Chai over 6 years ago
- Status changed from Fix Under Review to Resolved
BTW, the gperf-tools in fc25 should be fixed. IMO, it's a bug in gperf-tools 2.5.93.
Updated by Jeff Layton over 6 years ago
Kefu Chai wrote:
BTW, the gperf-tools in fc25 should be fixed. IMO, it's a bug in gperf-tools 2.5.93.
FWIW, I'm using f26, but either way...it'd be good to open a Fedora bug, so we can get the packages there fixed.
Updated by Kefu Chai over 6 years ago
thanks to Brad, fc26 bug reported at https://bugzilla.redhat.com/show_bug.cgi?id=1494309
Updated by Kefu Chai about 6 years ago
- Related to Bug #23653: tcmalloc Attempt to free invalid pointer 0x55de11f2a540 in rocksdb::LRUCache::~LRUCache during mkfs->_open_db added