Bug #21422

closed

crash in rocksdb LRUCache destructor with tcmalloc v4.2.6 / gperf-tools v2.5.93

Added by Jeff Layton over 6 years ago. Updated about 6 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: build
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: Yes
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Did a git pull this morning (HEAD at 30b16ac8142ec87942d852992b8ad6672437ccf6), added a few of my patches on top and then tried to start vstart:

$ ../src/stop.sh ; ../src/vstart.sh --mon_num 1 --osd_num 1 --mds_num 1 -n
2017-09-18 13:49:52.036071 7ffff7f98540 -1 WARNING: all dangerous and experimental features are enabled.
2017-09-18 13:49:52.036173 7ffff7f98540 -1 WARNING: all dangerous and experimental features are enabled.
2017-09-18 13:49:52.038735 7ffff7f98540 -1 WARNING: all dangerous and experimental features are enabled.
2017-09-18 13:49:52.062077 7ffff7f98540 -1 WARNING: all dangerous and experimental features are enabled.
2017-09-18 13:49:52.062254 7ffff7f98540 -1 WARNING: all dangerous and experimental features are enabled.
2017-09-18 13:49:52.064798 7ffff7f98540 -1 WARNING: all dangerous and experimental features are enabled.
mon_num:1
=== mon.a === 
Stopping Ceph mon.a on tleilax...done
'rm' '-f' 'core*' 
hostname tleilax
ip 192.168.1.3
port 6789
'/home/jlayton/git/ceph/build/bin/ceph-authtool' '--create-keyring' '--gen-key' '--name=mon.' '/home/jlayton/git/ceph/build/keyring' '--cap' 'mon' 'allow *' 
creating /home/jlayton/git/ceph/build/keyring
'/home/jlayton/git/ceph/build/bin/ceph-authtool' '--gen-key' '--name=client.admin' '--set-uid=0' '--cap' 'mon' 'allow *' '--cap' 'osd' 'allow *' '--cap' 'mds' 'allow *' '--cap' 'mgr' 'allow *' '/home/jlayton/git/ceph/build/keyring' 
'/home/jlayton/git/ceph/build/bin/ceph-authtool' '--gen-key' '--name=client.rgw' '--cap' 'mon' 'allow rw' '--cap' 'osd' 'allow rwx' '--cap' 'mgr' 'allow rw' '/home/jlayton/git/ceph/build/keyring' 
'/home/jlayton/git/ceph/build/bin/monmaptool' '--create' '--clobber' '--add' 'a' '192.168.1.3:6789' '--print' '/tmp/ceph_monmap.21609' 
/home/jlayton/git/ceph/build/bin/monmaptool: monmap file /tmp/ceph_monmap.21609
/home/jlayton/git/ceph/build/bin/monmaptool: generated fsid 2b8f3b27-ed48-4f7d-aa9b-206e2bc34030
epoch 0
fsid 2b8f3b27-ed48-4f7d-aa9b-206e2bc34030
last_changed 2017-09-18 13:49:52.651602
created 2017-09-18 13:49:52.651602
0: 192.168.1.3:6789/0 mon.a
/home/jlayton/git/ceph/build/bin/monmaptool: writing epoch 0 to /tmp/ceph_monmap.21609 (1 monitors)
'rm' '-rf' '--' '/home/jlayton/git/ceph/build/dev/mon.a' 
'mkdir' '-p' '/home/jlayton/git/ceph/build/dev/mon.a' 
'/home/jlayton/git/ceph/build/bin/ceph-mon' '--mkfs' '-c' '/home/jlayton/git/ceph/build/ceph.conf' '-i' 'a' '--monmap=/tmp/ceph_monmap.21609' '--keyring=/home/jlayton/git/ceph/build/keyring' 
src/tcmalloc.cc:284] Attempt to free invalid pointer 0x55555f072540 
*** Caught signal (Aborted) **
 in thread 7ffff7fbb240 thread_name:ceph-mon
 ceph version 12.1.2-2065-g9ae8bd590896 (9ae8bd5908960f93de2928d3a5c5299fcca9b4d4) mimic (dev)
 1: (()+0x8daf48) [0x555555e2ef48]
 2: (()+0x123b0) [0x7ffff63453b0]
 3: (gsignal()+0xcb) [0x7ffff462369b]
 4: (abort()+0x1b0) [0x7ffff46254a0]
 5: (tcmalloc::Log(tcmalloc::LogMode, char const*, int, tcmalloc::LogItem, tcmalloc::LogItem, tcmalloc::LogItem, tcmalloc::LogItem)+0x229) [0x7ffff58c3a29]
 6: (()+0x163d9) [0x7ffff58b83d9]
 7: (rocksdb::LRUCache::~LRUCache()+0x65) [0x555555e96825]
 8: (std::_Sp_counted_ptr<rocksdb::BlockBasedTableFactory*, (__gnu_cxx::_Lock_policy)2>::_M_dispose()+0x25a) [0x555555ef64ba]
 9: (std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()+0x46) [0x5555558aa496]
 10: (rocksdb::ColumnFamilyOptions::~ColumnFamilyOptions()+0x1e) [0x5555559bb3de]
 11: (RocksDBStore::init(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)+0x71) [0x5555559b2c41]
 12: (MonitorDBStore::_open(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)+0x262) [0x5555558a7e82]
 13: (MonitorDBStore::create_and_open(std::ostream&)+0xb3) [0x5555558a83f3]
 14: (main()+0x1162) [0x5555557d52c2]
 15: (__libc_start_main()+0xea) [0x7ffff460d50a]
 16: (_start()+0x2a) [0x5555558a559a]
2017-09-18 13:49:52.699756 7ffff7fbb240 -1 *** Caught signal (Aborted) **
 in thread 7ffff7fbb240 thread_name:ceph-mon

 ceph version 12.1.2-2065-g9ae8bd590896 (9ae8bd5908960f93de2928d3a5c5299fcca9b4d4) mimic (dev)
 1: (()+0x8daf48) [0x555555e2ef48]
 2: (()+0x123b0) [0x7ffff63453b0]
 3: (gsignal()+0xcb) [0x7ffff462369b]
 4: (abort()+0x1b0) [0x7ffff46254a0]
 5: (tcmalloc::Log(tcmalloc::LogMode, char const*, int, tcmalloc::LogItem, tcmalloc::LogItem, tcmalloc::LogItem, tcmalloc::LogItem)+0x229) [0x7ffff58c3a29]
 6: (()+0x163d9) [0x7ffff58b83d9]
 7: (rocksdb::LRUCache::~LRUCache()+0x65) [0x555555e96825]
 8: (std::_Sp_counted_ptr<rocksdb::BlockBasedTableFactory*, (__gnu_cxx::_Lock_policy)2>::_M_dispose()+0x25a) [0x555555ef64ba]
 9: (std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()+0x46) [0x5555558aa496]
 10: (rocksdb::ColumnFamilyOptions::~ColumnFamilyOptions()+0x1e) [0x5555559bb3de]
 11: (RocksDBStore::init(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)+0x71) [0x5555559b2c41]
 12: (MonitorDBStore::_open(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)+0x262) [0x5555558a7e82]
 13: (MonitorDBStore::create_and_open(std::ostream&)+0xb3) [0x5555558a83f3]
 14: (main()+0x1162) [0x5555557d52c2]
 15: (__libc_start_main()+0xea) [0x7ffff460d50a]
 16: (_start()+0x2a) [0x5555558a559a]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

     0> 2017-09-18 13:49:52.699756 7ffff7fbb240 -1 *** Caught signal (Aborted) **
 in thread 7ffff7fbb240 thread_name:ceph-mon

 ceph version 12.1.2-2065-g9ae8bd590896 (9ae8bd5908960f93de2928d3a5c5299fcca9b4d4) mimic (dev)
 1: (()+0x8daf48) [0x555555e2ef48]
 2: (()+0x123b0) [0x7ffff63453b0]
 3: (gsignal()+0xcb) [0x7ffff462369b]
 4: (abort()+0x1b0) [0x7ffff46254a0]
 5: (tcmalloc::Log(tcmalloc::LogMode, char const*, int, tcmalloc::LogItem, tcmalloc::LogItem, tcmalloc::LogItem, tcmalloc::LogItem)+0x229) [0x7ffff58c3a29]
 6: (()+0x163d9) [0x7ffff58b83d9]
 7: (rocksdb::LRUCache::~LRUCache()+0x65) [0x555555e96825]
 8: (std::_Sp_counted_ptr<rocksdb::BlockBasedTableFactory*, (__gnu_cxx::_Lock_policy)2>::_M_dispose()+0x25a) [0x555555ef64ba]
 9: (std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()+0x46) [0x5555558aa496]
 10: (rocksdb::ColumnFamilyOptions::~ColumnFamilyOptions()+0x1e) [0x5555559bb3de]
 11: (RocksDBStore::init(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)+0x71) [0x5555559b2c41]
 12: (MonitorDBStore::_open(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)+0x262) [0x5555558a7e82]
 13: (MonitorDBStore::create_and_open(std::ostream&)+0xb3) [0x5555558a83f3]
 14: (main()+0x1162) [0x5555557d52c2]
 15: (__libc_start_main()+0xea) [0x7ffff460d50a]
 16: (_start()+0x2a) [0x5555558a559a]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

../src/vstart.sh: line 374: 21736 Aborted                 (core dumped) "$@" 

I'm building and running this on a relatively up-to-date Fedora 26 (f26) install. This looks like a double free or possibly a corrupt pointer. I'm rebuilding the actual master branch now to verify that it's nothing I broke, but most of my patches would not affect this.


Related issues 1 (0 open, 1 closed)

Related to bluestore - Bug #23653: tcmalloc Attempt to free invalid pointer 0x55de11f2a540 in rocksdb::LRUCache::~LRUCache during mkfs->_open_db (Resolved, Kefu Chai, 04/11/2018)

Actions #1

Updated by Jeff Layton over 6 years ago

(gdb) bt
#0  0x00007ffff634525b in raise () from /lib64/libpthread.so.0
#1  0x0000555555e2ee52 in reraise_fatal (signum=6) at /home/jlayton/git/ceph/src/global/signal_handler.cc:74
#2  handle_fatal_signal (signum=6) at /home/jlayton/git/ceph/src/global/signal_handler.cc:138
#3  <signal handler called>
#4  0x00007ffff462369b in raise () from /lib64/libc.so.6
#5  0x00007ffff46254a0 in abort () from /lib64/libc.so.6
#6  0x00007ffff58c3a29 in tcmalloc::Log(tcmalloc::LogMode, char const*, int, tcmalloc::LogItem, tcmalloc::LogItem, tcmalloc::LogItem, tcmalloc::LogItem) ()
   from /lib64/libtcmalloc.so.4
#7  0x00007ffff58b83d9 in (anonymous namespace)::InvalidFree(void*) () from /lib64/libtcmalloc.so.4
#8  0x0000555555e966a5 in rocksdb::LRUCache::~LRUCache (this=0x55555f1d5660, __in_chrg=<optimized out>) at /home/jlayton/git/ceph/src/rocksdb/cache/lru_cache.cc:476
#9  0x0000555555ef633a in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x55555f1d5650) at /usr/include/c++/7/bits/shared_ptr_base.h:154
#10 std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x55555ee91108, __in_chrg=<optimized out>)
    at /usr/include/c++/7/bits/shared_ptr_base.h:682
#11 std::__shared_ptr<rocksdb::Cache, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x55555ee91100, __in_chrg=<optimized out>)
    at /usr/include/c++/7/bits/shared_ptr_base.h:1121
#12 std::shared_ptr<rocksdb::Cache>::~shared_ptr (this=0x55555ee91100, __in_chrg=<optimized out>) at /usr/include/c++/7/bits/shared_ptr.h:93
#13 rocksdb::BlockBasedTableOptions::~BlockBasedTableOptions (this=0x55555ee910e8, __in_chrg=<optimized out>)
    at /home/jlayton/git/ceph/src/rocksdb/include/rocksdb/table.h:52
#14 rocksdb::BlockBasedTableFactory::~BlockBasedTableFactory (this=0x55555ee910e0, __in_chrg=<optimized out>)
    at /home/jlayton/git/ceph/src/rocksdb/table/block_based_table_factory.h:34
#15 rocksdb::BlockBasedTableFactory::~BlockBasedTableFactory (this=0x55555ee910e0, __in_chrg=<optimized out>)
    at /home/jlayton/git/ceph/src/rocksdb/table/block_based_table_factory.h:34
#16 std::_Sp_counted_ptr<rocksdb::BlockBasedTableFactory*, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=<optimized out>)
    at /usr/include/c++/7/bits/shared_ptr_base.h:376
#17 0x00005555558aa4a6 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x55555f1f0900) at /usr/include/c++/7/bits/shared_ptr_base.h:154
#18 0x00005555559bb3fe in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x7ffffffface0, __in_chrg=<optimized out>)
    at /usr/include/c++/7/bits/shared_ptr_base.h:682
#19 std::__shared_ptr<rocksdb::TableFactory, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7fffffffacd8, __in_chrg=<optimized out>)
    at /usr/include/c++/7/bits/shared_ptr_base.h:1121
#20 std::shared_ptr<rocksdb::TableFactory>::~shared_ptr (this=0x7fffffffacd8, __in_chrg=<optimized out>) at /usr/include/c++/7/bits/shared_ptr.h:93
#21 rocksdb::ColumnFamilyOptions::~ColumnFamilyOptions (this=0x7fffffffab18, __in_chrg=<optimized out>)
    at /home/jlayton/git/ceph/src/rocksdb/include/rocksdb/options.h:81
#22 0x00005555559b2c61 in rocksdb::Options::~Options (this=0x7fffffffa920, __in_chrg=<optimized out>)
    at /home/jlayton/git/ceph/src/rocksdb/include/rocksdb/options.h:905
#23 RocksDBStore::init (this=0x55555f15e000, _options_str=...) at /home/jlayton/git/ceph/src/kv/RocksDBStore.cc:219
#24 0x00005555558a7e92 in MonitorDBStore::_open (this=<optimized out>, kv_type=...) at /home/jlayton/git/ceph/src/mon/MonitorDBStore.h:624
#25 0x00005555558a8403 in MonitorDBStore::create_and_open (this=0x7fffffffb910, out=...) at /home/jlayton/git/ceph/src/mon/MonitorDBStore.h:658
#26 0x00005555557d52d2 in main (argc=<optimized out>, argv=0x7fffffffd568) at /home/jlayton/git/ceph/src/ceph_mon.cc:421
(gdb) f 8
#8  0x0000555555e966a5 in rocksdb::LRUCache::~LRUCache (this=0x55555f1d5660, __in_chrg=<optimized out>) at /home/jlayton/git/ceph/src/rocksdb/cache/lru_cache.cc:476
476     LRUCache::~LRUCache() { delete[] shards_; }
(gdb) p shards_
$1 = (rocksdb::LRUCacheShard *) 0x55555f072580
(gdb) p this
$2 = (rocksdb::LRUCache * const) 0x55555f1d5660
(gdb) p *this
$3 = {<rocksdb::ShardedCache> = {<rocksdb::Cache> = {_vptr.Cache = 0x55555641f528 <vtable for rocksdb::LRUCache+16>}, num_shard_bits_ = 4, capacity_mutex_ = {mu_ = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, 
          __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}}, capacity_ = 8388608, strict_capacity_limit_ = false, last_id_ = {<std::__atomic_base<unsigned long>> = {
        static _S_alignment = 8, _M_i = 1}, <No data fields>}}, shards_ = 0x55555f072580, num_shards_ = 16}

That's about as far as I can dig as I don't really know this piece of code at all. I can reproduce this at will though, and am happy to test fixes if anyone has one.
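
One detail worth noting from the output above: the pointer in the tcmalloc abort message (0x55555f072540) is not the same as the shards_ value that gdb prints (0x55555f072580). A minimal shell sketch of the arithmetic, assuming both addresses come from comparable runs of the same build (the vstart log and the gdb session may not be the same process):

```shell
# Addresses copied from the report: the pointer tcmalloc refused to free,
# and the shards_ member printed by gdb.
freed=0x55555f072540
shards=0x55555f072580

# Distance between the array pointer and the pointer handed to free.
offset=$(( $shards - $freed ))
echo "offset: $offset bytes"   # prints: offset: 64 bytes
```

A 64-byte gap would be consistent with bookkeeping placed in front of a cache-line-aligned array, so that delete[] frees an address slightly below shards_. That reading is an assumption; nothing in the trace itself confirms it.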

Actions #2

Updated by Jeff Layton over 6 years ago

I was previously running on top of 65df66fe52a6ffac086881d4b7beb9aebcb1a3b2, so this seems like a regression since then.

Actions #3

Updated by Jeff Layton over 6 years ago

This looks like rocksdb got updated over the weekend?


commit 75922203337deec1ffd9eaf8993bfbfb356967b9
Merge: 8c992762a936 652bc5e83288
Author: Kefu Chai <tchaikov@gmail.com>
Date:   Sat Sep 16 01:23:56 2017 +0800

    Merge pull request #17388 from tchaikov/wip-rocksdb

    rocksdb: sync with upstream

    Reviewed-by: Mark Nelson <mnelson@redhat.com>
    Reviewed-by: Sage Weil <sage@redhat.com>

Actions #4

Updated by Jeff Layton over 6 years ago

I took a quick look at the rocksdb commits that went in between those two releases, and there is quite a bit of churn around the handling of the shards_ array. My naive guess is that something in there broke it.

Actions #5

Updated by Brad Hubbard over 6 years ago

I can confirm this is definitely caused by 75922203337deec1ffd9eaf8993bfbfb356967b9. Running a vstart cluster at 8c992762a9363cd39374e47541e25440900eb1ea (the commit before) works fine.

Actions #6

Updated by Kefu Chai over 6 years ago

  • Subject changed from crash in LRUCache destructor to crash in rocksdb LRUCache destructor with tcmalloc v4.2.6 / gperf-tools v2.5.93
  • Category set to build
  • Status changed from New to Fix Under Review
  • Assignee set to Kefu Chai

Actions #7

Updated by Kefu Chai over 6 years ago

tcmalloc is provided by gperf-tools. If you are using the buggy gperf-tools 2.5.93, please install the latest gperf-tools:

$ ./configure --prefix=$HOME/local # under gperftools
$ make install
$ GPERF_ROOT=$HOME/local cmake .. # under ceph/build
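
After rebuilding, it may help to check which libtcmalloc the binary actually resolves to, since the backtrace above shows the system copy in /lib64. A rough sketch, assuming ldd and awk are available; the helper name tcmalloc_lib is hypothetical and not part of Ceph or gperftools:

```shell
# Print the path of the libtcmalloc that the dynamic loader would pick
# for the given binary. Prints nothing if no libtcmalloc is linked or
# if ldd cannot read the file.
tcmalloc_lib() {
    ldd "$1" 2>/dev/null | awk '/libtcmalloc/ { print $3 }'
}

# Example call from ceph/build; bin/ceph-mon is the binary path seen in the log.
tcmalloc_lib bin/ceph-mon
```

If this still prints /lib64/libtcmalloc.so.4, the build is presumably still picking up the distribution package rather than the copy installed under $HOME/local.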

Actions #8

Updated by Kefu Chai over 6 years ago

  • Status changed from Fix Under Review to Resolved

BTW, the gperf-tools in fc25 should be fixed. IMO, it's a bug in gperf-tools 2.5.93.

Actions #9

Updated by Jeff Layton over 6 years ago

Kefu Chai wrote:

BTW, the gperf-tools in fc25 should be fixed. IMO, it's a bug in gperf-tools 2.5.93.

FWIW, I'm using f26, but either way...it'd be good to open a Fedora bug, so we can get the packages there fixed.

Actions #10

Updated by Kefu Chai over 6 years ago

thanks to Brad, fc26 bug reported at https://bugzilla.redhat.com/show_bug.cgi?id=1494309

Actions #12

Updated by Kefu Chai about 6 years ago

  • Related to Bug #23653: tcmalloc Attempt to free invalid pointer 0x55de11f2a540 in rocksdb::LRUCache::~LRUCache during mkfs->_open_db added