Bug #11558
closed
assert(0 == "hit suicide timeout") in HeartbeatMap due to dead lock in rocksdb
Description
Hi all,
I am using kvstore with rocksdb as its kv implementation. The ceph version is 0.94 and the rocksdb commit is 05da593 (the default in the hammer branch).
I hit a bug.
The backtrace is:
#1 0x00000000009e701d in rocksdb::port::CondVar::Wait (this=this@entry=0x41d08b8) at port/port_posix.cc:80
#2 0x00000000009b4a00 in rocksdb::DBImpl::MakeRoomForWrite (this=this@entry=0x41d0780, cfd=cfd@entry=0x40c0700, force=false, superversions_to_free=superversions_to_free@entry=0x7f69cb405490, logs_to_free=logs_to_free@entry=0x7f69cb4054f0) at db/db_impl.cc:3958
#3 0x00000000009bd1cf in rocksdb::DBImpl::Write (this=0x41d0780, options=..., my_batch=0x4087760) at db/db_impl.cc:3671
#4 0x00000000009a2e60 in RocksDBStore::submit_transaction_sync (this=0x41c0b60, t=...) at os/RocksDBStore.cc:231

#1 0x00000000009e701d in rocksdb::port::CondVar::Wait (this=this@entry=0x7f69cac04450) at port/port_posix.cc:80
#2 0x00000000009bd044 in rocksdb::DBImpl::Write (this=0x41d0780, options=..., my_batch=0x4084fb0) at db/db_impl.cc:3632
#3 0x00000000009a2e60 in RocksDBStore::submit_transaction_sync (this=0x41c0b60, t=...) at os/RocksDBStore.cc:231
#4 0x000000000094694a in submit_transaction_sync (t=..., this=<optimized out>) at os/GenericObjectMap.h:133
When I change the rocksdb version to 6ca7bef (the default used by the ceph master branch), everything works well.
Updated by Haomai Wang almost 9 years ago
- Category set to OSD
- Target version set to v0.94
Updated by Xinze Chi almost 9 years ago
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
ceph version 0.94-17-gc341f91 (c341f91ed1be851f60ceec37ad5789c4ddd4122e)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x78) [0xbbf8d8]
2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2a9) [0xafcdc9]
3: (ceph::HeartbeatMap::is_healthy()+0xce) [0xafd64e]
4: (ceph::HeartbeatMap::check_touch_file()+0x17) [0xafdd27]
5: (CephContextServiceThread::entry()+0x14b) [0xbcf62b]
6: (()+0x7df3) [0x7f3922606df3]
7: (clone()+0x6d) [0x7f3920ed53dd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
This is the log from the ceph osd:
2015-05-07 15:05:49.627738 7f391fd65700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f39090b1700' had timed out after 15
2015-05-07 15:05:49.627740 7f391fd65700 1 heartbeat_map is_healthy 'KeyValueStore::op_tp thread 0x7f3917a39700' had timed out after 60
2015-05-07 15:05:49.627742 7f391fd65700 1 heartbeat_map is_healthy 'KeyValueStore::op_tp thread 0x7f391823a700' had timed out after 60
2015-05-07 15:05:54.627869 7f391fd65700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f39080af700' had timed out after 15
2015-05-07 15:05:54.627886 7f391fd65700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f39090b1700' had timed out after 15
2015-05-07 15:05:54.627889 7f391fd65700 1 heartbeat_map is_healthy 'KeyValueStore::op_tp thread 0x7f3917a39700' had timed out after 60
2015-05-07 15:05:54.627891 7f391fd65700 1 heartbeat_map is_healthy 'KeyValueStore::op_tp thread 0x7f391823a700' had timed out after 60
2015-05-07 15:05:59.628012 7f391fd65700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f39080af700' had timed out after 15
2015-05-07 15:05:59.628029 7f391fd65700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f39090b1700' had timed out after 15
2015-05-07 15:05:59.628035 7f391fd65700 1 heartbeat_map is_healthy 'KeyValueStore::op_tp thread 0x7f3917a39700' had timed out after 60
2015-05-07 15:05:59.628036 7f391fd65700 1 heartbeat_map is_healthy 'KeyValueStore::op_tp thread 0x7f391823a700' had timed out after 60
2015-05-07 15:06:04.628159 7f391fd65700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f39080af700' had timed out after 15
2015-05-07 15:06:04.628176 7f391fd65700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f39090b1700' had timed out after 15
2015-05-07 15:06:04.628178 7f391fd65700 1 heartbeat_map is_healthy 'KeyValueStore::op_tp thread 0x7f3917a39700' had timed out after 60
2015-05-07 15:06:04.628180 7f391fd65700 1 heartbeat_map is_healthy 'KeyValueStore::op_tp thread 0x7f391823a700' had timed out after 60
2015-05-07 15:06:09.628244 7f391fd65700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f39080af700' had timed out after 15
2015-05-07 15:06:09.628262 7f391fd65700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f39090b1700' had timed out after 15
2015-05-07 15:06:09.628264 7f391fd65700 1 heartbeat_map is_healthy 'KeyValueStore::op_tp thread 0x7f3917a39700' had timed out after 60
2015-05-07 15:06:09.628265 7f391fd65700 1 heartbeat_map is_healthy 'KeyValueStore::op_tp thread 0x7f391823a700' had timed out after 60
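To make the repeating pattern in the log above easier to see, here is a minimal sketch that counts how often each thread is reported as timed out. The regular expression is inferred only from the lines quoted above, and the names `TIMEOUT_RE` and `summarize_timeouts` are hypothetical helpers, not part of ceph.

```python
import re
from collections import Counter

# Pattern inferred from the log lines quoted in this report, e.g.
#   ... heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f39090b1700' had timed out after 15
TIMEOUT_RE = re.compile(
    r"heartbeat_map is_healthy '(?P<pool>\S+) thread (?P<tid>0x[0-9a-f]+)' "
    r"had timed out after (?P<grace>\d+)"
)


def summarize_timeouts(lines):
    """Count timeout reports per (thread pool, thread id, grace seconds)."""
    counts = Counter()
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            counts[(m["pool"], m["tid"], int(m["grace"]))] += 1
    return counts


# Three lines copied from the log above.
sample = [
    "2015-05-07 15:05:49.627738 7f391fd65700 1 heartbeat_map is_healthy "
    "'OSD::osd_op_tp thread 0x7f39090b1700' had timed out after 15",
    "2015-05-07 15:05:49.627740 7f391fd65700 1 heartbeat_map is_healthy "
    "'KeyValueStore::op_tp thread 0x7f3917a39700' had timed out after 60",
    "2015-05-07 15:05:54.627886 7f391fd65700 1 heartbeat_map is_healthy "
    "'OSD::osd_op_tp thread 0x7f39090b1700' had timed out after 15",
]

for key, n in sorted(summarize_timeouts(sample).items()):
    print(key, n)
```

Run against the full log, the same four threads (two OSD::osd_op_tp and two KeyValueStore::op_tp) show up every five seconds, which is consistent with worker threads that never return from the blocked write.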
Updated by Kefu Chai almost 9 years ago
- Subject changed from rocksdb bug to assert(0 == "hit suicide timeout") in HeartbeatMap due to dead lock in rocksdb
Updated by Xinze Chi almost 9 years ago
The default rocksdb::max_open_files is set to 5000 in commit 0bd767fb7ecca78033dc9d99f221e88ad0c4b289.
But the system's max open files limit is 1024.
So I think rocksdb::max_open_files 5000 should probably not be the default value in ceph.conf.
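The mismatch described above can be checked before starting the daemon. The following is a rough sketch, assuming a Unix host where the Python `resource` module is available; `fd_shortfall` and `current_soft_nofile` are hypothetical helper names, and the check only compares the two numbers. It does not show that the limit is what caused the hang.

```python
import resource


def fd_shortfall(max_open_files, soft_limit):
    """Return how many descriptors the configured value exceeds the limit by.

    soft_limit is None when the process limit is unlimited.
    """
    if soft_limit is None:
        return 0
    return max(0, max_open_files - soft_limit)


def current_soft_nofile():
    """Read this process's soft RLIMIT_NOFILE; None means unlimited."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return None if soft == resource.RLIM_INFINITY else soft


# Values taken from this report: configured 5000, system limit 1024.
print(fd_shortfall(5000, 1024))

# The same check against the limit of the current process.
print(fd_shortfall(5000, current_soft_nofile()))
```

A non-zero result means the configured value is higher than what the process is allowed to open, so either the limit has to be raised or the setting lowered.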
Updated by Kefu Chai almost 9 years ago
- Status changed from New to Won't Fix
Since the kv store is still an experimental feature and its settings are subject to change, I am closing this as "won't fix". I am also asking xiaoxi for his opinion at https://github.com/ceph/ceph/commit/0bd767fb7ecca78033dc9d99f221e88ad0c4b289 .