Bug #11558

assert(0 == "hit suicide timeout") in HeartbeatMap due to dead lock in rocksdb

Added by Xinze Chi about 5 years ago. Updated about 5 years ago.

Status:
Won't Fix
Priority:
Urgent
Assignee:
-
Category:
OSD
Target version:
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

Hi all,

I am using kvstore with rocksdb as its kv implementation. The ceph version is 0.94 and the rocksdb commit is 05da593 (the default in the hammer branch).

I hit a bug with this setup.

The backtrace is:

#1  0x00000000009e701d in rocksdb::port::CondVar::Wait (this=this@entry=0x41d08b8) at port/port_posix.cc:80
#2  0x00000000009b4a00 in rocksdb::DBImpl::MakeRoomForWrite (this=this@entry=0x41d0780, cfd=cfd@entry=0x40c0700, force=false, superversions_to_free=superversions_to_free@entry=0x7f69cb405490, 
     logs_to_free=logs_to_free@entry=0x7f69cb4054f0) at db/db_impl.cc:3958
#3  0x00000000009bd1cf in rocksdb::DBImpl::Write (this=0x41d0780, options=..., my_batch=0x4087760) at db/db_impl.cc:3671
#4  0x00000000009a2e60 in RocksDBStore::submit_transaction_sync (this=0x41c0b60, t=...) at os/RocksDBStore.cc:231

#1  0x00000000009e701d in rocksdb::port::CondVar::Wait (this=this@entry=0x7f69cac04450) at port/port_posix.cc:80
#2  0x00000000009bd044 in rocksdb::DBImpl::Write (this=0x41d0780, options=..., my_batch=0x4084fb0) at db/db_impl.cc:3632
#3  0x00000000009a2e60 in RocksDBStore::submit_transaction_sync (this=0x41c0b60, t=...) at os/RocksDBStore.cc:231
#4  0x000000000094694a in submit_transaction_sync (t=..., this=<optimized out>) at os/GenericObjectMap.h:133

When I change the rocksdb version to 6ca7bef (the default used by the ceph master branch), everything works well.

History

#1 Updated by Haomai Wang about 5 years ago

  • Category set to OSD
  • Target version set to v0.94

#2 Updated by Kefu Chai about 5 years ago

  • Description updated (diff)

#3 Updated by Xinze Chi about 5 years ago

common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")

ceph version 0.94-17-gc341f91 (c341f91ed1be851f60ceec37ad5789c4ddd4122e)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x78) [0xbbf8d8]
2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2a9) [0xafcdc9]
3: (ceph::HeartbeatMap::is_healthy()+0xce) [0xafd64e]
4: (ceph::HeartbeatMap::check_touch_file()+0x17) [0xafdd27]
5: (CephContextServiceThread::entry()+0x14b) [0xbcf62b]
6: (()+0x7df3) [0x7f3922606df3]
7: (clone()+0x6d) [0x7f3920ed53dd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

This is the log from the ceph osd:

2015-05-07 15:05:49.627738 7f391fd65700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f39090b1700' had timed out after 15
2015-05-07 15:05:49.627740 7f391fd65700 1 heartbeat_map is_healthy 'KeyValueStore::op_tp thread 0x7f3917a39700' had timed out after 60
2015-05-07 15:05:49.627742 7f391fd65700 1 heartbeat_map is_healthy 'KeyValueStore::op_tp thread 0x7f391823a700' had timed out after 60
2015-05-07 15:05:54.627869 7f391fd65700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f39080af700' had timed out after 15
2015-05-07 15:05:54.627886 7f391fd65700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f39090b1700' had timed out after 15
2015-05-07 15:05:54.627889 7f391fd65700 1 heartbeat_map is_healthy 'KeyValueStore::op_tp thread 0x7f3917a39700' had timed out after 60
2015-05-07 15:05:54.627891 7f391fd65700 1 heartbeat_map is_healthy 'KeyValueStore::op_tp thread 0x7f391823a700' had timed out after 60
2015-05-07 15:05:59.628012 7f391fd65700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f39080af700' had timed out after 15
2015-05-07 15:05:59.628029 7f391fd65700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f39090b1700' had timed out after 15
2015-05-07 15:05:59.628035 7f391fd65700 1 heartbeat_map is_healthy 'KeyValueStore::op_tp thread 0x7f3917a39700' had timed out after 60
2015-05-07 15:05:59.628036 7f391fd65700 1 heartbeat_map is_healthy 'KeyValueStore::op_tp thread 0x7f391823a700' had timed out after 60
2015-05-07 15:06:04.628159 7f391fd65700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f39080af700' had timed out after 15
2015-05-07 15:06:04.628176 7f391fd65700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f39090b1700' had timed out after 15
2015-05-07 15:06:04.628178 7f391fd65700 1 heartbeat_map is_healthy 'KeyValueStore::op_tp thread 0x7f3917a39700' had timed out after 60
2015-05-07 15:06:04.628180 7f391fd65700 1 heartbeat_map is_healthy 'KeyValueStore::op_tp thread 0x7f391823a700' had timed out after 60
2015-05-07 15:06:09.628244 7f391fd65700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f39080af700' had timed out after 15
2015-05-07 15:06:09.628262 7f391fd65700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f39090b1700' had timed out after 15
2015-05-07 15:06:09.628264 7f391fd65700 1 heartbeat_map is_healthy 'KeyValueStore::op_tp thread 0x7f3917a39700' had timed out after 60
2015-05-07 15:06:09.628265 7f391fd65700 1 heartbeat_map is_healthy 'KeyValueStore::op_tp thread 0x7f391823a700' had timed out after 60
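
The log above repeats the same few threads every five seconds. As a reading aid, here is a minimal sketch that collapses such lines into one entry per stuck thread. It assumes the line layout seen in this log; `summarize_timeouts` and `PATTERN` are hypothetical names, not part of Ceph.

```python
import re
from collections import Counter

# Matches the heartbeat_map lines as they appear in the log above.
PATTERN = re.compile(
    r"heartbeat_map is_healthy '(?P<pool>\S+) thread (?P<tid>0x[0-9a-f]+)' "
    r"had timed out after (?P<timeout>\d+)"
)

def summarize_timeouts(lines):
    """Count how often each (thread pool, thread id, timeout) is reported."""
    counts = Counter()
    for line in lines:
        m = PATTERN.search(line)
        if m:
            key = (m.group("pool"), m.group("tid"), int(m.group("timeout")))
            counts[key] += 1
    return counts

# Three lines copied from the log above.
sample = [
    "2015-05-07 15:05:49.627738 7f391fd65700 1 heartbeat_map is_healthy "
    "'OSD::osd_op_tp thread 0x7f39090b1700' had timed out after 15",
    "2015-05-07 15:05:49.627740 7f391fd65700 1 heartbeat_map is_healthy "
    "'KeyValueStore::op_tp thread 0x7f3917a39700' had timed out after 60",
    "2015-05-07 15:05:54.627886 7f391fd65700 1 heartbeat_map is_healthy "
    "'OSD::osd_op_tp thread 0x7f39090b1700' had timed out after 15",
]

for (pool, tid, timeout), n in sorted(summarize_timeouts(sample).items()):
    print(f"{pool} {tid}: reported {n} time(s), timeout {timeout}s")
```

Run over the full log above, this would list four distinct threads: two in OSD::osd_op_tp and two in KeyValueStore::op_tp.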

#4 Updated by Kefu Chai about 5 years ago

  • Subject changed from rocksdb bug to assert(0 == "hit suicide timeout") in HeartbeatMap due to dead lock in rocksdb

#5 Updated by Xinze Chi about 5 years ago

Commit 0bd767fb7ecca78033dc9d99f221e88ad0c4b289 sets the default rocksdb::max_open_files to 5000, but the system limit on open files is 1024.

So I think 5000 should perhaps not be the default value of rocksdb::max_open_files in ceph.conf.
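
The mismatch between a configured file budget and the process limit can be checked before starting the daemon. This is a minimal sketch, assuming a Unix-like host and Python's standard `resource` module. The function `fits_fd_limit` and its `reserve` parameter are hypothetical and not part of Ceph or RocksDB, and the check does not by itself confirm that the limit caused the hang reported here.

```python
import resource

def fits_fd_limit(max_open_files, soft_limit=None, reserve=0):
    """Return True if max_open_files (plus an optional reserve for
    sockets, logs, etc.) fits under the soft RLIMIT_NOFILE limit."""
    if soft_limit is None:
        # Query the limit of the current process.
        soft_limit, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return max_open_files + reserve <= soft_limit

# Values from this report: configured 5000, system limit 1024.
if not fits_fd_limit(5000, soft_limit=1024):
    print("max_open_files exceeds the open-file limit")
```

Calling `fits_fd_limit(5000)` without `soft_limit` checks against the limit of the running process instead of the value quoted in this report.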

#6 Updated by Kefu Chai about 5 years ago

  • Status changed from New to Won't Fix

Since the kv store is still an experimental feature and its settings are subject to change, I am closing this as "won't fix". I am also asking xiaoxi for his opinion at https://github.com/ceph/ceph/commit/0bd767fb7ecca78033dc9d99f221e88ad0c4b289 .
