Bug #36172
osd: hit suicide timeout
Status: open
Description
ceph version 0.94.9-9.el7cp
An OSD drive died some days ago, and after a restart today it died again with the same error:
= osd.115 10.24.53.152:6807/15666 566 ==== osd_ping(you_died e15168 stamp 2018-09-25 08:55:11.163988) v2 ==== 47+0+0 (923195720 0 0) 0x2ce00f600 con 0x2b95a42c0
-2> 2018-09-25 08:55:12.980462 7fc88a684700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fc7e4492700' had timed out after 15
-1> 2018-09-25 08:55:12.980471 7fc88a684700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fc7e4492700' had suicide timed out after 150
0> 2018-09-25 08:55:12.981683 7fc88a684700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7fc88a684700 time 2018-09-25 08:55:12.980485
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
ceph version 0.94.9-9.el7cp (b83334e01379f267fb2f9ce729d74a0a8fa1e92c)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xb12ef5]
2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2d9) [0xa46399]
3: (ceph::HeartbeatMap::is_healthy()+0xde) [0xa46c8e]
4: (ceph::HeartbeatMap::check_touch_file()+0x2c) [0xa473ac]
5: (CephContextServiceThread::entry()+0x15b) [0xb2333b]
6: (()+0x7dc5) [0x7fc88de8ddc5]
7: (clone()+0x6d) [0x7fc88c97073d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_replay
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 keyvaluestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 xio
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.18.log
--- end dump of recent events ---
2018-09-25 08:55:13.015129 7fc88a684700 -1 *** Caught signal (Aborted) **
in thread 7fc88a684700
ceph version 0.94.9-9.el7cp (b83334e01379f267fb2f9ce729d74a0a8fa1e92c)
1: /usr/bin/ceph-osd() [0xa0f922]
2: (()+0xf370) [0x7fc88de95370]
3: (gsignal()+0x37) [0x7fc88c8ae1d7]
4: (abort()+0x148) [0x7fc88c8af8c8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fc88d1b2ab5]
6: (()+0x5ea26) [0x7fc88d1b0a26]
7: (()+0x5ea53) [0x7fc88d1b0a53]
8: (()+0x5ec73) [0x7fc88d1b0c73]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xb130ea]
10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2d9) [0xa46399]
11: (ceph::HeartbeatMap::is_healthy()+0xde) [0xa46c8e]
12: (ceph::HeartbeatMap::check_touch_file()+0x2c) [0xa473ac]
13: (CephContextServiceThread::entry()+0x15b) [0xb2333b]
14: (()+0x7dc5) [0x7fc88de8ddc5]
15: (clone()+0x6d) [0x7fc88c97073d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- begin dump of recent events ---
0> 2018-09-25 08:55:13.015129 7fc88a684700 -1 *** Caught signal (Aborted) **
in thread 7fc88a684700
ceph version 0.94.9-9.el7cp (b83334e01379f267fb2f9ce729d74a0a8fa1e92c)
1: /usr/bin/ceph-osd() [0xa0f922]
2: (()+0xf370) [0x7fc88de95370]
3: (gsignal()+0x37) [0x7fc88c8ae1d7]
4: (abort()+0x148) [0x7fc88c8af8c8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fc88d1b2ab5]
6: (()+0x5ea26) [0x7fc88d1b0a26]
7: (()+0x5ea53) [0x7fc88d1b0a53]
8: (()+0x5ec73) [0x7fc88d1b0c73]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xb130ea]
10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2d9) [0xa46399]
11: (ceph::HeartbeatMap::is_healthy()+0xde) [0xa46c8e]
12: (ceph::HeartbeatMap::check_touch_file()+0x2c) [0xa473ac]
13: (CephContextServiceThread::entry()+0x15b) [0xb2333b]
14: (()+0x7dc5) [0x7fc88de8ddc5]
15: (clone()+0x6d) [0x7fc88c97073d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_replay
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 keyvaluestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 xio
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.18.log
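For anyone triaging a similar log, the heartbeat_map lines above are the ones that name the stuck thread and the threshold it crossed. The following is only a rough sketch for pulling those fields out of an OSD log; the function name parse_heartbeat_timeouts and the dictionary keys are made up for illustration, and the regular expression is based solely on the two lines quoted in this report, so other Ceph versions may format them differently.

```python
import re

# Pattern derived from the two heartbeat_map lines in this report (an assumption,
# not an official log format specification).
HEARTBEAT_RE = re.compile(
    r"heartbeat_map is_healthy '(?P<pool>\S+) thread (?P<thread>0x[0-9a-f]+)' "
    r"had (?P<suicide>suicide )?timed out after (?P<seconds>\d+)"
)


def parse_heartbeat_timeouts(lines):
    """Return one dict per heartbeat timeout line found in `lines`."""
    events = []
    for line in lines:
        match = HEARTBEAT_RE.search(line)
        if match is None:
            continue  # not a heartbeat timeout line
        events.append({
            "pool": match.group("pool"),                    # e.g. OSD::osd_op_tp
            "thread": match.group("thread"),                # stuck worker thread id
            "suicide": match.group("suicide") is not None,  # True for the fatal threshold
            "seconds": int(match.group("seconds")),         # threshold that was exceeded
        })
    return events


# The two lines from this report.
sample = [
    "-2> 2018-09-25 08:55:12.980462 7fc88a684700 1 heartbeat_map is_healthy "
    "'OSD::osd_op_tp thread 0x7fc7e4492700' had timed out after 15",
    "-1> 2018-09-25 08:55:12.980471 7fc88a684700 1 heartbeat_map is_healthy "
    "'OSD::osd_op_tp thread 0x7fc7e4492700' had suicide timed out after 150",
]

for event in parse_heartbeat_timeouts(sample):
    print(event)
```

Here both events point at the same osd_op_tp thread, first past the 15 second threshold and then past the 150 second suicide threshold.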
Updated by John Spray over 5 years ago
- Project changed from Ceph to RADOS
- Subject changed from hit suicide timeout to osd: hit suicide timeout
- Category deleted (OSD)
Updated by Brad Hubbard over 5 years ago
- Assignee set to Brad Hubbard
Most likely the OSD can't flush filestore output to the hardware. Can you thoroughly check that the hardware is in perfect working order? BTW, Hammer has been EOL for over a year.
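To illustrate why a stalled disk ends in an abort rather than just a warning, here is a toy model of the two thresholds seen in the log (15 and 150 seconds). This is not Ceph's HeartbeatMap code; the class and function names (FakeClock, HeartbeatHandle, check) are hypothetical, and the sketch only assumes that a worker thread refreshes a timestamp while it makes progress and a separate checker compares the idle time against both limits.

```python
class FakeClock:
    """Manually advanced clock so the example runs instantly (illustrative only)."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


class HeartbeatHandle:
    """Tracks when a worker thread last reported progress."""

    def __init__(self, name, grace, suicide_grace, clock):
        self.name = name
        self.grace = grace                  # seconds before the thread counts as unhealthy
        self.suicide_grace = suicide_grace  # seconds before the process gives up
        self.clock = clock
        self.last_touch = clock()

    def touch(self):
        # A worker would call this each time it finishes a unit of work.
        self.last_touch = self.clock()


def check(handle):
    """Return True if healthy, False if past grace; raise once past suicide_grace."""
    idle = handle.clock() - handle.last_touch
    if idle > handle.suicide_grace:
        raise RuntimeError("hit suicide timeout")
    return idle <= handle.grace


clock = FakeClock()
handle = HeartbeatHandle("OSD::osd_op_tp", grace=15, suicide_grace=150, clock=clock)

print(check(handle))   # thread has just touched the handle
clock.advance(20)      # worker blocked, e.g. on a write that never completes
print(check(handle))   # past grace: reported unhealthy, process keeps running
clock.advance(140)     # still blocked, 160 seconds idle in total
try:
    check(handle)
except RuntimeError as err:
    print(err)         # past suicide_grace: the checker aborts
```

In this model a thread that blocks on I/O for longer than the suicide threshold never calls touch(), which matches the pattern in the log of a "timed out after 15" line followed by "suicide timed out after 150" for the same thread.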