Bug #36172

open

osd: hit suicide timeout

Added by Bernd Hennig over 5 years ago. Updated over 5 years ago.

Status: New
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: ceph-disk
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

ceph version 0.94.9-9.el7cp

An OSD drive died a few days ago, and after a restart today it died again with the same error:

= osd.115 10.24.53.152:6807/15666 566 ==== osd_ping(you_died e15168 stamp 2018-09-25 08:55:11.163988) v2 ==== 47+0+0 (923195720 0 0) 0x2ce00f600 con 0x2b95a42c0
-2> 2018-09-25 08:55:12.980462 7fc88a684700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fc7e4492700' had timed out after 15
-1> 2018-09-25 08:55:12.980471 7fc88a684700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fc7e4492700' had suicide timed out after 150
0> 2018-09-25 08:55:12.981683 7fc88a684700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7fc88a684700 time 2018-09-25 08:55:12.980485
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")

ceph version 0.94.9-9.el7cp (b83334e01379f267fb2f9ce729d74a0a8fa1e92c)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xb12ef5]
2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2d9) [0xa46399]
3: (ceph::HeartbeatMap::is_healthy()+0xde) [0xa46c8e]
4: (ceph::HeartbeatMap::check_touch_file()+0x2c) [0xa473ac]
5: (CephContextServiceThread::entry()+0x15b) [0xb2333b]
6: (()+0x7dc5) [0x7fc88de8ddc5]
7: (clone()+0x6d) [0x7fc88c97073d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_replay
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 keyvaluestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 xio
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.18.log
--- end dump of recent events ---
2018-09-25 08:55:13.015129 7fc88a684700 -1 ** Caught signal (Aborted) *
in thread 7fc88a684700

ceph version 0.94.9-9.el7cp (b83334e01379f267fb2f9ce729d74a0a8fa1e92c)
1: /usr/bin/ceph-osd() [0xa0f922]
2: (()+0xf370) [0x7fc88de95370]
3: (gsignal()+0x37) [0x7fc88c8ae1d7]
4: (abort()+0x148) [0x7fc88c8af8c8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fc88d1b2ab5]
6: (()+0x5ea26) [0x7fc88d1b0a26]
7: (()+0x5ea53) [0x7fc88d1b0a53]
8: (()+0x5ec73) [0x7fc88d1b0c73]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xb130ea]
10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2d9) [0xa46399]
11: (ceph::HeartbeatMap::is_healthy()+0xde) [0xa46c8e]
12: (ceph::HeartbeatMap::check_touch_file()+0x2c) [0xa473ac]
13: (CephContextServiceThread::entry()+0x15b) [0xb2333b]
14: (()+0x7dc5) [0x7fc88de8ddc5]
15: (clone()+0x6d) [0x7fc88c97073d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
0> 2018-09-25 08:55:13.015129 7fc88a684700 -1 ** Caught signal (Aborted) *
in thread 7fc88a684700

ceph version 0.94.9-9.el7cp (b83334e01379f267fb2f9ce729d74a0a8fa1e92c)
1: /usr/bin/ceph-osd() [0xa0f922]
2: (()+0xf370) [0x7fc88de95370]
3: (gsignal()+0x37) [0x7fc88c8ae1d7]
4: (abort()+0x148) [0x7fc88c8af8c8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fc88d1b2ab5]
6: (()+0x5ea26) [0x7fc88d1b0a26]
7: (()+0x5ea53) [0x7fc88d1b0a53]
8: (()+0x5ec73) [0x7fc88d1b0c73]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xb130ea]
10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2d9) [0xa46399]
11: (ceph::HeartbeatMap::is_healthy()+0xde) [0xa46c8e]
12: (ceph::HeartbeatMap::check_touch_file()+0x2c) [0xa473ac]
13: (CephContextServiceThread::entry()+0x15b) [0xb2333b]
14: (()+0x7dc5) [0x7fc88de8ddc5]
15: (clone()+0x6d) [0x7fc88c97073d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_replay
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 keyvaluestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 xio
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.18.log
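
For triage, the heartbeat lines in a dump like the one above can be pulled out with a short script. This is only a sketch: the regular expression is modelled on the two `heartbeat_map is_healthy` lines quoted in this report, and the names `HeartbeatEvent` and `heartbeat_events` are made up for illustration, not part of any Ceph tool.

```python
import re
from typing import Iterable, Iterator, NamedTuple


class HeartbeatEvent(NamedTuple):
    thread: str   # e.g. "OSD::osd_op_tp thread 0x7fc7e4492700"
    kind: str     # "grace" or "suicide"
    seconds: int  # the timeout value printed in the log line


# Assumed line shape, taken from the log excerpt in this report:
#   ... heartbeat_map is_healthy '<thread>' had [suicide ]timed out after <N>
_PATTERN = re.compile(
    r"heartbeat_map \w+ '(?P<thread>[^']+)' had "
    r"(?P<kind>suicide timed out|timed out) after (?P<seconds>\d+)"
)


def heartbeat_events(lines: Iterable[str]) -> Iterator[HeartbeatEvent]:
    """Yield one event per heartbeat timeout line; other lines are skipped."""
    for line in lines:
        match = _PATTERN.search(line)
        if match:
            kind = "suicide" if match["kind"].startswith("suicide") else "grace"
            yield HeartbeatEvent(match["thread"], kind, int(match["seconds"]))
```

Running it over the OSD log (for example `heartbeat_events(open("/var/log/ceph/ceph-osd.18.log"))`) would show which worker thread stalled and how often the grace timeout fired before the suicide timeout did.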

Actions #1

Updated by John Spray over 5 years ago

  • Project changed from Ceph to RADOS
  • Subject changed from hit suicide timeout to osd: hit suicide timeout
  • Category deleted (OSD)
Actions #2

Updated by Brad Hubbard over 5 years ago

  • Assignee set to Brad Hubbard

Most likely the OSD can't flush filestore output to the hardware. Can you thoroughly check that the hardware is in perfect working order? BTW, Hammer has been EOL for over a year.
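
For readers unfamiliar with the two numbers in the log (15 and 150), here is a simplified model of a two-tier heartbeat check. It is a sketch of the general idea, not the Ceph source: the names `Handle`, `reset_timeout`, `clear_timeout`, `check`, and `SuicideTimeout` are illustrative, and an exception stands in for the assert that aborts the daemon.

```python
from dataclasses import dataclass
from typing import Optional


class SuicideTimeout(RuntimeError):
    """Stands in for the point where a real daemon would abort."""


@dataclass
class Handle:
    name: str
    grace: float                              # seconds until the thread counts as unhealthy
    suicide_grace: float                      # seconds until the process gives up entirely
    deadline: Optional[float] = None          # None means the thread is idle
    suicide_deadline: Optional[float] = None


def reset_timeout(h: Handle, now: float) -> None:
    """Called by the worker thread when it picks up a work item."""
    h.deadline = now + h.grace
    h.suicide_deadline = now + h.suicide_grace


def clear_timeout(h: Handle) -> None:
    """Called by the worker thread when the work item is finished."""
    h.deadline = None
    h.suicide_deadline = None


def check(h: Handle, now: float) -> bool:
    """Return False once the grace period has passed; raise after the suicide period."""
    healthy = True
    if h.deadline is not None and h.deadline < now:
        healthy = False
    if h.suicide_deadline is not None and h.suicide_deadline < now:
        raise SuicideTimeout(
            f"'{h.name}' had suicide timed out after {h.suicide_grace:g}"
        )
    return healthy
```

In this model, a thread stuck on a single operation (for instance a write that the disk never completes) is first reported unhealthy after 15 seconds and triggers the fatal path after 150 seconds, which matches the sequence of the two log lines in the description.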
