Bug #12084
heartbeat timed out on OSD
Status: Can't reproduce
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Development
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
http://pulpito.ceph.com/ubuntu-2015-06-15_22:35:49-rados-wip-kefu-testing---basic-multi/935414/
2015-06-16T09:15:02.813 INFO:tasks.ceph.osd.1.burnupi10.stderr: -77> 2015-06-16 09:15:00.202331 7fcff14a4700 -1 osd.1 1369 heartbeat_check: no reply from osd.4 since back 2015-06-16 09:14:05.095597 front 2015-06-16 09:14:05.095597 (cutoff 2015-06-16 09:14:40.202330)
2015-06-16T09:15:02.814 INFO:tasks.ceph.osd.1.burnupi10.stderr: -31> 2015-06-16 09:15:01.902685 7fcff14a4700 -1 osd.1 1369 heartbeat_check: no reply from osd.4 since back 2015-06-16 09:14:05.095597 front 2015-06-16 09:14:05.095597 (cutoff 2015-06-16 09:14:41.902684)
2015-06-16T09:15:02.814 INFO:tasks.ceph.osd.1.burnupi10.stderr: 0> 2015-06-16 09:15:02.676661 7fd00f34f700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, time_t)' thread 7fd00f34f700 time 2015-06-16 09:15:02.674911
2015-06-16T09:15:02.815 INFO:tasks.ceph.osd.1.burnupi10.stderr:common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
2015-06-16T09:15:02.815 INFO:tasks.ceph.osd.1.burnupi10.stderr:
2015-06-16T09:15:02.815 INFO:tasks.ceph.osd.1.burnupi10.stderr: ceph version 9.0.1-928-ge381aff (e381aff337ae1e7f2cea4a264bc1d8f46286fe00)
2015-06-16T09:15:02.815 INFO:tasks.ceph.osd.1.burnupi10.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xc45c6b]
2015-06-16T09:15:02.816 INFO:tasks.ceph.osd.1.burnupi10.stderr: 2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, long)+0x2a9) [0xb7f7e9]
2015-06-16T09:15:02.816 INFO:tasks.ceph.osd.1.burnupi10.stderr: 3: (ceph::HeartbeatMap::is_healthy()+0xd6) [0xb80076]
2015-06-16T09:15:02.816 INFO:tasks.ceph.osd.1.burnupi10.stderr: 4: (ceph::HeartbeatMap::check_touch_file()+0x17) [0xb80757]
2015-06-16T09:15:02.816 INFO:tasks.ceph.osd.1.burnupi10.stderr: 5: (CephContextServiceThread::entry()+0x154) [0xc5de64]
2015-06-16T09:15:02.817 INFO:tasks.ceph.osd.1.burnupi10.stderr: 6: (()+0x8182) [0x7fd013ef4182]
2015-06-16T09:15:02.817 INFO:tasks.ceph.osd.1.burnupi10.stderr: 7: (clone()+0x6d) [0x7fd01246038d]
2015-06-16T09:15:02.817 INFO:tasks.ceph.osd.1.burnupi10.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2015-06-16T09:15:02.817 INFO:tasks.ceph.osd.1.burnupi10.stderr:
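For reference, the arithmetic behind the first heartbeat_check line in the log can be reproduced as below. This is only a sketch: the 20 second grace is an assumption inferred from the gap between the log timestamp and the printed cutoff (it matches the usual default of osd_heartbeat_grace), and the helper name `is_stale` is hypothetical.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S.%f"

# Assumed grace period; inferred from the log, not read from the test config.
HEARTBEAT_GRACE = timedelta(seconds=20)

def is_stale(last_reply: datetime, now: datetime,
             grace: timedelta = HEARTBEAT_GRACE) -> bool:
    """Return True if the peer's last reply is older than now - grace."""
    cutoff = now - grace
    return last_reply < cutoff

# Values copied from the first heartbeat_check line (osd.1 about osd.4).
now = datetime.strptime("2015-06-16 09:15:00.202331", FMT)
last_reply = datetime.strptime("2015-06-16 09:14:05.095597", FMT)

cutoff = now - HEARTBEAT_GRACE   # the log prints this value, off by 1 microsecond
silence = now - last_reply       # how long osd.4 had been silent

print("cutoff :", cutoff)
print("silence:", silence.total_seconds(), "s")
print("stale  :", is_stale(last_reply, now))
```

Under that assumption, osd.4 had not replied to osd.1 for roughly 55 seconds when the line was logged, well past the cutoff.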
The I/O exhausted the worker threads of OSDService.op_wq, and some deadlock prevented OSD::ShardedOpWQ::_process() from resetting the heartbeat timeout.
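To make the failure mode concrete, here is a toy model of the mechanism described above: a worker arms a grace deadline and a suicide deadline each time it makes progress, and a separate checker aborts once the suicide deadline has passed. It is a simplified sketch, not the code in common/HeartbeatMap.cc; the class and function names and the timeout values (15 s and 150 s) are illustrative assumptions.

```python
from dataclasses import dataclass

class SuicideTimeout(Exception):
    """Stands in for the 'hit suicide timeout' assert in this toy model."""

@dataclass
class HeartbeatHandle:
    name: str
    timeout: float = 0.0          # grace deadline; 0 means not armed
    suicide_timeout: float = 0.0  # hard deadline; 0 means not armed

def reset_timeout(h: HeartbeatHandle, now: float,
                  grace: float, suicide_grace: float) -> None:
    """Called by the worker whenever it makes progress."""
    h.timeout = now + grace
    h.suicide_timeout = now + suicide_grace

def check(h: HeartbeatHandle, now: float) -> bool:
    """Called periodically by the checker thread.

    Returns False once the grace deadline has passed and raises
    SuicideTimeout once the suicide deadline has passed.
    """
    healthy = True
    if h.timeout and h.timeout < now:
        healthy = False
    if h.suicide_timeout and h.suicide_timeout < now:
        raise SuicideTimeout(f"{h.name}: hit suicide timeout")
    return healthy

# A worker arms its deadlines at t=0 and then blocks, so it never resets them.
worker = HeartbeatHandle("op_wq worker (hypothetical)")
reset_timeout(worker, now=0.0, grace=15.0, suicide_grace=150.0)

print(check(worker, now=10.0))   # still inside the grace period
print(check(worker, now=20.0))   # grace exceeded, reported unhealthy
try:
    check(worker, now=200.0)     # suicide deadline exceeded
except SuicideTimeout as exc:
    print("abort:", exc)
```

The model does not distinguish a deadlocked worker from a merely slow one: either way the deadlines are not reset in time, so the checker alone cannot tell which of the two happened.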
History
#1 Updated by Greg Farnum almost 9 years ago
Are you sure it was a deadlock and not just slowness? If a deadlock, it was probably (hopefully) something in the PRs you were testing...?
#2 Updated by Samuel Just over 8 years ago
- Status changed from New to Can't reproduce