Bug #12084

heartbeat timed out on OSD

Added by Kefu Chai almost 9 years ago. Updated over 8 years ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Development
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

http://pulpito.ceph.com/ubuntu-2015-06-15_22:35:49-rados-wip-kefu-testing---basic-multi/935414/

2015-06-16T09:15:02.813 INFO:tasks.ceph.osd.1.burnupi10.stderr:   -77> 2015-06-16 09:15:00.202331 7fcff14a4700 -1 osd.1 1369 heartbeat_check: no reply from osd.4 since back 2015-06-16 09:14:05.095597 front 2015-06-16 09:14:05.095597 (cutoff 2015-06-16 09:14:40.202330)
2015-06-16T09:15:02.814 INFO:tasks.ceph.osd.1.burnupi10.stderr:   -31> 2015-06-16 09:15:01.902685 7fcff14a4700 -1 osd.1 1369 heartbeat_check: no reply from osd.4 since back 2015-06-16 09:14:05.095597 front 2015-06-16 09:14:05.095597 (cutoff 2015-06-16 09:14:41.902684)
2015-06-16T09:15:02.814 INFO:tasks.ceph.osd.1.burnupi10.stderr:     0> 2015-06-16 09:15:02.676661 7fd00f34f700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, time_t)' thread 7fd00f34f700 time 2015-06-16 09:15:02.674911
2015-06-16T09:15:02.815 INFO:tasks.ceph.osd.1.burnupi10.stderr:common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
2015-06-16T09:15:02.815 INFO:tasks.ceph.osd.1.burnupi10.stderr:
2015-06-16T09:15:02.815 INFO:tasks.ceph.osd.1.burnupi10.stderr: ceph version 9.0.1-928-ge381aff (e381aff337ae1e7f2cea4a264bc1d8f46286fe00)
2015-06-16T09:15:02.815 INFO:tasks.ceph.osd.1.burnupi10.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xc45c6b]
2015-06-16T09:15:02.816 INFO:tasks.ceph.osd.1.burnupi10.stderr: 2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, long)+0x2a9) [0xb7f7e9]
2015-06-16T09:15:02.816 INFO:tasks.ceph.osd.1.burnupi10.stderr: 3: (ceph::HeartbeatMap::is_healthy()+0xd6) [0xb80076]
2015-06-16T09:15:02.816 INFO:tasks.ceph.osd.1.burnupi10.stderr: 4: (ceph::HeartbeatMap::check_touch_file()+0x17) [0xb80757]
2015-06-16T09:15:02.816 INFO:tasks.ceph.osd.1.burnupi10.stderr: 5: (CephContextServiceThread::entry()+0x154) [0xc5de64]
2015-06-16T09:15:02.817 INFO:tasks.ceph.osd.1.burnupi10.stderr: 6: (()+0x8182) [0x7fd013ef4182]
2015-06-16T09:15:02.817 INFO:tasks.ceph.osd.1.burnupi10.stderr: 7: (clone()+0x6d) [0x7fd01246038d]
2015-06-16T09:15:02.817 INFO:tasks.ceph.osd.1.burnupi10.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2015-06-16T09:15:02.817 INFO:tasks.ceph.osd.1.burnupi10.stderr:
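
For readers of the heartbeat_check lines above: the cutoff printed in each line appears to be the check time minus a fixed grace period. A rough sketch of that arithmetic, using only timestamps copied from the first heartbeat_check line, is below. The variable names are illustrative, and reading the 20 second difference as the OSD heartbeat grace setting is an assumption, not something the log states.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

def parse(ts):
    """Parse a timestamp in the format used by the OSD log lines."""
    return datetime.strptime(ts, FMT)

# Values copied from the first heartbeat_check line in the log.
check_time = parse("2015-06-16 09:15:00.202331")  # when osd.1 ran the check
cutoff = parse("2015-06-16 09:14:40.202330")      # "(cutoff ...)" in the log
last_reply = parse("2015-06-16 09:14:05.095597")  # "since back ... front ..."

# The grace period implied by the log (assumed to be the heartbeat grace).
grace = check_time - cutoff

# How long osd.4 had been silent when the check ran.
silence = check_time - last_reply

# A peer is reported when its last reply is older than the cutoff.
overdue = last_reply < cutoff

print(f"implied grace: {grace.total_seconds():.3f} s")
print(f"silence from osd.4: {silence.total_seconds():.3f} s")
print(f"osd.4 past cutoff: {overdue}")
```

By this reading, osd.4 had been silent for about 55 seconds against a grace of about 20 seconds when the first message was logged.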

The I/O exhausted the worker threads serving OSDService.op_wq, and some deadlock prevented OSD::ShardedOpWQ::_process() from resetting the heartbeat timeout.
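
To illustrate the failure mode described here, the following is a simplified model of a heartbeat watchdog, not Ceph's actual HeartbeatMap code. Each worker is expected to re-arm its deadlines before it processes work; a separate checker reports the worker as unhealthy once the soft deadline passes and aborts once the suicide deadline passes. The method names loosely mirror those in the backtrace, the explicit `now` argument is a simplification for determinism, and the grace values of 15 and 150 seconds are purely illustrative.

```python
class HeartbeatHandle:
    """Per-worker deadlines; 0.0 means the deadline is not armed."""
    def __init__(self, name):
        self.name = name
        self.timeout = 0.0
        self.suicide_timeout = 0.0


class HeartbeatMap:
    """Toy watchdog: workers re-arm deadlines, a checker inspects them."""
    def __init__(self):
        self.handles = []

    def add_worker(self, name):
        handle = HeartbeatHandle(name)
        self.handles.append(handle)
        return handle

    def reset_timeout(self, handle, now, grace, suicide_grace):
        # Called by a worker each time it picks up an item of work.
        handle.timeout = now + grace
        handle.suicide_timeout = now + suicide_grace

    def clear_timeout(self, handle):
        # Called by a worker when it goes idle.
        handle.timeout = 0.0
        handle.suicide_timeout = 0.0

    def is_healthy(self, now):
        healthy = True
        for handle in self.handles:
            if handle.suicide_timeout and now > handle.suicide_timeout:
                # Models the FAILED assert in the log above.
                raise AssertionError("hit suicide timeout")
            if handle.timeout and now > handle.timeout:
                healthy = False
        return healthy


hb = HeartbeatMap()
worker = hb.add_worker("op_wq worker")

# The worker arms its deadlines at t=0 and then never returns to re-arm them,
# which is what a deadlock inside the processing function would look like.
hb.reset_timeout(worker, now=0.0, grace=15.0, suicide_grace=150.0)

status_ok = hb.is_healthy(now=10.0)    # within the soft deadline
status_slow = hb.is_healthy(now=20.0)  # past soft deadline, before suicide
try:
    hb.is_healthy(now=151.0)           # past the suicide deadline
    crashed = False
except AssertionError:
    crashed = True
```

In this model a merely slow worker would show up as unhealthy for a while and then recover once it re-arms its deadlines, whereas a deadlocked worker never re-arms them and eventually trips the suicide check.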

History

#1 Updated by Greg Farnum almost 9 years ago

Are you sure it was a deadlock and not just slowness? If a deadlock, it was probably (hopefully) something in the PRs you were testing...?

#2 Updated by Samuel Just over 8 years ago

  • Status changed from New to Can't reproduce
