Bug #14341

closed

OSD crashed due to suicide timeout

Added by cory gu over 8 years ago. Updated over 8 years ago.

Status:
Rejected
Priority:
High
Assignee:
-
Category:
OSD
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We deployed Hammer 0.94.5 in our production environment.
Each host runs 9 OSD instances.
One OSD on one host occasionally crashed with the following stack information:

-2> 2016-01-08 23:52:52.863213 7f1d1b3ff700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1cc55fe700' had timed out after 15
-1> 2016-01-08 23:52:52.863227 7f1d1b3ff700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1cc55fe700' had suicide timed out after 150
0> 2016-01-08 23:52:52.942734 7f1d1b3ff700 -1 error_msg common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f1d1b3ff700 time 2016-01-08 23:52:52.863237
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
ceph version 0.94.5-12-g83f56a1 (83f56a1c84e3dbd95a4c394335a7b1dc926dd1c4)
1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x305) [0xa5f215]
2: (ceph::HeartbeatMap::is_healthy()+0xbf) [0xa5f4ff]
3: (ceph::HeartbeatMap::check_touch_file()+0x28) [0xa5faf8]
4: (CephContextServiceThread::entry()+0x136) [0xa59c26]
5: /lib64/libpthread.so.0() [0x3f1f807a51]
6: (clone()+0x6d) [0x3f1f4e893d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
0/ 5 none
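For context on the stack above: each OSD worker thread periodically "touches" a heartbeat handle, and an internal watchdog (ceph::HeartbeatMap) compares the time since the last touch against a grace period and a larger suicide grace; the 15 and 150 seconds in the log correspond to the osd_op_thread_timeout and osd_op_thread_suicide_timeout settings. Once the suicide grace is exceeded, the assert at HeartbeatMap.cc:79 fires and the daemon aborts on purpose, on the theory that a thread stuck that long is unrecoverable. The C++ below is only a minimal, hypothetical sketch of this watchdog pattern, not the actual Ceph code; the HeartbeatHandle/check names and the demo timing are made up for illustration.

// Minimal sketch of a heartbeat/suicide-timeout watchdog, loosely modeled on
// the behaviour seen in the log above. Not Ceph source; names are illustrative.
#include <atomic>
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <thread>

using Clock = std::chrono::steady_clock;

struct HeartbeatHandle {
    std::atomic<Clock::rep> last_touch{Clock::now().time_since_epoch().count()};
    std::chrono::seconds grace{15};          // analogous to osd_op_thread_timeout
    std::chrono::seconds suicide_grace{150}; // analogous to osd_op_thread_suicide_timeout

    // A healthy worker thread calls this regularly while making progress.
    void touch() { last_touch.store(Clock::now().time_since_epoch().count()); }
};

// Returns false once the plain grace is exceeded; aborts the whole process when
// the suicide grace is exceeded (the "FAILED assert(0 == \"hit suicide timeout\")").
bool check(const HeartbeatHandle& h, const char* who) {
    auto last = Clock::time_point(Clock::duration(h.last_touch.load()));
    auto idle = std::chrono::duration_cast<std::chrono::seconds>(Clock::now() - last);
    if (idle >= h.suicide_grace) {
        std::cerr << who << " had suicide timed out after " << h.suicide_grace.count() << "\n";
        std::abort();  // the real code asserts, dumping recent events and a backtrace
    }
    if (idle >= h.grace) {
        std::cerr << who << " had timed out after " << h.grace.count() << "\n";
        return false;
    }
    return true;
}

int main() {
    HeartbeatHandle h;
    // Nothing ever calls h.touch() here, so this demo reproduces the failure mode:
    // warnings after the grace, then an abort once the suicide grace is exceeded.
    std::thread watchdog([&] {
        for (;;) {
            check(h, "OSD::osd_op_tp thread");
            std::this_thread::sleep_for(std::chrono::seconds(1));
        }
    });
    watchdog.join();
}

In practice an op thread stops touching its handle when it is blocked, for example on a very slow disk, a stuck filesystem, or a lock held elsewhere, which is why a suicide timeout usually points at whatever the thread was waiting on rather than at the heartbeat code itself.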

There are 10000 lines of recent event logs dumped in the osd.log file.
It seems the first error in the log starts here:
-5901> 2016-01-08 23:50:28.484204 7f1c606eb700 0 -- 10.139.204.44:6804/9266 >> 106.120.163.109:0/1024680 pipe(0x7f1d01688000 sd=160 :6804 s=0 pgs=0 cs=0 l=1 c=0x7f1d014a6800).accept replacing existing (lossy) channel (new one lossy=1)
-5900> 2016-01-08 23:50:28.484278 7f1cde7ff700 1 osd.48 432 ms_handle_reset con 0x7f1c93417d00 session 0x7f1cbfc0e300
-5899> 2016-01-08 23:50:28.484280 7f1c757fb700 2 -- 10.139.204.44:6804/9266 >> 106.120.163.109:0/1024680 pipe(0x7f1c934be000 sd=4144 :6804 s=4 pgs=6 cs=0 l=1 c=0x7f1c93417d00).reader couldn't read tag, (11) Resource temporarily unavailable

then the timed-out error log:
-5099> 2016-01-08 23:50:34.087690 7f1cd79f4700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1cc55fe700' had timed out after 15

The whole log file is attached.

After manually restarting the troubled OSD daemon, things went back to normal.
We need to know what caused the OSD crash.

Actions #1

Updated by cory gu over 8 years ago

I tried to upload the log file; however, it always failed. If anyone needs to review the whole log file, just let me know and I will share it with you.

Actions #2

Updated by Sage Weil over 8 years ago

  • Status changed from New to Rejected

It sounds like that one OSD has a slow or bad disk. Check dmesg for errors.

Actions #3

Updated by cory gu over 8 years ago

We also thought it should be a bad or slow disk issue. However, we checked dmesg and didn't find any disk-related errors. Are there any other reasons that could cause an osd_op_tp timeout?
