Bug #14341
OSD crashed due to suicide timeout
Description
We deployed Hammer 0.94.5 in our production environment.
Each host runs 9 OSD instances.
One OSD on one host occasionally crashes with the following stack trace:
-2> 2016-01-08 23:52:52.863213 7f1d1b3ff700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1cc55fe700' had timed out after 15
-1> 2016-01-08 23:52:52.863227 7f1d1b3ff700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1cc55fe700' had suicide timed out after 150
0> 2016-01-08 23:52:52.942734 7f1d1b3ff700 -1 error_msg common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f1d1b3ff700 time 2016-01-08 23:52:52.863237
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
ceph version 0.94.5-12-g83f56a1 (83f56a1c84e3dbd95a4c394335a7b1dc926dd1c4)
1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x305) [0xa5f215]
2: (ceph::HeartbeatMap::is_healthy()+0xbf) [0xa5f4ff]
3: (ceph::HeartbeatMap::check_touch_file()+0x28) [0xa5faf8]
4: (CephContextServiceThread::entry()+0x136) [0xa59c26]
5: /lib64/libpthread.so.0() [0x3f1f807a51]
6: (clone()+0x6d) [0x3f1f4e893d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
0/ 5 none
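The two heartbeat messages in the trace above (grace 15, suicide grace 150) can be read as two deadlines on the same worker thread. The following is a minimal sketch of that idea, not Ceph's actual code: the class name `HeartbeatHandle` and the functions `touch` and `check` are hypothetical, and the model assumes both deadlines are re-armed each time the worker makes progress.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatHandle:
    """Simplified, hypothetical stand-in for a worker thread's heartbeat handle."""
    name: str
    grace: int                    # seconds before the thread is reported unhealthy
    suicide_grace: int            # seconds before the daemon aborts itself
    timeout: float = 0.0          # absolute deadline; 0 means "not armed"
    suicide_timeout: float = 0.0  # absolute deadline; 0 means "not armed"

    def touch(self, now: float) -> None:
        """Re-arm both deadlines; the worker calls this when it makes progress."""
        self.timeout = now + self.grace
        self.suicide_timeout = now + self.suicide_grace


def check(handle: HeartbeatHandle, now: float) -> bool:
    """Return False once the grace deadline has passed.

    Raises AssertionError once the suicide deadline has passed, mirroring the
    FAILED assert seen in the trace.
    """
    healthy = True
    if handle.timeout and handle.timeout < now:
        print(f"'{handle.name}' had timed out after {handle.grace}")
        healthy = False
    if handle.suicide_timeout and handle.suicide_timeout < now:
        print(f"'{handle.name}' had suicide timed out after {handle.suicide_grace}")
        raise AssertionError("hit suicide timeout")
    return healthy


if __name__ == "__main__":
    h = HeartbeatHandle("OSD::osd_op_tp thread", grace=15, suicide_grace=150)
    h.touch(now=0)       # last sign of progress at t=0
    check(h, now=10)     # still healthy
    check(h, now=16)     # unhealthy, but the daemon keeps running
    # check(h, now=151)  # would raise AssertionError("hit suicide timeout")
```

Under this model, the crash means the op thread made no progress at all for more than 150 seconds, which is why the question is what it was blocked on rather than why the assert fired.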
The osd.log file contains a dump of the 10000 most recent log events.
The first error in the log appears to start here:
-5901> 2016-01-08 23:50:28.484204 7f1c606eb700 0 - 10.139.204.44:6804/9266 >> 106.120.163.109:0/1024680 pipe(0x7f1d01688000 sd=160 :6804 s=0 pgs=0 cs=0 l=1 c=0x7f1d014a6800).accept replacing existing (lossy) channel (new one lossy=1)
-5900> 2016-01-08 23:50:28.484278 7f1cde7ff700 1 osd.48 432 ms_handle_reset con 0x7f1c93417d00 session 0x7f1cbfc0e300
-5899> 2016-01-08 23:50:28.484280 7f1c757fb700 2 - 10.139.204.44:6804/9266 >> 106.120.163.109:0/1024680 pipe(0x7f1c934be000 sd=4144 :6804 s=4 pgs=6 cs=0 l=1 c=0x7f1c93417d00).reader couldn't read tag, (11) Resource temporarily unavailable
Then the first timeout error appears:
-5099> 2016-01-08 23:50:34.087690 7f1cd79f4700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1cc55fe700' had timed out after 15
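Finding the first heartbeat timeout in a 10000-line dump by hand is tedious, so a small script can help. The following is a rough sketch: the regular expression is inferred only from the log lines quoted in this report, and the function names `heartbeat_events` and `first_timeout_per_thread` are hypothetical.

```python
import re
from datetime import datetime

# Inferred from the lines quoted in this report, e.g.:
# -5099> 2016-01-08 23:50:34.087690 7f1cd79f4700 1 heartbeat_map is_healthy
#   'OSD::osd_op_tp thread 0x7f1cc55fe700' had timed out after 15
HEARTBEAT_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) \S+\s+-?\d+ "
    r"heartbeat_map is_healthy '(?P<who>[^']+)' "
    r"had (?P<kind>suicide timed out|timed out) after (?P<grace>\d+)"
)


def heartbeat_events(lines):
    """Yield (timestamp, thread, kind, grace) for each heartbeat_map message."""
    for line in lines:
        m = HEARTBEAT_RE.search(line)
        if m:
            ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S.%f")
            yield ts, m["who"], m["kind"], int(m["grace"])


def first_timeout_per_thread(lines):
    """Map each reported thread to its first heartbeat message, in file order."""
    first = {}
    for ts, who, kind, grace in heartbeat_events(lines):
        first.setdefault(who, (ts, kind, grace))
    return first


if __name__ == "__main__":
    # "osd.log" is the file name used in this report; adjust the path as needed.
    with open("osd.log", errors="replace") as f:
        for who, (ts, kind, grace) in first_timeout_per_thread(f).items():
            print(f"{ts}  {who}: {kind} after {grace}")
```

Once the first timeout for the stuck thread is known, the log entries from that same thread ID just before that timestamp are the natural place to look for the operation it was blocked on.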
The whole log file is attached.
After manually restarting the affected OSD daemon, things returned to normal.
We need to know what caused the OSD crash.
Updated by cory gu over 8 years ago
I tried to upload the log file, but the upload always failed. If anyone needs to review the whole log file, just let me know and I will share it.
Updated by Sage Weil over 8 years ago
- Status changed from New to Rejected
It sounds like that one OSD has a slow or bad disk. Check dmesg for errors.
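As a rough illustration of the dmesg check, the sketch below filters kernel log text for a few messages that commonly accompany disk trouble. The pattern list is an assumption and far from exhaustive, and the function name `suspicious_kernel_lines` is hypothetical. A disk that is merely slow may not log anything at all.

```python
import re
import subprocess

# Assumed, non-exhaustive patterns: block-layer I/O errors, SCSI medium errors,
# and the kernel's hung-task warning ("blocked for more than N seconds").
DISK_ERROR_RE = re.compile(
    r"I/O error|medium error|blocked for more than \d+ seconds",
    re.IGNORECASE,
)


def suspicious_kernel_lines(dmesg_text):
    """Return the kernel log lines that match any of the assumed patterns."""
    return [line for line in dmesg_text.splitlines() if DISK_ERROR_RE.search(line)]


if __name__ == "__main__":
    # Reading the kernel ring buffer may require root on some systems.
    out = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
    for line in suspicious_kernel_lines(out):
        print(line)
```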
Updated by cory gu over 8 years ago
We also thought it might be a bad or slow disk. However, we checked dmesg and didn't find any disk-related errors. Are there any other reasons that could cause an osd_op_tp timeout?