Bug #12328
Ceph KV OSD crashes sometimes with FAILED assert(0 == "hit suicide timeout") on 0.94.1
Status:
Won't Fix
Priority:
Normal
Assignee:
-
Category:
OSD
Target version:
-
% Done:
0%
Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
When rsyncing data to CephFS, sometimes one or more OSDs crash.
Log file content from such an OSD:
 -12> 2015-07-14 13:29:04.693090 7f6743db0700  0 log_channel(cluster) log [WRN] : slow request 142.985014 seconds old, received at 2015-07-14 13:26:41.707961: osd_repop(osd.1.0:474288 1.20 99cd4c20/10000017a0f.00000000/head//1 v 122'5460) currently no flag points reached
 -11> 2015-07-14 13:29:04.693093 7f6743db0700  0 log_channel(cluster) log [WRN] : slow request 142.833813 seconds old, received at 2015-07-14 13:26:41.859161: osd_repop(osd.33.0:493429 1.63 78aa6c63/1000001ee19.00000000/head//1 v 122'5731) currently started
 -10> 2015-07-14 13:29:04.930965 7f6736bfc700  1 -- 10.143.8.181:6803/19927 <== osd.19 10.143.8.181:0/21497 105431 ==== osd_ping(ping e122 stamp 2015-07-14 13:29:04.930745) v2 ==== 47+0+0 (1630228558 0 0) 0x6c47000 con 0x57eef60
  -9> 2015-07-14 13:29:04.930960 7f67383ff700  1 -- 10.141.8.181:6803/19927 <== osd.19 10.143.8.181:0/21497 105431 ==== osd_ping(ping e122 stamp 2015-07-14 13:29:04.930745) v2 ==== 47+0+0 (1630228558 0 0) 0xdf34400 con 0x57ec200
  -8> 2015-07-14 13:29:04.931016 7f6736bfc700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f672fbee700' had timed out after 15
  -7> 2015-07-14 13:29:04.931020 7f6736bfc700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f67313f1700' had timed out after 15
  -6> 2015-07-14 13:29:04.931033 7f67383ff700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f672fbee700' had timed out after 15
  -5> 2015-07-14 13:29:04.931037 7f67383ff700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f67313f1700' had timed out after 15
  -4> 2015-07-14 13:29:05.205493 7f67383ff700  1 -- 10.141.8.181:6803/19927 <== osd.18 10.143.8.181:0/20704 105272 ==== osd_ping(ping e122 stamp 2015-07-14 13:29:05.205277) v2 ==== 47+0+0 (628769321 0 0) 0x7283400 con 0x57e82c0
  -3> 2015-07-14 13:29:05.205508 7f6736bfc700  1 -- 10.143.8.181:6803/19927 <== osd.18 10.143.8.181:0/20704 105272 ==== osd_ping(ping e122 stamp 2015-07-14 13:29:05.205277) v2 ==== 47+0+0 (628769321 0 0) 0x7ce9a00 con 0x57e8160
  -2> 2015-07-14 13:29:05.205522 7f67383ff700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f672fbee700' had timed out after 15
  -1> 2015-07-14 13:29:05.205526 7f67383ff700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f672fbee700' had suicide timed out after 150
   0> 2015-07-14 13:29:05.209756 7f67383ff700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f67383ff700 time 2015-07-14 13:29:05.205531
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")

 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xbc51f5]
 2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2d9) [0xafaff9]
 3: (ceph::HeartbeatMap::is_healthy()+0xde) [0xafb8ee]
 4: (OSD::handle_osd_ping(MOSDPing*)+0x733) [0x695f13]
 5: (OSD::heartbeat_dispatch(Message*)+0x2fb) [0x69718b]
 6: (DispatchQueue::entry()+0x62a) [0xc7cc4a]
 7: (DispatchQueue::DispatchThread::entry()+0xd) [0xba403d]
 8: (()+0x7df5) [0x7f674c587df5]
 9: (clone()+0x6d) [0x7f674b04e1ad]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
...

2015-07-14 13:29:05.271802 7f67383ff700 -1 *** Caught signal (Aborted) **
 in thread 7f67383ff700

 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: /usr/bin/ceph-osd() [0xac51f2]
 2: (()+0xf130) [0x7f674c58f130]
 3: (gsignal()+0x37) [0x7f674af8d5d7]
 4: (abort()+0x148) [0x7f674af8ecc8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f674b8a09b5]
 6: (()+0x5e926) [0x7f674b89e926]
 7: (()+0x5e953) [0x7f674b89e953]
 8: (()+0x5eb73) [0x7f674b89eb73]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xbc53ea]
 10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2d9) [0xafaff9]
 11: (ceph::HeartbeatMap::is_healthy()+0xde) [0xafb8ee]
 12: (OSD::handle_osd_ping(MOSDPing*)+0x733) [0x695f13]
 13: (OSD::heartbeat_dispatch(Message*)+0x2fb) [0x69718b]
 14: (DispatchQueue::entry()+0x62a) [0xc7cc4a]
 15: (DispatchQueue::DispatchThread::entry()+0xd) [0xba403d]
 16: (()+0x7df5) [0x7f674c587df5]
 17: (clone()+0x6d) [0x7f674b04e1ad]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
   0> 2015-07-14 13:29:05.271802 7f67383ff700 -1 *** Caught signal (Aborted) **
 in thread 7f67383ff700

 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: /usr/bin/ceph-osd() [0xac51f2]
 2: (()+0xf130) [0x7f674c58f130]
 3: (gsignal()+0x37) [0x7f674af8d5d7]
 4: (abort()+0x148) [0x7f674af8ecc8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f674b8a09b5]
 6: (()+0x5e926) [0x7f674b89e926]
 7: (()+0x5e953) [0x7f674b89e953]
 8: (()+0x5eb73) [0x7f674b89eb73]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xbc53ea]
 10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2d9) [0xafaff9]
 11: (ceph::HeartbeatMap::is_healthy()+0xde) [0xafb8ee]
 12: (OSD::handle_osd_ping(MOSDPing*)+0x733) [0x695f13]
 13: (OSD::heartbeat_dispatch(Message*)+0x2fb) [0x69718b]
 14: (DispatchQueue::entry()+0x62a) [0xc7cc4a]
 15: (DispatchQueue::DispatchThread::entry()+0xd) [0xba403d]
 16: (()+0x7df5) [0x7f674c587df5]
 17: (clone()+0x6d) [0x7f674b04e1ad]
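For context on what the log shows: each OSD worker thread periodically "touches" a heartbeat handle, and the heartbeat map compares the time since the last touch against a warning grace (15 s here) and a suicide grace (150 s here); exceeding the latter is what fires the FAILED assert and aborts the daemon. A minimal sketch of that watchdog logic (illustrative Python, not Ceph's actual C++ code; all names here are hypothetical):

```python
import time

class HeartbeatHandle:
    """Tracks the last time a worker thread reported progress."""
    def __init__(self, name, grace, suicide_grace):
        self.name = name
        self.grace = grace                  # warn after this many seconds
        self.suicide_grace = suicide_grace  # abort after this many seconds
        self.last_touch = time.time()

    def touch(self):
        """Called by the worker thread whenever it makes progress."""
        self.last_touch = time.time()

def check(handle, now):
    """Mirror of the two thresholds seen in the log (15 s and 150 s)."""
    age = now - handle.last_touch
    if age > handle.suicide_grace:
        # corresponds to: FAILED assert(0 == "hit suicide timeout")
        raise AssertionError("hit suicide timeout")
    if age > handle.grace:
        # corresponds to: "... had timed out after 15"
        return "had timed out after %d" % handle.grace
    return "healthy"
```

In other words, the assert is not the bug itself: it is the watchdog reacting to the osd_op_tp thread making no progress for 150 s, which matches the 140+ s slow requests earlier in the log.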
The OSD can be restarted normally afterwards.
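For what it's worth, the two thresholds in the log correspond to the OSD options below (values shown are the defaults in this era of Ceph); raising them can keep the daemon alive while the underlying slowness is investigated, but it only masks the symptom:

```ini
[osd]
osd op thread timeout = 15           ; warning threshold seen in the log
osd op thread suicide timeout = 150  ; suicide threshold seen in the log
```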