Bug #12328

closed

Ceph KV OSD crashes sometimes with FAILED assert(0 == "hit suicide timeout") on 0.94.1

Added by Kenneth Waegeman almost 9 years ago. Updated about 7 years ago.

Status:
Won't Fix
Priority:
Normal
Assignee:
-
Category:
OSD
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When rsyncing data to CephFS, one or more OSDs sometimes crash.
Log file contents from one such OSD:

   -12> 2015-07-14 13:29:04.693090 7f6743db0700  0 log_channel(cluster) log [WRN] : slow request 142.985014 seconds old, received at 2015-07-14 13:26:41.707961: osd_repop(osd.1.0:474288 1.20 99cd4c20/10000017a0f.00000000/head//1 v 122'5460) currently no flag points reached
   -11> 2015-07-14 13:29:04.693093 7f6743db0700  0 log_channel(cluster) log [WRN] : slow request 142.833813 seconds old, received at 2015-07-14 13:26:41.859161: osd_repop(osd.33.0:493429 1.63 78aa6c63/1000001ee19.00000000/head//1 v 122'5731) currently started
   -10> 2015-07-14 13:29:04.930965 7f6736bfc700  1 -- 10.143.8.181:6803/19927 <== osd.19 10.143.8.181:0/21497 105431 ==== osd_ping(ping e122 stamp 2015-07-14 13:29:04.930745) v2 ==== 47+0+0 (1630228558 0 0) 0x6c47000 con 0x57eef60
    -9> 2015-07-14 13:29:04.930960 7f67383ff700  1 -- 10.141.8.181:6803/19927 <== osd.19 10.143.8.181:0/21497 105431 ==== osd_ping(ping e122 stamp 2015-07-14 13:29:04.930745) v2 ==== 47+0+0 (1630228558 0 0) 0xdf34400 con 0x57ec200
    -8> 2015-07-14 13:29:04.931016 7f6736bfc700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f672fbee700' had timed out after 15
    -7> 2015-07-14 13:29:04.931020 7f6736bfc700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f67313f1700' had timed out after 15
    -6> 2015-07-14 13:29:04.931033 7f67383ff700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f672fbee700' had timed out after 15
    -5> 2015-07-14 13:29:04.931037 7f67383ff700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f67313f1700' had timed out after 15
    -4> 2015-07-14 13:29:05.205493 7f67383ff700  1 -- 10.141.8.181:6803/19927 <== osd.18 10.143.8.181:0/20704 105272 ==== osd_ping(ping e122 stamp 2015-07-14 13:29:05.205277) v2 ==== 47+0+0 (628769321 0 0) 0x7283400 con 0x57e82c0
    -3> 2015-07-14 13:29:05.205508 7f6736bfc700  1 -- 10.143.8.181:6803/19927 <== osd.18 10.143.8.181:0/20704 105272 ==== osd_ping(ping e122 stamp 2015-07-14 13:29:05.205277) v2 ==== 47+0+0 (628769321 0 0) 0x7ce9a00 con 0x57e8160
    -2> 2015-07-14 13:29:05.205522 7f67383ff700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f672fbee700' had timed out after 15
    -1> 2015-07-14 13:29:05.205526 7f67383ff700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f672fbee700' had suicide timed out after 150
     0> 2015-07-14 13:29:05.209756 7f67383ff700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f67383ff700 time 2015-07-14 13:29:05.205531
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")

 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xbc51f5]
 2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2d9) [0xafaff9]
 3: (ceph::HeartbeatMap::is_healthy()+0xde) [0xafb8ee]
 4: (OSD::handle_osd_ping(MOSDPing*)+0x733) [0x695f13]
 5: (OSD::heartbeat_dispatch(Message*)+0x2fb) [0x69718b]
 6: (DispatchQueue::entry()+0x62a) [0xc7cc4a]
 7: (DispatchQueue::DispatchThread::entry()+0xd) [0xba403d]
 8: (()+0x7df5) [0x7f674c587df5]
 9: (clone()+0x6d) [0x7f674b04e1ad]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

...
2015-07-14 13:29:05.271802 7f67383ff700 -1 *** Caught signal (Aborted) **
 in thread 7f67383ff700

 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: /usr/bin/ceph-osd() [0xac51f2]
 2: (()+0xf130) [0x7f674c58f130]
 3: (gsignal()+0x37) [0x7f674af8d5d7]
 4: (abort()+0x148) [0x7f674af8ecc8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f674b8a09b5]
 6: (()+0x5e926) [0x7f674b89e926]
 7: (()+0x5e953) [0x7f674b89e953]
 8: (()+0x5eb73) [0x7f674b89eb73]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xbc53ea]
 10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2d9) [0xafaff9]
 11: (ceph::HeartbeatMap::is_healthy()+0xde) [0xafb8ee]
 12: (OSD::handle_osd_ping(MOSDPing*)+0x733) [0x695f13]
 13: (OSD::heartbeat_dispatch(Message*)+0x2fb) [0x69718b]
 14: (DispatchQueue::entry()+0x62a) [0xc7cc4a]
 15: (DispatchQueue::DispatchThread::entry()+0xd) [0xba403d]
 16: (()+0x7df5) [0x7f674c587df5]
 17: (clone()+0x6d) [0x7f674b04e1ad]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
     0> 2015-07-14 13:29:05.271802 7f67383ff700 -1 *** Caught signal (Aborted) **
 in thread 7f67383ff700

 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: /usr/bin/ceph-osd() [0xac51f2]
 2: (()+0xf130) [0x7f674c58f130]
 3: (gsignal()+0x37) [0x7f674af8d5d7]
 4: (abort()+0x148) [0x7f674af8ecc8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f674b8a09b5]
 6: (()+0x5e926) [0x7f674b89e926]
 7: (()+0x5e953) [0x7f674b89e953]
 8: (()+0x5eb73) [0x7f674b89eb73]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xbc53ea]
 10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2d9) [0xafaff9]
 11: (ceph::HeartbeatMap::is_healthy()+0xde) [0xafb8ee]
 12: (OSD::handle_osd_ping(MOSDPing*)+0x733) [0x695f13]
 13: (OSD::heartbeat_dispatch(Message*)+0x2fb) [0x69718b]
 14: (DispatchQueue::entry()+0x62a) [0xc7cc4a]
 15: (DispatchQueue::DispatchThread::entry()+0xd) [0xba403d]
 16: (()+0x7df5) [0x7f674c587df5]
 17: (clone()+0x6d) [0x7f674b04e1ad]

The OSD can be restarted normally afterwards.
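
For triage, the heartbeat_map lines in a log like the one above can be summarized per worker thread, to see which osd_op_tp thread hit the ordinary timeout (15 in this log) and which one reached the suicide timeout (150 in this log). The following is a minimal sketch and not part of Ceph: the function name summarize_heartbeat_timeouts and the regular expression are assumptions derived only from the line format quoted in this report, and may need adjusting for other Ceph versions.

```python
import re
from collections import defaultdict

# Pattern assumed from the log lines quoted in this report, for example:
#   heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f672fbee700' had timed out after 15
#   heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f672fbee700' had suicide timed out after 150
HEARTBEAT_RE = re.compile(
    r"heartbeat_map is_healthy '(?P<pool>\S+) thread (?P<thread>0x[0-9a-f]+)' "
    r"had (?P<suicide>suicide )?timed out after (?P<secs>\d+)"
)


def summarize_heartbeat_timeouts(lines):
    """Count ordinary timeouts and record any suicide timeout per worker thread.

    Hypothetical helper: returns a dict keyed by thread id (as printed in the log).
    """
    summary = defaultdict(
        lambda: {"pool": None, "timeouts": 0, "suicide_after": None}
    )
    for line in lines:
        match = HEARTBEAT_RE.search(line)
        if match is None:
            continue  # not a heartbeat_map line
        entry = summary[match.group("thread")]
        entry["pool"] = match.group("pool")
        if match.group("suicide"):
            # Threshold (in seconds) reported when the suicide timeout fired.
            entry["suicide_after"] = int(match.group("secs"))
        else:
            entry["timeouts"] += 1
    return dict(summary)


def report(path):
    """Print one summary line per thread found in the log file at `path`."""
    with open(path, errors="replace") as handle:
        for thread, info in summarize_heartbeat_timeouts(handle).items():
            print(thread, info["pool"], info["timeouts"], info["suicide_after"])
```

Calling report() with the path of the OSD log file (the path depends on the installation) prints each thread id, its thread pool name, the number of ordinary timeout messages, and the suicide threshold if one was reported for that thread.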
