Bug #3801
closed
Cascading OSD failures beginning with common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")
Status:
Won't Fix
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:
0%
Source:
Development
Severity:
3 - minor
Description
0.48.2 argonaut
Relevant logs are attached. Core dumps are available if needed.
*** User serially restarts OSDs to deal with horrible memory leak
2013-01-10 21:54:37.747479 7f99d3d6b700 -1 osd.112 84334 *** Got signal Terminated ***
2013-01-10 21:57:13.467199 7f4c04943700 -1 osd.113 84346 *** Got signal Terminated ***
2013-01-10 22:00:15.420213 7f4758062700 -1 osd.114 84356 *** Got signal Terminated ***
2013-01-10 22:03:10.913368 7f9d0b966700 -1 osd.115 84367 *** Got signal Terminated ***
2013-01-10 22:05:40.009230 7fe1aaeb0700 -1 osd.116 84377 *** Got signal Terminated ***
2013-01-10 22:08:09.774829 7febd6174700 -1 osd.117 84392 *** Got signal Terminated ***
2013-01-10 22:10:34.146350 7ffe30bbd700 -1 osd.118 84403 *** Got signal Terminated ***
2013-01-10 22:13:37.574571 7fc4475af700 -1 osd.119 84414 *** Got signal Terminated ***
2013-01-10 22:17:33.053924 7fe568e58700 -1 osd.120 84424 *** Got signal Terminated ***
2013-01-10 22:20:44.449101 7f5484eef700 -1 osd.121 84432 *** Got signal Terminated ***
2013-01-10 22:23:35.501348 7f37a630e700 -1 osd.122 84444 *** Got signal Terminated ***
2013-01-10 22:26:31.196371 7f3574ea0700 -1 osd.123 84455 *** Got signal Terminated ***
2013-01-10 22:29:49.188417 7ff3bd596700 -1 osd.124 84467 *** Got signal Terminated ***
2013-01-10 22:33:45.081867 7ff1f8938700 -1 osd.125 84481 *** Got signal Terminated ***

2013-01-10 22:38:31.063885 osd.125 crashes common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")
*** This is a failure in HeartbeatMap::_check(), apparently due to a timeout attempting to
2013-01-10 22:38:31.343801 osd.124 crashes common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")
2013-01-10 22:49:14.799142 osd.40 crashes os/FileStore.cc: 3068: FAILED assert(!m_filestore_fail_eio || got != -5)
2013-01-10 22:55:09.501404 osd.119 crashes osd/PG.cc: 402: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)
2013-01-10 22:55:09.518908 osd.123 crashes osd/PG.cc: 402: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)
2013-01-10 22:55:11.962220 osd.112 crashes osd/PG.cc: 402: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)
2013-01-10 22:55:13.839166 osd.117 crashes osd/PG.cc: 402: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)

2013-01-10 23:21:10.243986 osd.112 is manually started
2013-01-10 23:23:51.669043 osd.112 crashes again osd/PG.cc: 402: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)
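For anyone triaging this, it may help to group the crashes in the timeline above by assertion site. The following is only a rough sketch: it assumes the summarized "timestamp osd.N crashes file: line: FAILED assert(...)" shape used in this report, which is not the raw Ceph log format, and the names TIMELINE, CRASH_RE and group_crashes are hypothetical helpers made up for illustration.

```python
import re
from collections import defaultdict

# Crash summary lines copied from the timeline in this report.
# Assumption: this is the report's hand-summarized shape, not raw OSD log output.
TIMELINE = """\
2013-01-10 22:38:31.063885 osd.125 crashes common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")
2013-01-10 22:38:31.343801 osd.124 crashes common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")
2013-01-10 22:49:14.799142 osd.40 crashes os/FileStore.cc: 3068: FAILED assert(!m_filestore_fail_eio || got != -5)
2013-01-10 22:55:09.501404 osd.119 crashes osd/PG.cc: 402: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)
2013-01-10 22:55:09.518908 osd.123 crashes osd/PG.cc: 402: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)
2013-01-10 22:55:11.962220 osd.112 crashes osd/PG.cc: 402: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)
2013-01-10 22:55:13.839166 osd.117 crashes osd/PG.cc: 402: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)
2013-01-10 23:21:10.243986 osd.112 is manually started
2013-01-10 23:23:51.669043 osd.112 crashes again osd/PG.cc: 402: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)
"""

# Hypothetical pattern for the summarized crash lines above.
CRASH_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<osd>osd\.\d+) crashes(?: again)? "
    r"(?P<site>\S+: \d+): FAILED assert\((?P<expr>.*)\)$"
)


def group_crashes(text):
    """Map each assertion site ("file: line") to the OSDs that hit it, in order."""
    groups = defaultdict(list)
    for line in text.splitlines():
        match = CRASH_RE.match(line.strip())
        if match:  # non-crash lines (e.g. "is manually started") are skipped
            groups[match.group("site")].append(match.group("osd"))
    return dict(groups)


if __name__ == "__main__":
    for site, osds in group_crashes(TIMELINE).items():
        print(f"{site}: {', '.join(osds)}")
```

Grouped this way, the two suicide-timeout crashes hit the two most recently restarted OSDs (osd.125 and osd.124), and every OSD that later hit the PG.cc assertion is also one of the OSDs restarted earlier in the sequence.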
About four hours after that, all down OSDs were manually started and stayed up.