Bug #3801

closed

Cascading OSD failures beginning with common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")

Added by Justin Lott over 11 years ago. Updated about 11 years ago.

Status: Won't Fix
Priority: High
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Development
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Ceph version: 0.48.2argonaut
Relevant logs are attached. Core dumps are available if needed.

*** User restarts OSDs one at a time to work around a severe memory leak

2013-01-10 21:54:37.747479 7f99d3d6b700 -1 osd.112 84334 *** Got signal Terminated ***
2013-01-10 21:57:13.467199 7f4c04943700 -1 osd.113 84346 *** Got signal Terminated ***
2013-01-10 22:00:15.420213 7f4758062700 -1 osd.114 84356 *** Got signal Terminated ***
2013-01-10 22:03:10.913368 7f9d0b966700 -1 osd.115 84367 *** Got signal Terminated ***
2013-01-10 22:05:40.009230 7fe1aaeb0700 -1 osd.116 84377 *** Got signal Terminated ***
2013-01-10 22:08:09.774829 7febd6174700 -1 osd.117 84392 *** Got signal Terminated ***
2013-01-10 22:10:34.146350 7ffe30bbd700 -1 osd.118 84403 *** Got signal Terminated ***
2013-01-10 22:13:37.574571 7fc4475af700 -1 osd.119 84414 *** Got signal Terminated ***
2013-01-10 22:17:33.053924 7fe568e58700 -1 osd.120 84424 *** Got signal Terminated ***
2013-01-10 22:20:44.449101 7f5484eef700 -1 osd.121 84432 *** Got signal Terminated ***
2013-01-10 22:23:35.501348 7f37a630e700 -1 osd.122 84444 *** Got signal Terminated ***
2013-01-10 22:26:31.196371 7f3574ea0700 -1 osd.123 84455 *** Got signal Terminated ***
2013-01-10 22:29:49.188417 7ff3bd596700 -1 osd.124 84467 *** Got signal Terminated ***
2013-01-10 22:33:45.081867 7ff1f8938700 -1 osd.125 84481 *** Got signal Terminated ***
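
The restart cadence in the lines above can be checked with a short script. This is a minimal sketch, not part of the original report: the regular expression is written against the line format shown here, and reading the number after the OSD name as the OSD map epoch is an assumption. The function names are made up for illustration.

```python
import re
from datetime import datetime

# Matches lines such as:
# 2013-01-10 21:54:37.747479 7f99d3d6b700 -1 osd.112 84334 *** Got signal Terminated ***
# Fields: timestamp, thread id, log level, OSD name, a number assumed to be the map epoch.
TERMINATED = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) \S+ -1 (osd\.\d+) (\d+) "
    r"\*\*\* Got signal Terminated \*\*\*"
)


def parse_terminations(lines):
    """Return (timestamp, osd_name, epoch) for each 'Got signal Terminated' line."""
    events = []
    for line in lines:
        m = TERMINATED.match(line.strip())
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
            events.append((ts, m.group(2), int(m.group(3))))
    return events


def restart_gaps(events):
    """Seconds between consecutive terminations, in the order given."""
    return [(b[0] - a[0]).total_seconds() for a, b in zip(events, events[1:])]


sample = [
    "2013-01-10 21:54:37.747479 7f99d3d6b700 -1 osd.112 84334 *** Got signal Terminated ***",
    "2013-01-10 21:57:13.467199 7f4c04943700 -1 osd.113 84346 *** Got signal Terminated ***",
    "2013-01-10 22:00:15.420213 7f4758062700 -1 osd.114 84356 *** Got signal Terminated ***",
]
events = parse_terminations(sample)
gaps = restart_gaps(events)
for (ts, osd, epoch), gap in zip(events[1:], gaps):
    print(f"{osd} terminated {gap:.1f}s after the previous OSD")
```

Run over all fourteen lines, the gaps come out at roughly two and a half to four minutes per OSD.
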

2013-01-10 22:38:31.063885 osd.125 crashes
    common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")
    *** This is a failure in HeartbeatMap::_check(), apparently due to a timeout attempting to 

2013-01-10 22:38:31.343801 osd.124 crashes
    common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")

2013-01-10 22:49:14.799142 osd.40 crashes 
    os/FileStore.cc: 3068: FAILED assert(!m_filestore_fail_eio || got != -5)

2013-01-10 22:55:09.501404 osd.119 crashes
    osd/PG.cc: 402: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)

2013-01-10 22:55:09.518908 osd.123 crashes
    osd/PG.cc: 402: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)

2013-01-10 22:55:11.962220 osd.112 crashes
    osd/PG.cc: 402: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)

2013-01-10 22:55:13.839166 osd.117 crashes
    osd/PG.cc: 402: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)

2013-01-10 23:21:10.243986 osd.112 is manually started

2013-01-10 23:23:51.669043 osd.112 crashes again
    osd/PG.cc: 402: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)

About four hours after that, all down OSDs were manually started and stayed up.
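
For context on the first assert: it fires when a worker thread has not reset its heartbeat within its suicide grace period. The following is a simplified Python model of that two-stage watchdog idea, not Ceph's actual code. The class and method names loosely echo HeartbeatMap, but the structure, the injected clock, and the 15 and 150 second values are assumptions chosen for illustration.

```python
import time


class SuicideTimeout(Exception):
    """Stands in for the failed assert that aborts the daemon."""


class HeartbeatHandle:
    def __init__(self, name):
        self.name = name
        self.timeout = 0.0          # deadline for "unhealthy"; 0 means unset
        self.suicide_timeout = 0.0  # deadline for giving up; 0 means unset


class HeartbeatMap:
    def __init__(self, clock=time.monotonic):
        self.clock = clock  # injectable so the model can be driven by a fake clock
        self.workers = []

    def add_worker(self, name):
        handle = HeartbeatHandle(name)
        self.workers.append(handle)
        return handle

    def reset_timeout(self, handle, grace, suicide_grace):
        """Called by a worker when it starts a unit of work."""
        now = self.clock()
        handle.timeout = now + grace
        handle.suicide_timeout = now + suicide_grace if suicide_grace else 0.0

    def clear_timeout(self, handle):
        """Called by a worker when it goes idle."""
        handle.timeout = 0.0
        handle.suicide_timeout = 0.0

    def _check(self, handle, now):
        healthy = True
        if handle.timeout and handle.timeout < now:
            healthy = False  # past the grace period: report unhealthy
        if handle.suicide_timeout and handle.suicide_timeout < now:
            # Past the suicide grace period: abort rather than hang forever.
            raise SuicideTimeout(f"{handle.name} hit suicide timeout")
        return healthy

    def is_healthy(self):
        now = self.clock()
        return all([self._check(h, now) for h in self.workers])


# Drive the model with a fake clock (all values hypothetical).
fake_now = [100.0]
hm = HeartbeatMap(clock=lambda: fake_now[0])
worker = hm.add_worker("worker-1")
hm.reset_timeout(worker, grace=15, suicide_grace=150)

fake_now[0] = 120.0   # stuck for 20s: unhealthy, still running
print(hm.is_healthy())
fake_now[0] = 300.0   # stuck for 200s: the next check raises
try:
    hm.is_healthy()
except SuicideTimeout as exc:
    print(exc)
```

In this model a thread that blocks on slow I/O first makes the daemon report itself unhealthy, and only later brings the whole process down, which would be consistent with two OSDs aborting within a second of each other after a long stall.
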

