Bug #3378

closed

common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")

Added by Matthew Roy over 11 years ago. Updated over 11 years ago.

Status:
Can't reproduce
Priority:
Low
Assignee:
-
Category:
OSD
Target version:
-
% Done:

0%

Source:
Community (user)

Description

This is a cluster of 2 OSDs that is generally unhappy with life. After deleting the cephfs pools, creating the new pool took a long time, during which system load was fairly low on both OSDs. After a while, OSD 0 died with FAILED assert(0 == "hit suicide timeout"). Pool creation eventually succeeded.

This might be a duplicate of #2784.


Files

ceph-osd.0.log-assert_failed (5.02 MB), Log from OSD, Matthew Roy, 10/21/2012 06:14 PM
ceph.conf (1.46 KB), Mark Nelson, 12/25/2012 08:35 PM
osd.4.log.1.gz (1.85 MB), Mark Nelson, 12/25/2012 08:35 PM

Updated by Mark Nelson over 11 years ago

Saw this show up during parametric sweep testing on EXT4 with 8 concurrent OSD disk threads. The Ceph build is from the gitbuilder next branch for the bobtail release: 0.55.1-344-g5f25f9f-1precise. I have attached the compressed OSD log and the ceph.conf file.

Updated by Sage Weil over 11 years ago

  • Status changed from New to Can't reproduce

The suicide timeout is only the symptom. It usually means the thread is blocked in a hung syscall. In your case, Matthew, it looks like it was mostly making progress, just very slowly. Either way, we need more verbose filestore logs to see which syscall is blocked, or a core file, or something similar.
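
For readers unfamiliar with the mechanism, here is a rough sketch of how a heartbeat watchdog of this kind can turn a stuck or very slow thread into a "hit suicide timeout" abort. It is an illustration only, not the code in common/HeartbeatMap.cc: the names Heartbeat, make_heartbeat, touch, check and Health are hypothetical, and the two thresholds are assumed to be a warning grace period and a longer abort grace period.

```cpp
#include <cassert>
#include <chrono>

using Clock = std::chrono::steady_clock;

// Hypothetical, simplified per-thread heartbeat record (not Ceph's real type).
struct Heartbeat {
    Clock::time_point last_touch;        // last time the worker checked in
    std::chrono::seconds grace;          // past this: report the thread as timed out
    std::chrono::seconds suicide_grace;  // past this: give up and abort the daemon
};

enum class Health { Ok, TimedOut, Suicide };

// Build a record. The abort threshold must not be shorter than the warning one.
inline Heartbeat make_heartbeat(Clock::time_point now,
                                std::chrono::seconds grace,
                                std::chrono::seconds suicide_grace) {
    assert(grace <= suicide_grace);
    return Heartbeat{now, grace, suicide_grace};
}

// A worker thread calls this whenever it makes progress.
inline void touch(Heartbeat& hb, Clock::time_point now) {
    hb.last_touch = now;
}

// A watchdog calls this periodically and classifies the worker by idle time.
inline Health check(const Heartbeat& hb, Clock::time_point now) {
    const auto idle = now - hb.last_touch;
    if (idle >= hb.suicide_grace) return Health::Suicide;
    if (idle >= hb.grace) return Health::TimedOut;
    return Health::Ok;
}
```

In this sketch a worker that sits in one blocking call for longer than suicide_grace never reaches touch(), so the watchdog sees Health::Suicide whether the call is truly hung or just extremely slow. That is why the assert alone does not say which syscall was the problem.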
