Bug #3801 (closed)

Cascading OSD failures beginning with common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")

Added by Justin Lott over 11 years ago. Updated about 11 years ago.

Status: Won't Fix
Priority: High
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Development
Tags: -
Backport: -
Regression: -
Severity: 3 - minor
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Pull request ID: -
Crash signature (v1): -
Crash signature (v2): -

Description

0.48.2 argonaut
Relevant logs are attached. Core dumps are available if needed.

*** User serially restarts OSDs to deal with horrible memory leak

2013-01-10 21:54:37.747479 7f99d3d6b700 -1 osd.112 84334 *** Got signal Terminated ***
2013-01-10 21:57:13.467199 7f4c04943700 -1 osd.113 84346 *** Got signal Terminated ***
2013-01-10 22:00:15.420213 7f4758062700 -1 osd.114 84356 *** Got signal Terminated ***
2013-01-10 22:03:10.913368 7f9d0b966700 -1 osd.115 84367 *** Got signal Terminated ***
2013-01-10 22:05:40.009230 7fe1aaeb0700 -1 osd.116 84377 *** Got signal Terminated ***
2013-01-10 22:08:09.774829 7febd6174700 -1 osd.117 84392 *** Got signal Terminated ***
2013-01-10 22:10:34.146350 7ffe30bbd700 -1 osd.118 84403 *** Got signal Terminated ***
2013-01-10 22:13:37.574571 7fc4475af700 -1 osd.119 84414 *** Got signal Terminated ***
2013-01-10 22:17:33.053924 7fe568e58700 -1 osd.120 84424 *** Got signal Terminated ***
2013-01-10 22:20:44.449101 7f5484eef700 -1 osd.121 84432 *** Got signal Terminated ***
2013-01-10 22:23:35.501348 7f37a630e700 -1 osd.122 84444 *** Got signal Terminated ***
2013-01-10 22:26:31.196371 7f3574ea0700 -1 osd.123 84455 *** Got signal Terminated ***
2013-01-10 22:29:49.188417 7ff3bd596700 -1 osd.124 84467 *** Got signal Terminated ***
2013-01-10 22:33:45.081867 7ff1f8938700 -1 osd.125 84481 *** Got signal Terminated ***

2013-01-10 22:38:31.063885 osd.125 crashes
    common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")
    *** This is a failure in HeartbeatMap::_check(), apparently due to a timeout attempting to 

2013-01-10 22:38:31.343801 osd.124 crashes
    common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")

2013-01-10 22:49:14.799142 osd.40 crashes 
    os/FileStore.cc: 3068: FAILED assert(!m_filestore_fail_eio || got != -5)

2013-01-10 22:55:09.501404 osd.119 crashes
    osd/PG.cc: 402: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)

2013-01-10 22:55:09.518908 osd.123 crashes
    osd/PG.cc: 402: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)

2013-01-10 22:55:11.962220 osd.112 crashes
    osd/PG.cc: 402: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)

2013-01-10 22:55:13.839166 osd.117 crashes
    osd/PG.cc: 402: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)

2013-01-10 23:21:10.243986 osd.112 is manually started

2013-01-10 23:23:51.669043 osd.112 crashes again
    osd/PG.cc: 402: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)

About four hours after that, all down OSDs were manually started and stayed up.
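
For readers not familiar with the first assert above: HeartbeatMap is the OSD's internal watchdog. The sketch below is a simplified, hypothetical illustration of that mechanism (the names are stand-ins, not the actual Ceph code): each worker thread arms a pair of deadlines before starting a unit of work, and a watchdog thread aborts the whole daemon if any worker blows past its suicide deadline, which is what a hung syscall on a wedged filesystem will eventually cause.

    #include <cassert>
    #include <chrono>
    #include <map>
    #include <string>

    using Clock = std::chrono::steady_clock;

    // Hypothetical stand-in for the per-thread heartbeat handle.
    struct HeartbeatHandle {
        Clock::time_point grace_deadline;    // thread is "unhealthy" once this passes
        Clock::time_point suicide_deadline;  // abort the daemon once this passes
    };

    std::map<std::string, HeartbeatHandle> handles;  // keyed by thread/work-queue name

    // A worker calls this right before starting a unit of work.
    void reset_deadlines(HeartbeatHandle& h,
                         std::chrono::seconds grace,
                         std::chrono::seconds suicide_grace) {
        auto now = Clock::now();
        h.grace_deadline = now + grace;
        h.suicide_deadline = now + suicide_grace;
    }

    // A heartbeat thread calls this periodically (same role as HeartbeatMap::_check()).
    bool check_all() {
        bool healthy = true;
        auto now = Clock::now();
        for (auto& [name, h] : handles) {
            if (now > h.grace_deadline)
                healthy = false;  // thread is slow: report it, mark the daemon unhealthy
            // A thread stuck in a hung syscall (e.g. on a wedged filesystem) will
            // eventually miss this deadline too, and the daemon kills itself:
            assert(!(now > h.suicide_deadline) && "hit suicide timeout");
        }
        return healthy;
    }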


Files

Actions #1

Updated by Ian Colle over 11 years ago

  • Assignee set to Samuel Just
  • Priority changed from Normal to Urgent
Actions #2

Updated by Sage Weil over 11 years ago

  • Status changed from New to Need More Info

The osd.40 error means the fs returned EIO on a read operation. Check your kern.log; there is probably a bad disk, a bad btrfs checksum, or something similar.
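
(Context: EIO is errno 5, so the "got != -5" check in the osd.40 assert covers exactly this case. Below is a minimal, hypothetical sketch of the pattern, with assumed names rather than the real FileStore code.)

    #include <cassert>
    #include <cerrno>
    #include <unistd.h>

    // Hedged sketch, not FileStore.cc itself: a read failing with EIO (errno 5)
    // means the filesystem or device reported an I/O error, e.g. a bad disk or
    // a bad btrfs checksum, and the OSD aborts rather than continue on it.
    ssize_t checked_read(int fd, void* buf, size_t len) {
        ssize_t got = ::read(fd, buf, len);
        if (got < 0)
            got = -errno;  // e.g. -EIO == -5
        // Mirrors the shape of the assert at os/FileStore.cc:3068; the flag name
        // and its meaning here are assumptions for illustration only.
        bool fail_on_eio = true;  // stands in for m_filestore_fail_eio
        assert(!fail_on_eio || got != -EIO);
        return got;
    }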

The log.head assertions are, I believe, a bug that has been fixed in bobtail. Sam, can you confirm?
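
(For what those assertions check, here is a simplified, hypothetical illustration of the invariant at osd/PG.cc:402; the types below are stand-ins for eversion_t and the PG log, not the real code. When an OSD merges a peer's log during peering, the two [tail, head] ranges must overlap, otherwise there is a gap of history neither side can fill.)

    #include <cassert>

    // Simplified stand-in for Ceph's eversion_t (epoch, version).
    struct Eversion {
        unsigned epoch = 0;
        unsigned version = 0;
    };

    bool operator>=(const Eversion& a, const Eversion& b) {
        if (a.epoch != b.epoch) return a.epoch > b.epoch;
        return a.version >= b.version;
    }

    // Simplified stand-in for the bounds of a PG's log.
    struct PGLog {
        Eversion tail;  // oldest entry still kept in the log
        Eversion head;  // newest entry
    };

    // Roughly the situation at osd/PG.cc:402: `log` is the local log, `olog`
    // is the log received from a peer during peering.
    void merge_log(const PGLog& log, const PGLog& olog) {
        // The two ranges must overlap; disjoint logs mean missing history,
        // so the OSD asserts out instead of guessing.
        assert(log.head >= olog.tail && olog.head >= log.tail);
        // ... a real merge would then walk the overlapping entries ...
    }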

Are osd.124 and osd.125 on the same host? Those timeouts generally mean that a system call hung, or something blocked the work queue from making any progress. Most of the time it is a kernel problem (a wedged file system or something); can you check dmesg/kern.log?

Actions #3

Updated by Justin Lott over 11 years ago

Sage Weil wrote:

> The osd.40 error means the fs returned EIO on a read operation. Check your kern.log; there is probably a bad disk, a bad btrfs checksum, or something similar.

That looks to be the case:
Jan 10 22:49:07 hpbs-c01-s03 kernel: [1228105.311892] btrfs csum failed ino 162509 off 1359872 csum 2259312665 private 3220244801
Jan 10 22:49:14 hpbs-c01-s03 kernel: [1228112.234443] btrfs csum failed ino 162509 off 1359872 csum 2259312665 private 3220244801
Jan 10 22:49:14 hpbs-c01-s03 kernel: [1228112.235144] btrfs csum failed ino 162509 off 1359872 csum 2259312665 private 3220244801

> The log.head assertions are, I believe, a bug that has been fixed in bobtail. Sam, can you confirm?

> Are osd.124 and osd.125 on the same host? Those timeouts generally mean that a system call hung, or something blocked the work queue from making any progress. Most of the time it is a kernel problem (a wedged file system or something); can you check dmesg/kern.log?

Yes, 124 & 125 are on the same host. The only interesting thing I see around that time on that host is this:
Jan 10 23:25:38 hpbs-c01-s09 kernel: [5031330.611400] btrfs: unlinked 38 orphans

We do see a small number of these across all hosts, but we were under the impression it was due to btrfs fragmentation / high iowait (we are currently converting to XFS as quickly as we can):
WARNING: at /build/buildd/linux-3.2.0/fs/btrfs/inode.c:1969 btrfs_orphan_commit_root+0xbb/0xd0 [btrfs]()

I have attached all the logs from the two hosts in question.

Actions #4

Updated by Ian Colle over 11 years ago

  • Status changed from Need More Info to New
Actions #5

Updated by Sage Weil over 11 years ago

  • Assignee deleted (Samuel Just)
  • Priority changed from Urgent to High

The olog stuff is fixed in bobtail, and won't be backported to argonaut.

I'm not sure what the root cause of the heartbeat map failures is; usually it is the disk I/O subsystem or the kernel. If the OSDs restarted and stayed up, then I don't think there is much to be done. We're focusing our efforts on stabilizing bobtail; I think that will be the best path forward. Please let us know if this comes up again!

Actions #6

Updated by Ian Colle about 11 years ago

  • Status changed from New to Won't Fix

Fixed in Bobtail, won't backport to Argonaut.
