Bug #3801 (closed)

Cascading OSD failures beginning with common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")

Added by Justin Lott over 11 years ago. Updated about 11 years ago.

Status: Won't Fix
Priority: High
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Development
Tags: -
Backport: -
Regression: -
Severity: 3 - minor
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Pull request ID: -
Crash signature (v1): -
Crash signature (v2): -

Description

0.48.2 argonaut
Relevant logs are attached. Core dumps are available if needed.

*** User serially restarts OSDs to deal with horrible memory leak

2013-01-10 21:54:37.747479 7f99d3d6b700 -1 osd.112 84334 *** Got signal Terminated ***
2013-01-10 21:57:13.467199 7f4c04943700 -1 osd.113 84346 *** Got signal Terminated ***
2013-01-10 22:00:15.420213 7f4758062700 -1 osd.114 84356 *** Got signal Terminated ***
2013-01-10 22:03:10.913368 7f9d0b966700 -1 osd.115 84367 *** Got signal Terminated ***
2013-01-10 22:05:40.009230 7fe1aaeb0700 -1 osd.116 84377 *** Got signal Terminated ***
2013-01-10 22:08:09.774829 7febd6174700 -1 osd.117 84392 *** Got signal Terminated ***
2013-01-10 22:10:34.146350 7ffe30bbd700 -1 osd.118 84403 *** Got signal Terminated ***
2013-01-10 22:13:37.574571 7fc4475af700 -1 osd.119 84414 *** Got signal Terminated ***
2013-01-10 22:17:33.053924 7fe568e58700 -1 osd.120 84424 *** Got signal Terminated ***
2013-01-10 22:20:44.449101 7f5484eef700 -1 osd.121 84432 *** Got signal Terminated ***
2013-01-10 22:23:35.501348 7f37a630e700 -1 osd.122 84444 *** Got signal Terminated ***
2013-01-10 22:26:31.196371 7f3574ea0700 -1 osd.123 84455 *** Got signal Terminated ***
2013-01-10 22:29:49.188417 7ff3bd596700 -1 osd.124 84467 *** Got signal Terminated ***
2013-01-10 22:33:45.081867 7ff1f8938700 -1 osd.125 84481 *** Got signal Terminated ***

2013-01-10 22:38:31.063885 osd.125 crashes
    common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")
    *** This is a failure in HeartbeatMap::_check(), apparently due to a timeout attempting to 

2013-01-10 22:38:31.343801 osd.124 crashes
    common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")

2013-01-10 22:49:14.799142 osd.40 crashes 
    os/FileStore.cc: 3068: FAILED assert(!m_filestore_fail_eio || got != -5)

2013-01-10 22:55:09.501404 osd.119 crashes
    osd/PG.cc: 402: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)

2013-01-10 22:55:09.518908 osd.123 crashes
    osd/PG.cc: 402: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)

2013-01-10 22:55:11.962220 osd.112 crashes
    osd/PG.cc: 402: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)

2013-01-10 22:55:13.839166 osd.117 crashes
    osd/PG.cc: 402: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)

2013-01-10 23:21:10.243986 osd.112 is manually started

2013-01-10 23:23:51.669043 osd.112 crashes again
    osd/PG.cc: 402: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)

About four hours after that, all down OSDs were manually started and stayed up.
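
For readers not familiar with the first assert above: HeartbeatMap is the OSD's internal watchdog. The sketch below is a simplified, hypothetical illustration of that mechanism (the names are stand-ins, not the actual Ceph code): each worker thread arms a pair of deadlines before starting a unit of work, and a watchdog thread aborts the whole daemon if any worker blows past its suicide deadline, which is what a hung syscall on a wedged filesystem will eventually cause.

    #include <cassert>
    #include <chrono>
    #include <map>
    #include <string>

    using Clock = std::chrono::steady_clock;

    // Hypothetical stand-in for the per-thread heartbeat handle.
    struct HeartbeatHandle {
        Clock::time_point grace_deadline;    // thread is "unhealthy" once this passes
        Clock::time_point suicide_deadline;  // abort the daemon once this passes
    };

    std::map<std::string, HeartbeatHandle> handles;  // keyed by thread/work-queue name

    // A worker calls this right before starting a unit of work.
    void reset_deadlines(HeartbeatHandle& h,
                         std::chrono::seconds grace,
                         std::chrono::seconds suicide_grace) {
        auto now = Clock::now();
        h.grace_deadline = now + grace;
        h.suicide_deadline = now + suicide_grace;
    }

    // A heartbeat thread calls this periodically (same role as HeartbeatMap::_check()).
    bool check_all() {
        bool healthy = true;
        auto now = Clock::now();
        for (auto& [name, h] : handles) {
            if (now > h.grace_deadline)
                healthy = false;  // thread is slow: report it, mark the daemon unhealthy
            // A thread stuck in a hung syscall (e.g. on a wedged filesystem) will
            // eventually miss this deadline too, and the daemon kills itself:
            assert(!(now > h.suicide_deadline) && "hit suicide timeout");
        }
        return healthy;
    }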


Files

Actions #1

Updated by Ian Colle over 11 years ago

  • Assignee set to Samuel Just
  • Priority changed from Normal to Urgent
Actions #2

Updated by Sage Weil over 11 years ago

  • Status changed from New to Need More Info

The osd.40 error means the fs returned EIO on a read operation. Check your kern.log; there is probably a bad disk, a bad btrfs checksum, or something similar.
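
(Context: EIO is errno 5, so the "got != -5" check in the osd.40 assert covers exactly this case. Below is a minimal, hypothetical sketch of the pattern, with assumed names rather than the real FileStore code.)

    #include <cassert>
    #include <cerrno>
    #include <unistd.h>

    // Hedged sketch, not FileStore.cc itself: a read failing with EIO (errno 5)
    // means the filesystem or device reported an I/O error, e.g. a bad disk or
    // a bad btrfs checksum, and the OSD aborts rather than continue on it.
    ssize_t checked_read(int fd, void* buf, size_t len) {
        ssize_t got = ::read(fd, buf, len);
        if (got < 0)
            got = -errno;  // e.g. -EIO == -5
        // Mirrors the shape of the assert at os/FileStore.cc:3068; the flag name
        // and its meaning here are assumptions for illustration only.
        bool fail_on_eio = true;  // stands in for m_filestore_fail_eio
        assert(!fail_on_eio || got != -EIO);
        return got;
    }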

The log.head assertions are, I believe, a bug that has been fixed in bobtail. Sam, can you confirm?
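
(For what those assertions check, here is a simplified, hypothetical illustration of the invariant at osd/PG.cc:402; the types below are stand-ins for eversion_t and the PG log, not the real code. When an OSD merges a peer's log during peering, the two [tail, head] ranges must overlap, otherwise there is a gap of history neither side can fill.)

    #include <cassert>

    // Simplified stand-in for Ceph's eversion_t (epoch, version).
    struct Eversion {
        unsigned epoch = 0;
        unsigned version = 0;
    };

    bool operator>=(const Eversion& a, const Eversion& b) {
        if (a.epoch != b.epoch) return a.epoch > b.epoch;
        return a.version >= b.version;
    }

    // Simplified stand-in for the bounds of a PG's log.
    struct PGLog {
        Eversion tail;  // oldest entry still kept in the log
        Eversion head;  // newest entry
    };

    // Roughly the situation at osd/PG.cc:402: `log` is the local log, `olog`
    // is the log received from a peer during peering.
    void merge_log(const PGLog& log, const PGLog& olog) {
        // The two ranges must overlap; disjoint logs mean missing history,
        // so the OSD asserts out instead of guessing.
        assert(log.head >= olog.tail && olog.head >= log.tail);
        // ... a real merge would then walk the overlapping entries ...
    }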

Are osd.124 and osd.125 on the same host? Those timeouts generally mean that a system call hung, or something blocked the work queue from making any progress. Most of the time it is a kernel problem (a wedged file system or something); can you check dmesg/kern.log?

Actions #3

Updated by Justin Lott over 11 years ago

Sage Weil wrote:

> The osd.40 error means the fs returned EIO on a read operation. Check your kern.log; there is probably a bad disk, a bad btrfs checksum, or something similar.

That looks to be the case:
Jan 10 22:49:07 hpbs-c01-s03 kernel: [1228105.311892] btrfs csum failed ino 162509 off 1359872 csum 2259312665 private 3220244801
Jan 10 22:49:14 hpbs-c01-s03 kernel: [1228112.234443] btrfs csum failed ino 162509 off 1359872 csum 2259312665 private 3220244801
Jan 10 22:49:14 hpbs-c01-s03 kernel: [1228112.235144] btrfs csum failed ino 162509 off 1359872 csum 2259312665 private 3220244801

> The log.head assertions are, I believe, a bug that has been fixed in bobtail. Sam, can you confirm?

> Are osd.124 and osd.125 on the same host? Those timeouts generally mean that a system call hung, or something blocked the work queue from making any progress. Most of the time it is a kernel problem (a wedged file system or something); can you check dmesg/kern.log?

Yes, 124 & 125 are on the same host. The only interesting thing I see around that time on that host is this:
Jan 10 23:25:38 hpbs-c01-s09 kernel: [5031330.611400] btrfs: unlinked 38 orphans

We do see a small number of these across all hosts, but we were under the impression it was due to btrfs fragmentation / high iowait (we are currently converting to XFS as quickly as we can):
WARNING: at /build/buildd/linux-3.2.0/fs/btrfs/inode.c:1969 btrfs_orphan_commit_root+0xbb/0xd0 [btrfs]()

I have attached all the logs from the two hosts in question.

Actions #4

Updated by Ian Colle over 11 years ago

  • Status changed from Need More Info to New
Actions #5

Updated by Sage Weil over 11 years ago

  • Assignee deleted (Samuel Just)
  • Priority changed from Urgent to High

The olog stuff is fixed in bobtail, and won't be backported to argonaut.

I'm not sure what the root cause of the heartbeat map failures is; usually it is the disk I/O subsystem or the kernel. If the OSDs restarted and stayed up, then I don't think there is much to be done. We're focusing our efforts on stabilizing bobtail; I think that will be the best path forward. Please let us know if this comes up again!

Actions #6

Updated by Ian Colle about 11 years ago

  • Status changed from New to Won't Fix

Fixed in Bobtail, won't backport to Argonaut.
