Bug #4552

closed

osd: temporarily hung box marks down peers

Added by Sage Weil about 11 years ago. Updated almost 11 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
OSD
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

paravoid reports that a single machine that was being administered via megacli hung for a while, and was marked down, but then managed to mark down most of the rest of the cluster as well.

Probably the OSD needs to notice that it was itself out of commission for some period and should not take the lack of ping replies during that period seriously.
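One way to read that suggestion, as a minimal sketch (the class, method names, and grace value here are illustrative, not Ceph's actual OSD heartbeat code): before reporting peers as failed, the OSD checks how long its own heartbeat loop was stalled, and if that gap exceeds the grace period it resets the peers' timers instead of reporting them.

```python
HEARTBEAT_GRACE = 20.0  # illustrative grace period, in seconds


class HeartbeatChecker:
    """Sketch of self-stall detection; not the real Ceph implementation."""

    def __init__(self, now=0.0):
        self.last_tick = now
        self.last_reply = {}  # peer id -> time of last ping reply

    def note_reply(self, peer, now):
        """Record a heartbeat reply from a peer."""
        self.last_reply[peer] = now

    def tick(self, now):
        """Called periodically; returns the peers we may report as failed."""
        stalled_for = now - self.last_tick
        self.last_tick = now
        if stalled_for > HEARTBEAT_GRACE:
            # We were out of commission ourselves (e.g. the whole box hung),
            # so the silence from peers proves nothing: reset their timers
            # instead of reporting them down.
            for peer in self.last_reply:
                self.last_reply[peer] = now
            return []
        return [p for p, t in self.last_reply.items()
                if now - t > HEARTBEAT_GRACE]
```

In this model, a box that hangs for a minute (as described above) would reset its view of its peers on the first tick after waking up, rather than flooding the monitor with failure reports.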


Files

ceph-mon.ms-fe1001.log.bz2 (277 KB) ceph-mon.ms-fe1001.log.bz2 Faidon Liambotis, 03/26/2013 09:05 AM
ceph-osd-tree (3.85 KB) ceph-osd-tree Faidon Liambotis, 03/26/2013 09:05 AM

Actions #1

Updated by Faidon Liambotis about 11 years ago

(I'm paravoid -- thanks for opening the bug report)

Attached is the mon log, which shows the turn of events quite well; skip to 15:46:16.832446 for when the fun starts. Also attached is the Ceph OSD tree to make sense of it. The situation started when running megacli -PDList on ms-be1012, i.e. the box containing osds 132-143; the command hung for a minute or two, presumably halting I/O on that box completely.

The cluster runs 0.56.3; we hadn't had the chance to upgrade to 0.56.4 yet.

Actions #2

Updated by Greg Farnum about 11 years ago

Hmm, did you change any of the config around the num_reports stuff?
Or do we in fact have the OSD sending multiple failure reports for each of its peers that quickly? :/
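For context on the num_reports question: the monitor only marks an OSD down once enough failure reports have accumulated. A minimal sketch of that thresholding (the option names mirror Ceph's "mon osd min down reporters" / "mon osd min down reports" settings; the values and the class itself are illustrative, not the actual monitor code):

```python
from collections import defaultdict

# Thresholds modeled on Ceph's config options; values here are illustrative.
MIN_REPORTERS = 1  # distinct OSDs that must report the target
MIN_REPORTS = 3    # total failure reports required


class FailureTracker:
    """Sketch of monitor-side failure-report counting."""

    def __init__(self):
        self.reporters = defaultdict(set)  # target -> set of reporters
        self.counts = defaultdict(int)     # target -> total reports

    def report_failure(self, target, reporter):
        """Record one failure report; True when the target would be marked down."""
        self.reporters[target].add(reporter)
        self.counts[target] += 1
        return (len(self.reporters[target]) >= MIN_REPORTERS
                and self.counts[target] >= MIN_REPORTS)
```

With thresholds this low, a single stalled OSD that fires off several reports per peer in quick succession can push many peers over the line by itself, which is the scenario being asked about.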

Actions #3

Updated by Faidon Liambotis about 11 years ago

I haven't. The only semi-related config option I have is "mon osd down out interval = 600".

Actions #4

Updated by Ian Colle about 11 years ago

  • Assignee set to Sam Lang
  • Priority changed from Normal to High
  • Source changed from Development to Community (user)
Actions #5

Updated by Ian Colle about 11 years ago

  • Assignee changed from Sam Lang to Samuel Just
Actions #6

Updated by Samuel Just almost 11 years ago

  • Priority changed from High to Urgent
Actions #7

Updated by Sage Weil almost 11 years ago

  • Status changed from New to In Progress
Actions #8

Updated by Samuel Just almost 11 years ago

  • Status changed from In Progress to Resolved

I think the problem was likely caused by a severely backed up heartbeat client dispatch queue. d44cfc524fc0844c6027c586090302d45f360efb should take care of it if this is the case.
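A hedged sketch of the failure mode that comment describes (illustrative names and logic; not necessarily what the referenced commit does): if heartbeat replies pile up in a backed-up dispatch queue, then crediting each reply only when it is finally processed makes live peers look silent for the whole backlog. Crediting replies at their receive time instead keeps a drained backlog from triggering a wave of false failure reports.

```python
GRACE = 20.0  # illustrative heartbeat grace period, in seconds


def peers_to_report(pending, last_heard, now):
    """Drain pending heartbeat replies, then list peers that look silent.

    `pending` is a list of (receive_time, peer) replies that sat in the
    dispatch queue; `last_heard` maps peer -> time of last credited reply.
    Each reply is credited at the time it *arrived*, not the time we got
    around to processing it, so a long dispatch backlog does not age
    every peer's last reply by the queue delay.
    """
    for recv_time, peer in pending:
        last_heard[peer] = max(last_heard.get(peer, 0.0), recv_time)
    return sorted(p for p, t in last_heard.items() if now - t > GRACE)
```

In this model, peers whose replies arrived during the stall are cleared as soon as the queue drains, and only genuinely silent peers get reported.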
