Bug #4552

closed

osd: temporarily hung box marks down peers

Added by Sage Weil about 11 years ago. Updated almost 11 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
OSD
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

paravoid reports that a single machine that was being administered via megacli hung for a while, and was marked down, but then managed to mark down most of the rest of the cluster as well.

Probably the OSD needs to notice that it was itself out of commission for some period and should not take the lack of ping replies during that period seriously.
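One way to read that suggestion, as a minimal sketch (the class, method names, and grace value here are illustrative, not Ceph's actual OSD heartbeat code): before reporting peers as failed, the OSD checks how long its own heartbeat loop was stalled, and if that gap exceeds the grace period it resets the peers' timers instead of reporting them.

```python
HEARTBEAT_GRACE = 20.0  # illustrative grace period, in seconds


class HeartbeatChecker:
    """Sketch of self-stall detection; not the real Ceph implementation."""

    def __init__(self, now=0.0):
        self.last_tick = now
        self.last_reply = {}  # peer id -> time of last ping reply

    def note_reply(self, peer, now):
        """Record a heartbeat reply from a peer."""
        self.last_reply[peer] = now

    def tick(self, now):
        """Called periodically; returns the peers we may report as failed."""
        stalled_for = now - self.last_tick
        self.last_tick = now
        if stalled_for > HEARTBEAT_GRACE:
            # We were out of commission ourselves (e.g. the whole box hung),
            # so the silence from peers proves nothing: reset their timers
            # instead of reporting them down.
            for peer in self.last_reply:
                self.last_reply[peer] = now
            return []
        return [p for p, t in self.last_reply.items()
                if now - t > HEARTBEAT_GRACE]
```

In this model, a box that hangs for a minute (as described above) would reset its view of its peers on the first tick after waking up, rather than flooding the monitor with failure reports.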


Files

ceph-mon.ms-fe1001.log.bz2 (277 KB) ceph-mon.ms-fe1001.log.bz2 Faidon Liambotis, 03/26/2013 09:05 AM
ceph-osd-tree (3.85 KB) ceph-osd-tree Faidon Liambotis, 03/26/2013 09:05 AM

Actions #1

Updated by Faidon Liambotis about 11 years ago

(I'm paravoid -- thanks for opening the bug report)

Attached is the mon log, which shows the turn of events quite well; skip to 15:46:16.832446 for when the fun starts. Also attached is the Ceph OSD tree to make sense of it. The situation started when running megacli -PDList on ms-be1012, i.e. the box containing osds 132-143; the command hung for a minute or two, presumably halting I/O on that box completely.

The cluster runs 0.56.3; we hadn't had the chance to upgrade to 0.56.4 yet.

Actions #2

Updated by Greg Farnum about 11 years ago

Hmm, did you change any of the config around the num_reports stuff?
Or do we in fact have the OSD sending multiple failure reports for each of its peers that quickly? :/
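For context on the num_reports question: the monitor only marks an OSD down once enough failure reports have accumulated. A minimal sketch of that thresholding (the option names mirror Ceph's "mon osd min down reporters" / "mon osd min down reports" settings; the values and the class itself are illustrative, not the actual monitor code):

```python
from collections import defaultdict

# Thresholds modeled on Ceph's config options; values here are illustrative.
MIN_REPORTERS = 1  # distinct OSDs that must report the target
MIN_REPORTS = 3    # total failure reports required


class FailureTracker:
    """Sketch of monitor-side failure-report counting."""

    def __init__(self):
        self.reporters = defaultdict(set)  # target -> set of reporters
        self.counts = defaultdict(int)     # target -> total reports

    def report_failure(self, target, reporter):
        """Record one failure report; True when the target would be marked down."""
        self.reporters[target].add(reporter)
        self.counts[target] += 1
        return (len(self.reporters[target]) >= MIN_REPORTERS
                and self.counts[target] >= MIN_REPORTS)
```

With thresholds this low, a single stalled OSD that fires off several reports per peer in quick succession can push many peers over the line by itself, which is the scenario being asked about.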

Actions #3

Updated by Faidon Liambotis about 11 years ago

I haven't. The only semi-related config option I have is "mon osd down out interval = 600".

Actions #4

Updated by Ian Colle about 11 years ago

  • Assignee set to Sam Lang
  • Priority changed from Normal to High
  • Source changed from Development to Community (user)
Actions #5

Updated by Ian Colle about 11 years ago

  • Assignee changed from Sam Lang to Samuel Just
Actions #6

Updated by Samuel Just almost 11 years ago

  • Priority changed from High to Urgent
Actions #7

Updated by Sage Weil almost 11 years ago

  • Status changed from New to In Progress
Actions #8

Updated by Samuel Just almost 11 years ago

  • Status changed from In Progress to Resolved

I think the problem was likely caused by a severely backed up heartbeat client dispatch queue. d44cfc524fc0844c6027c586090302d45f360efb should take care of it if this is the case.
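A hedged sketch of the failure mode that comment describes (illustrative names and logic; not necessarily what the referenced commit does): if heartbeat replies pile up in a backed-up dispatch queue, then crediting each reply only when it is finally processed makes live peers look silent for the whole backlog. Crediting replies at their receive time instead keeps a drained backlog from triggering a wave of false failure reports.

```python
GRACE = 20.0  # illustrative heartbeat grace period, in seconds


def peers_to_report(pending, last_heard, now):
    """Drain pending heartbeat replies, then list peers that look silent.

    `pending` is a list of (receive_time, peer) replies that sat in the
    dispatch queue; `last_heard` maps peer -> time of last credited reply.
    Each reply is credited at the time it *arrived*, not the time we got
    around to processing it, so a long dispatch backlog does not age
    every peer's last reply by the queue delay.
    """
    for recv_time, peer in pending:
        last_heard[peer] = max(last_heard.get(peer, 0.0), recv_time)
    return sorted(p for p, t in last_heard.items() if now - t > GRACE)
```

In this model, peers whose replies arrived during the stall are cleared as soon as the queue drains, and only genuinely silent peers get reported.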
