Bug #4552
closed
osd: temporarily hung box marks down peers
0%
Description
paravoid reports that a single machine that was being administered via megacli hung for a while and was marked down, but then managed to mark down most of the rest of the cluster as well.
It probably didn't notice that it was out of commission for some period; it needs to, and should not take the lack of ping replies over that period seriously.
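To make the idea above concrete, here is a minimal sketch of a heartbeat checker that notices its own stall and restarts the grace period instead of blaming its peers. It is not Ceph's implementation: the class name, the parameters (grace, tick_interval, stall_factor) and their values are all assumptions chosen for illustration.

```python
import time


class HeartbeatChecker:
    """Hypothetical checker: tracks the last ping reply per peer and
    discounts periods during which the local process itself was stalled."""

    def __init__(self, grace=20.0, tick_interval=1.0, stall_factor=3.0,
                 clock=time.monotonic):
        self.grace = grace                  # seconds without a reply before a peer is suspect
        self.tick_interval = tick_interval  # how often check() is expected to run
        self.stall_factor = stall_factor    # gap multiplier that counts as "we were hung"
        self.clock = clock                  # injectable so the logic can be tested
        self.last_rx = {}                   # peer -> time of last ping reply
        self.last_tick = clock()

    def add_peer(self, peer):
        self.last_rx[peer] = self.clock()

    def on_ping_reply(self, peer):
        self.last_rx[peer] = self.clock()

    def check(self):
        """Return the peers that look failed, or [] if we were stalled."""
        now = self.clock()
        gap = now - self.last_tick
        self.last_tick = now
        if gap > self.stall_factor * self.tick_interval:
            # We did not run for a long time, so missing replies over that
            # period say nothing about the peers. Restart their grace period.
            self.last_rx = dict.fromkeys(self.last_rx, now)
            return []
        return [p for p, t in self.last_rx.items() if now - t > self.grace]
```

With this approach a box that hangs for two minutes reports nobody on its first check after waking up; a peer is reported only if it stays silent for a full grace period while the local checker is running normally.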
Files
Updated by Faidon Liambotis about 11 years ago
- File ceph-mon.ms-fe1001.log.bz2 added
- File ceph-osd-tree added
(I'm paravoid -- thanks for opening the bug report)
Attached is the mon log, which shows the turn of events quite well. Skip to 15:46:16.832446 for when the fun starts. Also attached is the Ceph OSD tree to make sense of that. The situation started when I ran megacli -PDList on ms-be1012, i.e. the box containing osds 132-143, and it hung for a minute or two, presumably halting I/O to that box completely.
The cluster runs 0.56.3; I haven't had the chance to upgrade to 0.56.4 yet.
Updated by Greg Farnum about 11 years ago
Hmm, did you change any of the config around the num_reports stuff?
Or do we in fact have the OSD sending multiple failure reports for each of its peers that quickly? :/
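The question above turns on how the monitor counts failure reports. The following is a rough sketch of that kind of tally, not the monitor's actual code: FailureTally, min_reports and min_reporters are hypothetical names, and the thresholds are illustrative. It shows why a threshold on the number of reports alone can be met by one misbehaving OSD, while a threshold on distinct reporters cannot.

```python
from collections import defaultdict


class FailureTally:
    """Hypothetical monitor-side tally of failure reports per target OSD."""

    def __init__(self, min_reports=3, min_reporters=1):
        self.min_reports = min_reports      # total reports needed (assumed value)
        self.min_reporters = min_reporters  # distinct reporters needed (assumed value)
        # target -> reporter -> number of reports from that reporter
        self.reports = defaultdict(lambda: defaultdict(int))

    def report(self, reporter, target):
        """Record one failure report and say whether target should be marked down."""
        self.reports[target][reporter] += 1
        return self.should_mark_down(target)

    def should_mark_down(self, target):
        by_reporter = self.reports.get(target, {})
        total = sum(by_reporter.values())
        return total >= self.min_reports and len(by_reporter) >= self.min_reporters
```

With min_reporters=1, three quick reports from a single hung OSD are enough to mark a healthy peer down. Requiring two or more distinct reporters means a single confused box cannot do that on its own.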
Updated by Faidon Liambotis about 11 years ago
I haven't. The only semi-related config option I have is "mon osd down out interval = 600".
Updated by Ian Colle about 11 years ago
- Assignee set to Sam Lang
- Priority changed from Normal to High
- Source changed from Development to Community (user)
Updated by Ian Colle about 11 years ago
- Assignee changed from Sam Lang to Samuel Just
Updated by Samuel Just almost 11 years ago
- Status changed from In Progress to Resolved
I think the problem was likely caused by a severely backed-up heartbeat client dispatch queue. Commit d44cfc524fc0844c6027c586090302d45f360efb should take care of it if this is the case.
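As an illustration of the failure mode described above (and not a description of what the commit changes), here is a small sketch of how a backed-up dispatch queue can produce false failure reports. Ping replies have arrived but have not been processed yet, so a check that looks only at processed replies sees stale timestamps. All names are hypothetical.

```python
from collections import deque


class HeartbeatClient:
    """Hypothetical heartbeat client whose ping replies wait in a dispatch queue."""

    def __init__(self, peers, grace, now=0.0):
        self.grace = grace
        self.last_rx = {p: now for p in peers}  # peer -> last processed reply time
        self.queue = deque()                    # (arrival_time, peer), not yet processed

    def receive(self, peer, arrival_time):
        # The reply reached this host but is still waiting to be dispatched.
        self.queue.append((arrival_time, peer))

    def drain(self):
        # Process every queued reply, crediting each peer with its arrival time.
        while self.queue:
            arrival, peer = self.queue.popleft()
            self.last_rx[peer] = max(self.last_rx[peer], arrival)

    def failed_peers(self, now, drain_first):
        if drain_first:
            self.drain()
        return sorted(p for p, t in self.last_rx.items() if now - t > self.grace)
```

If the check runs while the queue is backed up, every peer whose replies are still queued looks dead; draining the queue first, or otherwise accounting for queued replies, removes the false positives.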