Bug #4552
closed
osd: temporarily hung box marks down peers
Added by Sage Weil about 11 years ago.
Updated about 11 years ago.
Description
paravoid reports that a single machine that was being administered via megacli hung for a while, and was marked down, but then managed to mark down most of the rest of the cluster as well.
Probably it didn't notice (and needs to notice) that it was out of commission for some period, and it should not take the lack of ping replies over that period seriously.
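To make the idea in the description concrete, here is a minimal sketch of a heartbeat checker that notices its own stall. It is not Ceph's OSD code: the class `HeartbeatTracker`, its methods, and the `grace` parameter are all hypothetical names chosen for illustration. The assumption is that the checker runs periodically, so a gap between two consecutive checks that exceeds the grace period means the local box was hung, and missed replies over that window say nothing about the peers.

```python
import time


class HeartbeatTracker:
    """Hypothetical sketch of self-stall detection; not Ceph's implementation."""

    def __init__(self, grace, clock=time.monotonic):
        self.grace = grace        # seconds of silence before a peer is suspected
        self.clock = clock        # injectable clock, monotonic by default
        self.last_rx = {}         # peer id -> time of the last ping reply
        self.last_tick = clock()  # time of our own previous check

    def add_peer(self, peer):
        self.last_rx[peer] = self.clock()

    def on_reply(self, peer):
        self.last_rx[peer] = self.clock()

    def check(self):
        """Return the peers to report as failed, or [] if we were stalled."""
        now = self.clock()
        stalled = (now - self.last_tick) > self.grace
        self.last_tick = now
        if stalled:
            # We were out of commission ourselves, so restart every peer's
            # grace period instead of blaming them for the missed replies.
            for peer in self.last_rx:
                self.last_rx[peer] = now
            return []
        return sorted(p for p, t in self.last_rx.items() if now - t > self.grace)
```

With a fake clock, a peer that stays silent while the checker runs normally is reported, but a long gap between two checks (the hung box) reports nobody and gives every peer a fresh grace period.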
Files
(I'm paravoid -- thanks for opening the bug report)
Attached is the mon log, which shows the turn of events quite well. Skip to 15:46:16.832446 for when the fun starts. Also attached is the Ceph OSD tree to make sense of that. The situation started when running megacli -PDList on ms-be1012, i.e. the box containing osds 132-143; it hung for a minute or two, presumably halting I/O to that box completely.
The cluster runs 0.56.3; I haven't had the chance to upgrade to .4 yet.
Hmm, did you change any of the config around the num_reports stuff?
Or do we in fact have the OSD sending multiple failure reports for each of its peers that quickly? :/
I haven't. The only semi-related config option I have is "mon osd down out interval = 600".
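The question about num_reports above can be illustrated with a small sketch of how a monitor might count failure reports before marking an OSD down. This is an assumption-laden illustration, not the monitor's actual logic: the class `FailureReports` and the parameters `min_reporters` and `min_reports` are hypothetical names, and the thresholds are illustrative. It shows why a single OSD that sends several reports in quick succession could satisfy a report-count threshold on its own unless a minimum number of distinct reporters is also required.

```python
from collections import defaultdict


class FailureReports:
    """Hypothetical sketch of failure-report counting; not Ceph's monitor code."""

    def __init__(self, min_reporters=1, min_reports=3):
        self.min_reporters = min_reporters  # distinct OSDs that must complain
        self.min_reports = min_reports      # total reports that must arrive
        # target osd -> reporting osd -> number of reports received
        self.reports = defaultdict(lambda: defaultdict(int))

    def report(self, reporter, target):
        """Record one report; return True if target should be marked down."""
        by_reporter = self.reports[target]
        by_reporter[reporter] += 1
        return (len(by_reporter) >= self.min_reporters
                and sum(by_reporter.values()) >= self.min_reports)
```

With `min_reporters=1`, three rapid reports from one hung OSD are enough to mark a healthy peer down. Raising `min_reporters` to 2 means a second, independent OSD has to agree first.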
- Assignee set to Sam Lang
- Priority changed from Normal to High
- Source changed from Development to Community (user)
- Assignee changed from Sam Lang to Samuel Just
- Priority changed from High to Urgent
- Status changed from New to In Progress
- Status changed from In Progress to Resolved
I think the problem was likely caused by a severely backed up heartbeat client dispatch queue. d44cfc524fc0844c6027c586090302d45f360efb should take care of it if this is the case.