Bug #9844 (closed)

"initiating reconnect" (log) race; crash of multiple OSDs (domino effect)

Added by Dmitry Smirnov over 9 years ago. Updated over 9 years ago.

Status: Won't Fix
Priority: High
Assignee: -
Category: OSD
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression:
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

On 0.87, while watching "ceph osd tree", I notice that one OSD (leveldb/keyvaluestore-dev) is "down".
In its log I see:

2014-10-21 16:46:03.621206 7f616f341700  0 -- 192.168.0.2:6815/2002172 >> 192.168.0.250:6820/20594 pipe(0x7f61cd947080 sd=20 :36994 s=2 pgs=58604 cs=34763 l=0 c=0x7f61dbded760).fault, initiating reconnect
2014-10-21 16:46:03.622258 7f616f341700  0 -- 192.168.0.2:6815/2002172 >> 192.168.0.250:6820/20594 pipe(0x7f61cd947080 sd=20 :36995 s=2 pgs=58605 cs=34765 l=0 c=0x7f61dbded760).fault, initiating reconnect

logged over and over again at a rate of ~1000 lines per second (first problem).
Shortly afterwards the logging stops with:

2014-10-21 16:46:07.414672 7f616f341700  0 -- 192.168.0.2:6815/2002172 >> 192.168.0.250:6820/20594 pipe(0x7f61cd947080 sd=20 :37002 s=2 pgs=58612 cs=34779 l=0 c=0x7f61dbded760).fault with nothing to send, going to standby

Then, a minute later, thousands of "initiating reconnect" lines are logged again, followed by "fault with nothing to send, going to standby".
At this point I notice 6 other OSDs down on 5 different hosts, all of which died at the same time (second problem) with something like the following in their logs:

   -10> 2014-10-21 16:59:51.883155 7f2a8abd2700  1 heartbeat_map is_healthy 'OSD::recovery_tp thread 0x7f2a6a8e9700' had timed out after 30
    -9> 2014-10-21 16:59:51.883167 7f2a8abd2700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f2a6b0ea700' had timed out after 15
    -8> 2014-10-21 16:59:51.883169 7f2a8abd2700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f2a6b8eb700' had timed out after 15
    -7> 2014-10-21 16:59:51.883171 7f2a8abd2700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f2a6c8ed700' had timed out after 15
    -6> 2014-10-21 16:59:51.883172 7f2a8abd2700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f2a6e8f1700' had timed out after 15
    -5> 2014-10-21 16:59:51.883174 7f2a8abd2700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f2a6f0f2700' had timed out after 15
    -4> 2014-10-21 16:59:51.883175 7f2a8abd2700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f2a6f8f3700' had timed out after 15
    -3> 2014-10-21 16:59:56.883261 7f2a8abd2700  1 heartbeat_map is_healthy 'OSD::recovery_tp thread 0x7f2a6a8e9700' had timed out after 30
    -2> 2014-10-21 16:59:56.883277 7f2a8abd2700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f2a6b0ea700' had timed out after 15
    -1> 2014-10-21 16:59:56.883281 7f2a8abd2700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f2a6b0ea700' had suicide timed out after 150
     0> 2014-10-21 16:59:56.885246 7f2a8abd2700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f2a8abd2700 time 2014-10-21 16:59:56.883294
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
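
As background (my reading of the log, not something stated elsewhere in this ticket): the assert at common/HeartbeatMap.cc:79 comes from Ceph's internal heartbeat map. Each worker thread keeps pushing a pair of deadlines forward while it makes progress, and a checker thread periodically compares them to the current time: missing the short grace period only produces the "had timed out after ..." warnings above, while missing the much larger suicide grace aborts the whole daemon. The sketch below is a minimal, self-contained illustration of that pattern with hypothetical names and types; it is not the actual HeartbeatMap code.

#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdio>
#include <thread>

using Clock = std::chrono::steady_clock;

// One handle per worker thread (hypothetical stand-in for a heartbeat handle).
struct Handle {
    const char* name;
    std::atomic<Clock::rep> timeout{0};          // soft deadline: "had timed out after ..."
    std::atomic<Clock::rep> suicide_timeout{0};  // hard deadline: "hit suicide timeout"
};

// The worker calls this whenever it starts or finishes a unit of work,
// pushing both deadlines into the future.
void reset_timeouts(Handle& h, std::chrono::seconds grace,
                    std::chrono::seconds suicide_grace) {
    const auto now = Clock::now().time_since_epoch().count();
    h.timeout = now + std::chrono::duration_cast<Clock::duration>(grace).count();
    h.suicide_timeout =
        now + std::chrono::duration_cast<Clock::duration>(suicide_grace).count();
}

// The health-check thread calls this periodically (cf. "heartbeat_map is_healthy").
bool check(const Handle& h) {
    const auto now = Clock::now().time_since_epoch().count();
    bool healthy = true;
    if (h.timeout.load() && now > h.timeout.load()) {
        std::printf("'%s' had timed out\n", h.name);  // warning only
        healthy = false;
    }
    if (h.suicide_timeout.load() && now > h.suicide_timeout.load()) {
        std::printf("'%s' had suicide timed out\n", h.name);
        assert(!"hit suicide timeout");  // mirrors the FAILED assert in the log: abort the daemon
    }
    return healthy;
}

int main() {
    Handle h{"OSD::osd_op_tp thread"};
    // Grace / suicide-grace values mirror the 15s / 150s seen in the log above.
    reset_timeouts(h, std::chrono::seconds(15), std::chrono::seconds(150));

    // A stuck worker never resets its deadlines again: after 15s the checker
    // reports it unhealthy, and after 150s the assert would abort the process.
    std::this_thread::sleep_for(std::chrono::seconds(1));
    std::printf("healthy=%d\n", check(h) ? 1 : 0);  // still within grace here
    return 0;
}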

Over the last few days I've seen this pattern (one OSD dies, knocking down several others on different hosts) two or three times during backfill, but it is not clear how/when/why it begins -- for hours everything works well until, all of a sudden, an "OSD chain reaction" happens...
The first OSD (the one flooding its log with "initiating reconnect") is the only one with the LevelDB (keyvaluestore-dev) backend.
