OSD logs:-
2016-09-16 11:41:38.708334 7f9e52e24700 1 osd.19 5702 heartbeat_check: no reply from osd.12 since back 2016-09-16 11:40:12.495354 front 2016-09-16 11:40:12.495354 (cutoff 2016-09-16 11:41:18.708281)
-298> 2016-09-16 11:41:41.008734 7f9e52e24700 -1 osd.19 5702 heartbeat_check: no reply from osd.12 since back 2016-09-16 11:40:12.495354 front 2016-09-16 11:40:12.495354 (cutoff 2016-09-16 11:41:21.008732)
-297> 2016-09-16 11:41:44.509101 7f9e52e24700 -1 osd.19 5702 heartbeat_check: no reply from osd.12 since back 2016-09-16 11:40:12.495354 front 2016-09-16 11:40:12.495354 (cutoff 2016-09-16 11:41:24.509100)
-296> 2016-09-16 11:41:49.209487 7f9e52e24700 -1 osd.19 5702 heartbeat_check: no reply from osd.12 since back 2016-09-16 11:40:12.495354 front 2016-09-16 11:40:12.495354 (cutoff 2016-09-16 11:41:29.209484)
-295> 2016-09-16 11:41:50.309867 7f9e52e24700 -1 osd.19 5702 heartbeat_check: no reply from osd.12 since back 2016-09-16 11:40:12.495354 front 2016-09-16 11:40:12.495354 (cutoff 2016-09-16 11:41:30.309863)
-294> 2016-09-16 11:41:51.410233 7f9e52e24700 -1 osd.19 5702 heartbeat_check: no reply from osd.12 since back 2016-09-16 11:40:12.495354 front 2016-09-16 11:40:12.495354 (cutoff 2016-09-16 11:41:31.410231)
-293> 2016-09-16 11:42:05.560936 7f9e216ac700 0 -- 10.242.43.102:6807/3203977 >> 10.242.43.114:6824/15400084 pipe(0x7f9eb0320000 sd=489 :6807 s=0 pgs=0 cs=0 l=0 c=0x7f9ecebb5900).accept we reset (peer sent cseq 1), sending RESETSESSION
-292> 2016-09-16 11:42:09.148128 7f9e216ac700 0 -- 10.242.43.102:6807/3203977 >> 10.242.43.114:6824/15400084 pipe(0x7f9eb0320000 sd=489 :6807 s=2 pgs=4502 cs=1 l=0 c=0x7f9ecebb5900).reader missed message? skipped from seq 0 to 1244008781
-291> 2016-09-16 11:42:09.148639 7f9e216ac700 0 -- 10.242.43.102:6807/3203977 >> 10.242.43.114:6824/15400084 pipe(0x7f9ec41e8800 sd=489 :6807 s=0 pgs=0 cs=0 l=0 c=0x7f9ecebb5a80).accept we reset (peer sent cseq 2), sending RESETSESSION
-290> 2016-09-16 11:42:09.155042 7f9e216ac700 0 -- 10.242.43.102:6807/3203977 >> 10.242.43.114:6824/15400084 pipe(0x7f9ec41e8800 sd=489 :6807 s=2 pgs=4638 cs=1 l=0 c=0x7f9ecebb5a80).reader missed message? skipped from seq 0 to 203218437
-289> 2016-09-16 11:42:09.155509 7f9e216ac700 0 -- 10.242.43.102:6807/3203977 >> 10.242.43.114:6824/15400084 pipe(0x7f9efb7be800 sd=489 :6807 s=0 pgs=0 cs=0 l=0 c=0x7f9ea68d5000).accept we reset (peer sent cseq 2), sending RESETSESSION
2016-09-16 11:45:17.681698 7f9e5ee3c700 0 -- 10.242.43.102:6803/3203977 submit_message osd_op_reply(18415972 10000178daf.0000021b [write 0~4194304] v5700'18885 uv18885 ondisk = 0) v7 remote, 10.242.43.106:0/198074132, failed lossy con, dropping message 0x7f9ec2cc5c80
-138> 2016-09-16 11:45:17.681822 7f9e5ee3c700 0 -- 10.242.43.102:6803/3203977 submit_message osd_op_reply(18416016 10000178da9.0000021b [write 0~4194304] v5700'18886 uv18886 ondisk = 0) v7 remote, 10.242.43.106:0/198074132, failed lossy con, dropping message 0x7f9ec2cc3b80
-137> 2016-09-16 11:45:18.015692 7f9e7065f700 0 -- 10.242.43.102:6803/3203977 submit_message osd_op_reply(18418981 10000178daa.00000380 [write 0~4194304] v5702'19322 uv19322 ondisk = 0) v7 remote, 10.242.43.106:0/198074132, failed lossy con, dropping message 0x7f9f15bc3b80
2016-09-16 11:46:32.056220 7f9e3b7c9700 0 -- 10.242.43.102:6807/3203977 >> 10.242.43.116:6825/5154101 pipe(0x7f9eb021d400 sd=263 :58135 s=1 pgs=2950 cs=2 l=0 c=0x7f9eaa92d000).connect got RESETSESSION
-5> 2016-09-16 11:46:32.056265 7f9e4b4dc700 0 -- 10.242.43.102:6807/3203977 >> 10.242.43.103:6805/2659717 pipe(0x7f9eafe8e000 sd=233 :37457 s=1 pgs=6785 cs=2 l=0 c=0x7f9eafe57500).connect got RESETSESSION
-4> 2016-09-16 11:46:32.457419 7f9e357d9700 0 -- 10.242.43.102:6807/3203977 >> 10.242.43.116:6805/4158656 pipe(0x7f9eb6dc5400 sd=154 :52223 s=2 pgs=3268 cs=1 l=0 c=0x7f9ea9f4fb00).fault, initiating reconnect
-3> 2016-09-16 11:46:32.457777 7f9e3c023700 0 -- 10.242.43.102:6807/3203977 >> 10.242.43.116:6805/4158656 pipe(0x7f9eb6dc5400 sd=154 :52236 s=1 pgs=3268 cs=2 l=0 c=0x7f9ea9f4fb00).connect got RESETSESSION
-2> 2016-09-16 11:46:32.470500 7f9e357d9700 0 -- 10.242.43.102:6807/3203977 >> 10.242.43.116:6805/4158656 pipe(0x7f9eb6dc5400 sd=154 :52236 s=2 pgs=3269 cs=1 l=0 c=0x7f9ea9f4fb00).fault, initiating reconnect
-1> 2016-09-16 11:46:32.470963 7f9e3c023700 0 -- 10.242.43.102:6807/3203977 >> 10.242.43.116:6805/4158656 pipe(0x7f9eb6dc5400 sd=154 :52237 s=1 pgs=3269 cs=2 l=0 c=0x7f9ea9f4fb00).connect got RESETSESSION
0> 2016-09-16 11:47:53.371813 7f9e95757700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f9e95757700 time 2016-09-16 11:47:52.820218
common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")
ceph version 10.2.2-CEPH-1.4.0.10 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7f9e9b2759cb]
2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, long)+0x2b1) [0x7f9e9b1b9271]
3: (ceph::HeartbeatMap::is_healthy()+0xc6) [0x7f9e9b1b9a36]
4: (ceph::HeartbeatMap::check_touch_file()+0x17) [0x7f9e9b1ba227]
5: (CephContextServiceThread::entry()+0x14b) [0x7f9e9b28d11b]
6: (()+0x8184) [0x7f9e99317184]
7: (clone()+0x6d) [0x7f9e972f137d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 0 lockdep
0/ 0 context
0/ 0 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 0 buffer
0/ 0 timer
0/ 0 filer
0/ 1 striper
0/ 0 objecter
0/ 0 rados
0/ 0 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 0 journaler
0/ 5 objectcacher
0/ 0 client
0/ 0 osd
0/ 0 optracker
0/ 0 objclass
0/ 0 filestore
0/ 0 journal
0/ 0 ms
0/ 0 mon
0/ 0 monc
0/ 0 paxos
0/ 0 tp
0/ 0 auth
1/ 5 crypto
0/ 0 finisher
0/ 0 heartbeatmap
0/ 0 perfcounter
0/ 0 rgw
1/10 civetweb
1/ 5 javaclient
0/ 0 asok
0/ 0 throttle
0/ 0 refs
1/ 5 xio
1/ 5 compressor
1/ 5 newstore
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
1/ 5 kinetic
1/ 5 fuse
2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.19.log
--- end dump of recent events ---
2016-09-16 11:47:57.841882 7f9e52e24700 -1 osd.19 5713 heartbeat_check: no reply from osd.12 since back 2016-09-16 11:47:36.437667 front 2016-09-16 11:47:36.437667 (cutoff 2016-09-16 11:47:37.841880)
2016-09-16 11:48:03.742529 7f9e52e24700 -1 osd.19 5713 heartbeat_check: no reply from osd.12 since back 2016-09-16 11:47:36.437667 front 2016-09-16 11:47:36.437667 (cutoff 2016-09-16 11:47:43.742526)
2016-09-16 11:48:04.243002 7f9e52e24700 -1 osd.19 5713 heartbeat_check: no reply from osd.12 since back 2016-09-16 11:47:36.437667 front 2016-09-16 11:47:36.437667 (cutoff 2016-09-16 11:47:44.243000)
2016-09-16 11:48:08.943473 7f9e52e24700 -1 osd.19 5713 heartbeat_check: no reply from osd.12 since back 2016-09-16 11:47:36.437667 front 2016-09-16 11:47:36.437667 (cutoff 2016-09-16 11:47:48.943468)
2016-09-16 11:48:13.043954 7f9e52e24700 -1 osd.19 5713 heartbeat_check: no reply from osd.12 since back 2016-09-16 11:47:36.437667 front 2016-09-16 11:47:36.437667 (cutoff 2016-09-16 11:47:53.043928)
2016-09-16 11:48:44.025447 7f9e95757700 -1 *** Caught signal (Aborted) **
in thread 7f9e95757700 thread_name:service
ceph version 10.2.2-CEPH-1.4.0.10 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
1: (()+0x8fd7f2) [0x7f9e9b17c7f2]
2: (()+0x10330) [0x7f9e9931f330]
3: (gsignal()+0x37) [0x7f9e9722dc37]
4: (abort()+0x148) [0x7f9e97231028]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x265) [0x7f9e9b275ba5]
6: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, long)+0x2b1) [0x7f9e9b1b9271]
7: (ceph::HeartbeatMap::is_healthy()+0xc6) [0x7f9e9b1b9a36]
8: (ceph::HeartbeatMap::check_touch_file()+0x17) [0x7f9e9b1ba227]
9: (CephContextServiceThread::entry()+0x14b) [0x7f9e9b28d11b]
10: (()+0x8184) [0x7f9e99317184]
11: (clone()+0x6d) [0x7f9e972f137d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- begin dump of recent events ---
-5> 2016-09-16 11:47:57.841882 7f9e52e24700 -1 osd.19 5713 heartbeat_check: no reply from osd.12 since back 2016-09-16 11:47:36.437667 front 2016-09-16 11:47:36.437667 (cutoff 2016-09-16 11:47:37.841880)
-4> 2016-09-16 11:48:03.742529 7f9e52e24700 -1 osd.19 5713 heartbeat_check: no reply from osd.12 since back 2016-09-16 11:47:36.437667 front 2016-09-16 11:47:36.437667 (cutoff 2016-09-16 11:47:43.742526)
-3> 2016-09-16 11:48:04.243002 7f9e52e24700 -1 osd.19 5713 heartbeat_check: no reply from osd.12 since back 2016-09-16 11:47:36.437667 front 2016-09-16 11:47:36.437667 (cutoff 2016-09-16 11:47:44.243000)
-2> 2016-09-16 11:48:08.943473 7f9e52e24700 -1 osd.19 5713 heartbeat_check: no reply from osd.12 since back 2016-09-16 11:47:36.437667 front 2016-09-16 11:47:36.437667 (cutoff 2016-09-16 11:47:48.943468)
-1> 2016-09-16 11:48:13.043954 7f9e52e24700 -1 osd.19 5713 heartbeat_check: no reply from osd.12 since back 2016-09-16 11:47:36.437667 front 2016-09-16 11:47:36.437667 (cutoff 2016-09-16 11:47:53.043928)
0> 2016-09-16 11:48:44.025447 7f9e95757700 -1 *** Caught signal (Aborted) **
in thread 7f9e95757700 thread_name:service
ceph version 10.2.2-CEPH-1.4.0.10 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
1: (()+0x8fd7f2) [0x7f9e9b17c7f2]
2: (()+0x10330) [0x7f9e9931f330]
3: (gsignal()+0x37) [0x7f9e9722dc37]
4: (abort()+0x148) [0x7f9e97231028]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x265) [0x7f9e9b275ba5]
6: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, long)+0x2b1) [0x7f9e9b1b9271]
7: (ceph::HeartbeatMap::is_healthy()+0xc6) [0x7f9e9b1b9a36]
8: (ceph::HeartbeatMap::check_touch_file()+0x17) [0x7f9e9b1ba227]
9: (CephContextServiceThread::entry()+0x14b) [0x7f9e9b28d11b]
10: (()+0x8184) [0x7f9e99317184]
11: (clone()+0x6d) [0x7f9e972f137d]
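
To read the heartbeat_check lines above more easily, here is a minimal Python sketch that extracts, for each line, the reporting OSD, the silent peer, and how long the peer had been silent when the line was logged. The function name `silent_seconds` and the regular expression are my own assumptions, written against the line format shown in this log; they are not part of Ceph, and other Ceph versions may format these lines differently.

```python
import re
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"
TS = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+"

# Assumed layout, taken from the log lines in this post:
#   <log time> <thread id> <level> osd.N <epoch> heartbeat_check:
#   no reply from osd.M since back <time> front <time> (cutoff <time>)
HEARTBEAT_RE = re.compile(
    rf"(?P<logged>{TS}) \S+\s+-?\d+ "
    rf"(?P<reporter>osd\.\d+) \d+ heartbeat_check: "
    rf"no reply from (?P<peer>osd\.\d+) since back (?P<back>{TS})"
)


def silent_seconds(line):
    """Return (reporter, peer, seconds since the last 'back' reply),
    or None if the line is not a heartbeat_check line."""
    m = HEARTBEAT_RE.search(line)
    if m is None:
        return None
    logged = datetime.strptime(m["logged"], TS_FORMAT)
    back = datetime.strptime(m["back"], TS_FORMAT)
    return m["reporter"], m["peer"], (logged - back).total_seconds()


if __name__ == "__main__":
    sample = (
        "-298> 2016-09-16 11:41:41.008734 7f9e52e24700 -1 osd.19 5702 "
        "heartbeat_check: no reply from osd.12 since back "
        "2016-09-16 11:40:12.495354 front 2016-09-16 11:40:12.495354 "
        "(cutoff 2016-09-16 11:41:21.008732)"
    )
    print(silent_seconds(sample))
```

Run against the first group of lines, this shows osd.12 had been silent for roughly 88 seconds by 11:41:41. The cutoff printed in each line trails the log time by about 20 seconds, so the gap was already well past it.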