Bug #41743
closedLong heartbeat ping times on front interface seen, longest is 2237.999 msec (OSD_SLOW_PING_TIME_FRONT)
0%
Description
"2019-09-09T22:25:11.794749+0000 mon.b (mon.0) 389 : cluster [WRN] Health check failed: Long heartbeat ping times on front interface seen, longest is 2237.999 msec (OSD_SLOW_PING_TIME_FRONT)" in cluster log
on description: rados/verify/{ceph.yaml clusters/{fixed-2.yaml openstack.yaml} d-thrash/none.yaml msgr-failures/few.yaml msgr/async-v2only.yaml objectstore/bluestore-comp.yaml rados.yaml tasks/rados_api_tests.yaml validater/valgrind.yaml}
/a/sage-2019-09-09_21:29:24-rados-wip-sage2-testing-2019-09-09-1115-distro-basic-smithi/4293235
Updated by David Zafman over 4 years ago
The only OSDs involved are osd.6 and osd.0.
Slow heartbeat ping on front interface from osd.6 to osd.0 2237.999 msec
It was a spike during a 1 minute average which then got averaged into 5 minute and 15 minute results.
First noticed:
2019-09-09T22:25:11.341+0000 7ff29e65e700 20 mgr.server operator() health checks:
Last seen:
2019-09-09T22:26:13.363+0000 7ff29e65e700 20 mgr.server operator() health checks:
Strange short backup of pings that resolved at 2019-09-09T22:24:33 which could have created a 2 second average as follows:
Approx long pings 12, 9, 7, 7, 3 (total 38 seconds / 16 samples = 2.375)
2019-09-09T22:24:33.523+0000 11486700 25 osd.6 35 handle_osd_ping got all replies from osd.0 , erase pending ping(sent at 2019-09-09T22:24:21.846702+0000) and older pending ping(s)
2019-09-09T22:24:33.542+0000 11486700 25 osd.6 35 handle_osd_ping got all replies from osd.0 , erase pending ping(sent at 2019-09-09T22:24:24.862372+0000) and older pending ping(s)
2019-09-09T22:24:33.547+0000 11486700 25 osd.6 35 handle_osd_ping got all replies from osd.0 , erase pending ping(sent at 2019-09-09T22:24:26.163955+0000) and older pending ping(s)
2019-09-09T22:24:33.550+0000 11486700 25 osd.6 35 handle_osd_ping got all replies from osd.0 , erase pending ping(sent at 2019-09-09T22:24:26.789567+0000) and older pending ping(s)
2019-09-09T22:24:33.553+0000 11486700 25 osd.6 35 handle_osd_ping got all replies from osd.0 , erase pending ping(sent at 2019-09-09T22:24:30.929649+0000) and older pending ping(s)
This may show that a slow reconnect occurring after 2019-09-09T22:24:21 caused ping responses from osd.0 to be delayed to osd.6.
2019-09-09T22:24:21.637+0000 2e758700 10 osd.6 27 heartbeat_reset failed hb con 0x259c4670 for osd.0, reopening
Updated by David Zafman over 4 years ago
Reproduced several times with debug_ms = 20
http://pulpito.ceph.com/dzafman-2019-09-11_15:28:37-rados-wip-zafman-testing3-distro-basic-smithi/
I have been looking at the single job here: http://pulpito.ceph.com/dzafman-2019-09-11_15:06:05-rados-wip-zafman-testing3-distro-basic-smithi/
Updated by David Zafman over 4 years ago
I confused the front and back interface with a retransmit. The ports are the 2 interfaces.
At the ping receiving end osd.6 replied but had to retransmit
2019-09-11T22:37:02.301+0000 11485700 5 --2- v2:172.21.15.13:6815/12578 >> 172.21.15.118:0/13469 conn(0x23e34b10 0x146fb970 crc :-1 s=READY pgs=2 cs=0 l=1 rx=0 tx=0).write_message sending message m=0x1c9d1e90 seq=20 osd_ping(ping_reply e26 up_from 15 ping_stamp 2019-09-11T22:37:02.086496+0000/80.884347177s send_stamp 79.789461809s delta_ub -1.094885368s) v5
6 seconds later from new port 6814/12578 instead of 6815/12578
2019-09-11T22:37:08.960+0000 12487700 5 --2- v2:172.21.15.13:6814/12578 >> 172.21.15.118:0/13469 conn(0x146fe210 0x23e377c0 crc :-1 s=READY pgs=2 cs=0 l=1 rx=0 tx=0).write_message sending message m=0xed23000 seq=20 osd_ping(ping_reply e31 up_from 15 ping_stamp 2019-09-11T22:37:02.086496+0000/80.884347177s send_stamp 85.752623239s delta_ub 4.868276062s) v5
Updated by David Zafman over 4 years ago
This shows the front and back interface. I don't know which is which, but it already sent the second interface maybe 1/2 second after the first. Then didn't get a reply for 6 seconds.
front/back interface 1
2019-09-11T22:37:02.135+0000 12487700 5 --2- 172.21.15.118:0/13469 >> v2:172.21.15.13:6815/12578 conn(0x27cdd130 0x27cdf820 crc :-1 s=READY pgs=6 cs=0 l=1 rx=0 tx=0).write_message sending message m=0x208d4d20 seq=20 osd_ping(ping e26 up_from 14 ping_stamp 2019-09-11T22:37:02.086496+0000/80.884347177s send_stamp 80.884347177s delta_ub 1.278888951s) v5 2019-09-11T22:37:02.503+0000 12487700 5 --2- 172.21.15.118:0/13469 >> v2:172.21.15.13:6815/12578 conn(0x27cdd130 0x27cdf820 crc :-1 s=READ_MESSAGE_COMPLETE pgs=6 cs=0 l=1 rx=0 tx=0).handle_message received message m=0x20a48f20 seq=20 from=osd.6 type=70 osd_ping(ping_reply e26 up_from 15 ping_stamp 2019-09-11T22:37:02.086496+0000/80.884347177s send_stamp 79.789461809s delta_ub -1.094885368s) v5 2019-09-11T22:37:02.504+0000 12487700 1 -- 172.21.15.118:0/13469 <== osd.6 v2:172.21.15.13:6815/12578 20 ==== osd_ping(ping_reply e26 up_from 15 ping_stamp 2019-09-11T22:37:02.086496+0000/80.884347177s send_stamp 79.789461809s delta_ub -1.094885368s) v5 ==== 2033+0+0 (crc 0 0 0) 0x20a48f20 con 0x27cdd130
front/back interface 2
2019-09-11T22:37:03.145+0000 11485700 5 --2- 172.21.15.118:0/13469 >> v2:172.21.15.13:6814/12578 conn(0x27ce2ee0 0x27ce55d0 crc :-1 s=READY pgs=6 cs=0 l=1 rx=0 tx=0).write_message sending message m=0x20432e00 seq=20 osd_ping(ping e26 up_from 14 ping_stamp 2019-09-11T22:37:02.086496+0000/80.884347177s send_stamp 80.884347177s delta_ub 1.278888951s) v5 2019-09-11T22:37:09.988+0000 11485700 5 --2- 172.21.15.118:0/13469 >> v2:172.21.15.13:6814/12578 conn(0x27ce2ee0 0x27ce55d0 crc :-1 s=READ_MESSAGE_COMPLETE pgs=6 cs=0 l=1 rx=0 tx=0).handle_message received message m=0x209fa080 seq=20 from=osd.6 type=70 osd_ping(ping_reply e31 up_from 15 ping_stamp 2019-09-11T22:37:02.086496+0000/80.884347177s send_stamp 85.752623239s delta_ub 4.868276062s) v5 2019-09-11T22:37:09.988+0000 11485700 1 -- 172.21.15.118:0/13469 <== osd.6 v2:172.21.15.13:6814/12578 20 ==== osd_ping(ping_reply e31 up_from 15 ping_stamp 2019-09-11T22:37:02.086496+0000/80.884347177s send_stamp 85.752623239s delta_ub 4.868276062s) v5 ==== 2033+0+0 (crc 0 0 0) 0x209fa080 con 0x27ce2ee0
Updated by David Zafman over 4 years ago
This shows the send on osd.0 and receive at osd.6.
osd.0: 2019-09-11T22:37:02.135+0000 12487700 5 --2- 172.21.15.118:0/13469 >> v2:172.21.15.13:6815/12578 conn(0x27cdd130 0x27cdf820 crc :-1 s=READY pgs=6 cs=0 l=1 rx=0 tx=0).write_message sending message m=0x208d4d20 seq=20 osd_ping(ping e26 up_from 14 ping_stamp 2019-09-11T22:37:02.086496+0000/80.884347177s send_stamp 80.884347177s delta_ub 1.278888951s) v5 osd.6: 2019-09-11T22:37:02.235+0000 11485700 5 --2- v2:172.21.15.13:6815/12578 >> 172.21.15.118:0/13469 conn(0x23e34b10 0x146fb970 crc :-1 s=READ_MESSAGE_COMPLETE pgs=2 cs=0 l=1 rx=0 tx=0).handle_message received message m=0x1c9dc920 seq=20 from=osd.0 type=70 osd_ping(ping e26 up_from 14 ping_stamp 2019-09-11T22:37:02.086496+0000/80.884347177s send_stamp 80.884347177s delta_ub 1.278888951s) v5 osd.0: 2019-09-11T22:37:03.145+0000 11485700 5 --2- 172.21.15.118:0/13469 >> v2:172.21.15.13:6814/12578 conn(0x27ce2ee0 0x27ce55d0 crc :-1 s=READY pgs=6 cs=0 l=1 rx=0 tx=0).write_message sending message m=0x20432e00 seq=20 osd_ping(ping e26 up_from 14 ping_stamp 2019-09-11T22:37:02.086496+0000/80.884347177s send_stamp 80.884347177s delta_ub 1.278888951s) v5 osd.6: 2019-09-11T22:37:08.198+0000 12487700 5 --2- v2:172.21.15.13:6814/12578 >> 172.21.15.118:0/13469 conn(0x146fe210 0x23e377c0 crc :-1 s=READ_MESSAGE_COMPLETE pgs=2 cs=0 l=1 rx=0 tx=0).handle_message received message m=0x23b6ffe0 seq=20 from=osd.0 type=70 osd_ping(ping e26 up_from 14 ping_stamp 2019-09-11T22:37:02.086496+0000/80.884347177s send_stamp 80.884347177s delta_ub 1.278888951s) v5
Updated by David Zafman over 4 years ago
- Related to Feature #40640: Network ping monitoring added
Updated by David Zafman over 4 years ago
- Status changed from 12 to Resolved
- Pull request ID set to 30714
This is already included in back porting of https://tracker.ceph.com/issues/40640
Updated by Nathan Cutler over 4 years ago
David Zafman wrote:
This is already included in back porting of https://tracker.ceph.com/issues/40640
Thanks, David.