Project

General

Profile

Bug #41743

Long heartbeat ping times on front interface seen, longest is 2237.999 msec (OSD_SLOW_PING_TIME_FRONT)

Added by Sage Weil 7 months ago. Updated 6 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature:

Description

"2019-09-09T22:25:11.794749+0000 mon.b (mon.0) 389 : cluster [WRN] Health check failed: Long heartbeat ping times on front interface seen, longest is 2237.999 msec (OSD_SLOW_PING_TIME_FRONT)" in cluster log

on description: rados/verify/{ceph.yaml clusters/{fixed-2.yaml openstack.yaml} d-thrash/none.yaml msgr-failures/few.yaml msgr/async-v2only.yaml objectstore/bluestore-comp.yaml rados.yaml tasks/rados_api_tests.yaml validater/valgrind.yaml}

/a/sage-2019-09-09_21:29:24-rados-wip-sage2-testing-2019-09-09-1115-distro-basic-smithi/4293235


Related issues

Related to RADOS - Feature #40640: Network ping monitoring Resolved 07/02/2019

History

#1 Updated by Neha Ojha 7 months ago

  • Assignee set to David Zafman

#2 Updated by David Zafman 7 months ago

The only OSDs involved are osd.6 and osd.0.

Slow heartbeat ping on front interface from osd.6 to osd.0 2237.999 msec

It was a spike during a 1 minute average which then got averaged into 5 minute and 15 minute results.

First noticed:
2019-09-09T22:25:11.341+0000 7ff29e65e700 20 mgr.server operator() health checks:

Last seen:
2019-09-09T22:26:13.363+0000 7ff29e65e700 20 mgr.server operator() health checks:

Strange short backup of pings that resolved at 2019-09-09T22:24:33 which could have created a 2 second average as follows:
Approx long pings 12, 9, 7, 7, 3 (total 38 seconds / 16 samples = 2.375)
2019-09-09T22:24:33.523+0000 11486700 25 osd.6 35 handle_osd_ping got all replies from osd.0 , erase pending ping(sent at 2019-09-09T22:24:21.846702+0000) and older pending ping(s)
2019-09-09T22:24:33.542+0000 11486700 25 osd.6 35 handle_osd_ping got all replies from osd.0 , erase pending ping(sent at 2019-09-09T22:24:24.862372+0000) and older pending ping(s)
2019-09-09T22:24:33.547+0000 11486700 25 osd.6 35 handle_osd_ping got all replies from osd.0 , erase pending ping(sent at 2019-09-09T22:24:26.163955+0000) and older pending ping(s)
2019-09-09T22:24:33.550+0000 11486700 25 osd.6 35 handle_osd_ping got all replies from osd.0 , erase pending ping(sent at 2019-09-09T22:24:26.789567+0000) and older pending ping(s)
2019-09-09T22:24:33.553+0000 11486700 25 osd.6 35 handle_osd_ping got all replies from osd.0 , erase pending ping(sent at 2019-09-09T22:24:30.929649+0000) and older pending ping(s)

This may show that a slow reconnect occurring after 2019-09-09T22:24:21 caused ping responses from osd.0 to be delayed to osd.6.
2019-09-09T22:24:21.637+0000 2e758700 10 osd.6 27 heartbeat_reset failed hb con 0x259c4670 for osd.0, reopening

#3 Updated by David Zafman 7 months ago

  • Status changed from 12 to In Progress

#4 Updated by David Zafman 7 months ago

  • Status changed from In Progress to 12

#6 Updated by David Zafman 6 months ago

I confused the front and back interface with a retransmit. The ports are the 2 interfaces.

At the ping receiving end osd.6 replied but had to retransmit

2019-09-11T22:37:02.301+0000 11485700 5 --2- v2:172.21.15.13:6815/12578 >> 172.21.15.118:0/13469 conn(0x23e34b10 0x146fb970 crc :-1 s=READY pgs=2 cs=0 l=1 rx=0 tx=0).write_message sending message m=0x1c9d1e90 seq=20 osd_ping(ping_reply e26 up_from 15 ping_stamp 2019-09-11T22:37:02.086496+0000/80.884347177s send_stamp 79.789461809s delta_ub -1.094885368s) v5

6 seconds later from new port 6814/12578 instead of 6815/12578

2019-09-11T22:37:08.960+0000 12487700 5 --2- v2:172.21.15.13:6814/12578 >> 172.21.15.118:0/13469 conn(0x146fe210 0x23e377c0 crc :-1 s=READY pgs=2 cs=0 l=1 rx=0 tx=0).write_message sending message m=0xed23000 seq=20 osd_ping(ping_reply e31 up_from 15 ping_stamp 2019-09-11T22:37:02.086496+0000/80.884347177s send_stamp 85.752623239s delta_ub 4.868276062s) v5

#7 Updated by David Zafman 6 months ago

This shows the front and back interface. I don't know which is which, but it already sent the second interface maybe 1/2 second after the first. Then didn't get a reply for 6 seconds.

front/back interface 1

2019-09-11T22:37:02.135+0000 12487700  5 --2- 172.21.15.118:0/13469 >> v2:172.21.15.13:6815/12578 conn(0x27cdd130 0x27cdf820 crc :-1 s=READY pgs=6 cs=0 l=1 rx=0 tx=0).write_message sending message m=0x208d4d20 seq=20 osd_ping(ping e26 up_from 14 ping_stamp 2019-09-11T22:37:02.086496+0000/80.884347177s send_stamp 80.884347177s delta_ub 1.278888951s) v5

2019-09-11T22:37:02.503+0000 12487700  5 --2- 172.21.15.118:0/13469 >> v2:172.21.15.13:6815/12578 conn(0x27cdd130 0x27cdf820 crc :-1 s=READ_MESSAGE_COMPLETE pgs=6 cs=0 l=1 rx=0 tx=0).handle_message received message m=0x20a48f20 seq=20 from=osd.6 type=70 osd_ping(ping_reply e26 up_from 15 ping_stamp 2019-09-11T22:37:02.086496+0000/80.884347177s send_stamp 79.789461809s delta_ub -1.094885368s) v5
2019-09-11T22:37:02.504+0000 12487700  1 -- 172.21.15.118:0/13469 <== osd.6 v2:172.21.15.13:6815/12578 20 ==== osd_ping(ping_reply e26 up_from 15 ping_stamp 2019-09-11T22:37:02.086496+0000/80.884347177s send_stamp 79.789461809s delta_ub -1.094885368s) v5 ==== 2033+0+0 (crc 0 0 0) 0x20a48f20 con 0x27cdd130

front/back interface 2

2019-09-11T22:37:03.145+0000 11485700  5 --2- 172.21.15.118:0/13469 >> v2:172.21.15.13:6814/12578 conn(0x27ce2ee0 0x27ce55d0 crc :-1 s=READY pgs=6 cs=0 l=1 rx=0 tx=0).write_message sending message m=0x20432e00 seq=20 osd_ping(ping e26 up_from 14 ping_stamp 2019-09-11T22:37:02.086496+0000/80.884347177s send_stamp 80.884347177s delta_ub 1.278888951s) v5

2019-09-11T22:37:09.988+0000 11485700  5 --2- 172.21.15.118:0/13469 >> v2:172.21.15.13:6814/12578 conn(0x27ce2ee0 0x27ce55d0 crc :-1 s=READ_MESSAGE_COMPLETE pgs=6 cs=0 l=1 rx=0 tx=0).handle_message received message m=0x209fa080 seq=20 from=osd.6 type=70 osd_ping(ping_reply e31 up_from 15 ping_stamp 2019-09-11T22:37:02.086496+0000/80.884347177s send_stamp 85.752623239s delta_ub 4.868276062s) v5
2019-09-11T22:37:09.988+0000 11485700  1 -- 172.21.15.118:0/13469 <== osd.6 v2:172.21.15.13:6814/12578 20 ==== osd_ping(ping_reply e31 up_from 15 ping_stamp 2019-09-11T22:37:02.086496+0000/80.884347177s send_stamp 85.752623239s delta_ub 4.868276062s) v5 ==== 2033+0+0 (crc 0 0 0) 0x209fa080 con 0x27ce2ee0

#8 Updated by David Zafman 6 months ago

This shows the send on osd.0 and receive at osd.6.

osd.0: 2019-09-11T22:37:02.135+0000 12487700  5 --2- 172.21.15.118:0/13469 >> v2:172.21.15.13:6815/12578 conn(0x27cdd130 0x27cdf820 crc :-1 s=READY pgs=6 cs=0 l=1 rx=0 tx=0).write_message sending message m=0x208d4d20 seq=20 osd_ping(ping e26 up_from 14 ping_stamp 2019-09-11T22:37:02.086496+0000/80.884347177s send_stamp 80.884347177s delta_ub 1.278888951s) v5
osd.6: 2019-09-11T22:37:02.235+0000 11485700  5 --2- v2:172.21.15.13:6815/12578 >> 172.21.15.118:0/13469 conn(0x23e34b10 0x146fb970 crc :-1 s=READ_MESSAGE_COMPLETE pgs=2 cs=0 l=1 rx=0 tx=0).handle_message received message m=0x1c9dc920 seq=20 from=osd.0 type=70 osd_ping(ping e26 up_from 14 ping_stamp 2019-09-11T22:37:02.086496+0000/80.884347177s send_stamp 80.884347177s delta_ub 1.278888951s) v5

osd.0: 2019-09-11T22:37:03.145+0000 11485700  5 --2- 172.21.15.118:0/13469 >> v2:172.21.15.13:6814/12578 conn(0x27ce2ee0 0x27ce55d0 crc :-1 s=READY pgs=6 cs=0 l=1 rx=0 tx=0).write_message sending message m=0x20432e00 seq=20 osd_ping(ping e26 up_from 14 ping_stamp 2019-09-11T22:37:02.086496+0000/80.884347177s send_stamp 80.884347177s delta_ub 1.278888951s) v5
osd.6: 2019-09-11T22:37:08.198+0000 12487700  5 --2- v2:172.21.15.13:6814/12578 >> 172.21.15.118:0/13469 conn(0x146fe210 0x23e377c0 crc :-1 s=READ_MESSAGE_COMPLETE pgs=2 cs=0 l=1 rx=0 tx=0).handle_message received message m=0x23b6ffe0 seq=20 from=osd.0 type=70 osd_ping(ping e26 up_from 14 ping_stamp 2019-09-11T22:37:02.086496+0000/80.884347177s send_stamp 80.884347177s delta_ub 1.278888951s) v5

#9 Updated by David Zafman 6 months ago

#10 Updated by David Zafman 6 months ago

  • Status changed from 12 to Resolved
  • Pull request ID set to 30714

This is already included in back porting of https://tracker.ceph.com/issues/40640

#11 Updated by Nathan Cutler 6 months ago

David Zafman wrote:

This is already included in back porting of https://tracker.ceph.com/issues/40640

Thanks, David.

Also available in: Atom PDF