Feature #20087

OSD: Add heartbeat message for Jumbo Frames(MTU 9000)

Added by Vikhyat Umrao 5 months ago. Updated about 2 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version: -
Start date: 05/25/2017
Due date:
% Done: 0%
Source: Support
Tags:
Backport: jewel
Reviewed:
User Impact:
Affected Versions:
Release: jewel
Needs Doc: No

Description

- OSD: Add heartbeat message for Jumbo Frames(MTU 9000)

- When jumbo frames are enabled on the cluster network, all interconnecting network gear must also have jumbo frames enabled. If any device in the path is misconfigured for jumbo frames, we see many issues such as stuck peering, slow requests, and backfilling not progressing.

- The problem is that we do not see heartbeat timeout messages in the OSD logs, because the heartbeat packets are smaller than 1500 bytes and so still get through.

- We checked the communication issue with the command below:

# ping -W 2 -I <interface> -M do -s <pkt size> <IP address>
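
For reference, a minimal sketch of how the `<pkt size>` value for the ping above can be chosen. It assumes plain IPv4 without IP options, so the ICMP payload is the MTU minus a 20-byte IP header and an 8-byte ICMP header. The function names, the interface `eth1`, and the address `192.0.2.10` are made up for illustration and are not part of Ceph.

```python
import shlex
from typing import List

IPV4_HEADER_BYTES = 20  # IPv4 header without options (assumption)
ICMP_HEADER_BYTES = 8   # ICMP echo request header


def icmp_payload_for_mtu(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    payload = mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES
    if payload < 0:
        raise ValueError("MTU is too small to carry an ICMP echo request")
    return payload


def build_ping_command(interface: str, address: str, mtu: int,
                       timeout_s: int = 2) -> List[str]:
    """Build the ping invocation from the description; '-M do' forbids fragmentation."""
    return [
        "ping", "-W", str(timeout_s), "-I", interface,
        "-M", "do", "-s", str(icmp_payload_for_mtu(mtu)), address,
    ]


if __name__ == "__main__":
    # Hypothetical interface and peer address, for illustration only.
    print(shlex.join(build_ping_command("eth1", "192.0.2.10", 9000)))
```

With an MTU of 9000 this gives a payload of 8972 bytes. If the ping fails at that size but succeeds at 1472 (the value for MTU 1500), some device in the path is probably not passing jumbo frames.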

Downstream feature request: https://bugzilla.redhat.com/show_bug.cgi?id=1455711


Related issues

Duplicates Ceph - Feature #18438: Configurable OSD Heartbeat packet size (MTU) New 01/06/2017
Copied to Ceph - Backport #20353: jewel: OSD: Add heartbeat message for Jumbo Frames(MTU 9000) Resolved

History

#1 Updated by Vikhyat Umrao 5 months ago

  • Description updated (diff)

#2 Updated by Vikhyat Umrao 5 months ago

There is another feature request for the same issue, Configurable OSD Heartbeat packet size (MTU): http://tracker.ceph.com/issues/18438

#3 Updated by Vikhyat Umrao 5 months ago

  • Subject changed from OSD: Add heartbeat message for Jumbo Frames(MTU 900) to OSD: Add heartbeat message for Jumbo Frames(MTU 9000)
  • Description updated (diff)

#4 Updated by Greg Farnum 5 months ago

  • Duplicates Feature #18438: Configurable OSD Heartbeat packet size (MTU) added

#5 Updated by Greg Farnum 5 months ago

I've seen stuff about this before but not been entirely clear on what's happening. Is the issue that the local box is configured for jumbo frames but the switch silently drops them? I'm wondering if there's something Ceph can query to know if it needs to do this validation.

I suppose we can inflate the heartbeat packets with a zero-filled bufferlist or something. Should we do that for every heartbeat? I suppose a 9KB packet that gets thrown away isn't that much wasted network bandwidth...
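
The padding idea can be sketched roughly as follows. This is an illustration in Python, not Ceph code: the OSD is written in C++ and would use its own buffer types, and `pad_heartbeat` and `min_size` are hypothetical names chosen for this example.

```python
def pad_heartbeat(body: bytes, min_size: int) -> bytes:
    """Append zero bytes so the message is at least min_size bytes long.

    Messages that are already large enough are returned unchanged.
    """
    missing = min_size - len(body)
    if missing <= 0:
        return body
    return body + bytes(missing)  # bytes(n) is n zero bytes


if __name__ == "__main__":
    # A small fake heartbeat body, padded to roughly one jumbo frame.
    padded = pad_heartbeat(b"heartbeat", 9000)
    print(len(padded))
```

Under this sketch, a `min_size` somewhat above 1500 would already exercise a path that silently drops anything larger than a standard frame, while a value close to 9000 would also catch intermediate limits, at the cost of more bytes per heartbeat.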

#6 Updated by Vikhyat Umrao 5 months ago

  • Description updated (diff)

Thanks, Greg, for your input.

Greg Farnum wrote:

I've seen stuff about this before but not been entirely clear on what's happening. Is the issue that the local box is configured for jumbo frames but the switch silently drops them? I'm wondering if there's something Ceph can query to know if it needs to do this validation.

Yes, this was the case. The local host had its MTU configured as 9000, there was a problem with the 9000 MTU configuration at the switch layer, and the OSD did not log any heartbeat failures.

I suppose we can inflate the heartbeat packets with a zero-filled bufferlist or something. Should we do that for every heartbeat? I suppose a 9KB packet that gets thrown away isn't that much wasted network bandwidth...

Yep. Yesterday I had a quick discussion with Josh before creating this feature request, and we agreed that a feature for periodically sending a larger request to detect this MTU issue would be great. Maybe we can send it only periodically? The `osd_heartbeat_interval` default is 6 seconds, so maybe we could use only the even-numbered packets?
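
A rough sketch of the "even-numbered packets" idea, in Python for illustration only. The names `should_send_large` and `large_probe_period` and the per-peer sequence counter are assumptions made for this example, not existing Ceph options or functions.

```python
def should_send_large(sequence: int, every_n: int = 2) -> bool:
    """Return True when heartbeat number `sequence` should be the large one.

    With every_n=2 this selects the even-numbered heartbeats.
    """
    if every_n < 1:
        raise ValueError("every_n must be at least 1")
    return sequence % every_n == 0


def large_probe_period(heartbeat_interval_s: float, every_n: int = 2) -> float:
    """Seconds between two large heartbeats for a given heartbeat interval."""
    if every_n < 1:
        raise ValueError("every_n must be at least 1")
    return heartbeat_interval_s * every_n


if __name__ == "__main__":
    # The 6-second interval is the default quoted above.
    print([seq for seq in range(6) if should_send_large(seq)])
    print(large_probe_period(6))
```

With the 6-second interval and every second heartbeat enlarged, a jumbo-frame problem would be probed once every 12 seconds per peer. Enlarging every heartbeat is simpler and, as noted earlier in the thread, probably does not waste much bandwidth.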

#7 Updated by Greg Farnum 5 months ago

  • Status changed from New to Testing
  • Assignee set to Greg Farnum

#8 Updated by Vikhyat Umrao 5 months ago

#10 Updated by Vikhyat Umrao 4 months ago

  • Backport set to jewel

#11 Updated by Greg Farnum 4 months ago

  • Status changed from Testing to Pending Backport

Note to backporters: consider whatever happens with https://github.com/ceph/ceph/pull/15727 !

#12 Updated by Nathan Cutler 4 months ago

  • Copied to Backport #20353: jewel: OSD: Add heartbeat message for Jumbo Frames(MTU 9000) added

#13 Updated by Vikhyat Umrao 4 months ago

Greg Farnum wrote:

Note to backporters: consider whatever happens with https://github.com/ceph/ceph/pull/15727 !

Thanks Greg. I have assigned the backport to myself. I will keep track of 15727 and act accordingly.

#14 Updated by Nathan Cutler about 2 months ago

  • Status changed from Pending Backport to Resolved
