Feature #47519
Gracefully detect MTU mismatch
Status: New
Priority: Normal
Assignee: -
Category: Administration/Usability
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Reviewed:
Affected Versions:
Component(RADOS):
Pull request ID:
Description
This morning I ran into an issue with flapping OSDs. The cluster_network NICs were set to MTU=9000, but the switch ports were not configured for that.
Here's some example health output:
[root@extensa003 ~]# ceph -s
  cluster:
    id:     802e797b-1ad1-4ddc-9363-1dfb16507a26
    health: HEALTH_WARN
            Long heartbeat ping times on back interface seen, longest is 567363.257 msec
            Long heartbeat ping times on front interface seen, longest is 566150.752 msec
            Reduced data availability: 26 pgs inactive, 1 pg down, 24 pgs peering
            1301 slow ops, oldest one blocked for 5757 sec, mon.extensa003 has slow ops
            clock skew detected on mon.extensa004, mon.extensa005

  services:
    mon: 3 daemons, quorum extensa003,extensa004,extensa005 (age 5h)
    mgr: extensa005(active, since 5h), standbys: extensa004, extensa003
    osd: 96 osds: 96 up (since 0.981233s), 96 in (since 7h)

  data:
    pools:   1 pools, 32 pgs
    objects: 0 objects, 0 B
    usage:   7.2 TiB used, 698 TiB / 706 TiB avail
    pgs:     87.500% pgs not active
             26 peering
             4  active+clean
             1  activating+undersized
             1  down
As a suggestion, it might be helpful for the OSDs to try to ping each other using different packet sizes and warn the user if there might be an MTU mismatch.
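To make the idea concrete, here is a rough sketch of what such a probe could look like when run by hand from one node against a peer on the cluster network. This is not how Ceph's heartbeats work and not a proposed patch. It assumes IPv4 and the Linux iputils ping (-M do to forbid fragmentation, -s for the payload size), and the names ping_probe, largest_working_mtu and check_mtu are made up for illustration. It also assumes that if a packet of a given size gets through, every smaller packet does too.

```python
import subprocess

# 20-byte IPv4 header + 8-byte ICMP echo header (IPv4 without options assumed)
IPV4_ICMP_OVERHEAD = 28


def ping_probe(host, payload):
    """Return True if one don't-fragment ping with `payload` bytes gets a reply.

    Assumes Linux iputils ping: -M do sets the DF bit, -s sets the payload size.
    """
    cmd = ["ping", "-M", "do", "-c", "1", "-W", "1", "-s", str(payload), host]
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0


def largest_working_mtu(probe, low=576, high=9000):
    """Binary-search the largest MTU in [low, high] that the path carries.

    `probe(payload)` must return True when a packet with that ICMP payload
    size gets through unfragmented. Returns None if even `low` fails.
    """
    low = min(low, high)
    if not probe(low - IPV4_ICMP_OVERHEAD):
        return None
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always progresses
        if probe(mid - IPV4_ICMP_OVERHEAD):
            low = mid
        else:
            high = mid - 1
    return low


def check_mtu(probe, configured_mtu):
    """Compare the MTU configured on the NIC with what the path carries."""
    found = largest_working_mtu(probe, high=configured_mtu)
    if found is None:
        return "peer unreachable even with small packets"
    if found < configured_mtu:
        return (f"possible MTU mismatch: configured {configured_mtu}, "
                f"path carries {found}")
    return "ok"
```

With a placeholder peer address, a call would look like check_mtu(lambda p: ping_probe("192.0.2.10", p), 9000). In a setup like the one described above (NICs at 9000, switch ports at a smaller MTU), it would be expected to report a possible mismatch rather than "ok".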
History
#1 Updated by Josh Durgin over 3 years ago
- Project changed from Ceph to RADOS
- Category changed from OSD to Administration/Usability
Good suggestion. The pings already use large sizes, but we don't specifically warn about MTU right now; you just end up with OSDs marked down.