Bug #23371
OSDs flap when the cluster network is taken down
Status: Open
Description
We have a 5-node cluster with 5 mons and 120 OSDs distributed equally across the nodes.
As part of our resiliency testing we took the cluster network of one node down. The OSDs on that node do not go down immediately; they flap: OSDs that are marked down keep booting back up. It takes too long for all of the OSDs to go down, and during this entire period Ceph is unable to write anything.
We see this issue only on Luminous.
Attaching ceph.conf.
Updated by Greg Farnum about 6 years ago
- Project changed from Ceph to RADOS
You tested this on a version prior to luminous and the behavior has changed?
This must be a result of some change to heartbeating and how it handles cluster versus public network results, but I don't think any of that has changed in several years...
Updated by Nokia ceph-users almost 6 years ago
We have not observed this behavior on Kraken.
Whenever the cluster interface is taken down, some of the OSDs that go down complain to the mon with 'log_channel(cluster) log [DBG] : map e88934 wrongly marked me down at e88934', so the active mon boots those OSDs back up.
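For the planned-outage test itself, one standard workaround (these are stock Ceph CLI flags, not something specific to this tracker, and untested on this cluster) is to set the noup flag before taking the network down, so OSDs that get marked down cannot boot back up, and then to mark the affected OSDs down by hand. A sketch that only prints the commands it would run; the OSD ids 9-11 are placeholders:

```shell
# Hypothetical helper: emit (not execute) the commands to freeze OSD
# state during a planned cluster-network outage, so that OSDs marked
# down do not boot back up and restart the flap.
freeze_osds() {
    echo "ceph osd set noup"       # prevent marked-down OSDs from coming back up
    for id in "$@"; do
        echo "ceph osd down $id"   # proactively mark each affected OSD down
    done
}

freeze_osds 9 10 11
```

Remember to run `ceph osd unset noup` once the network test is finished, otherwise recovering OSDs will stay down.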
cn6.chn6us1c1.cdn ~# ceph daemon /var/run/ceph/ceph-osd.9.asok config show | grep heart
"debug_heartbeatmap": "0/0",
"heartbeat_file": "",
"heartbeat_inject_failure": "0",
"heartbeat_interval": "5",
"mon_osd_adjust_heartbeat_grace": "false",
"osd_heartbeat_addr": "-",
"osd_heartbeat_grace": "25",
"osd_heartbeat_interval": "6",
"osd_heartbeat_min_healthy_ratio": "0.330000",
"osd_heartbeat_min_peers": "10",
"osd_heartbeat_min_size": "2000",
"osd_heartbeat_use_min_delay_socket": "false",
"osd_mon_heartbeat_interval": "30",
"rbd_mirror_leader_heartbeat_interval": "5",
"rbd_mirror_leader_max_missed_heartbeats": "2",
Can we change any of the heartbeat parameters to get out of this issue?
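One pair of options that may help here (a suggestion only, not verified against this cluster): osd_max_markdown_count and osd_max_markdown_period make an OSD shut itself down if it is marked down too many times within the period, instead of rejoining the cluster, which should cut the flapping short. A ceph.conf sketch with illustrative values:

```ini
[osd]
# If an OSD is marked down more than this many times within
# osd_max_markdown_period seconds, it exits instead of booting
# back up, ending the up/down flap.
osd max markdown count = 2
osd max markdown period = 600
```

These take effect on OSD restart; they can also be injected at runtime via the admin socket for a quicker test.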