Bug #23371

open

OSDs flap when the cluster network is taken down

Added by Nokia ceph-users about 6 years ago. Updated almost 6 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We have a 5-node cluster with 5 mons and 120 OSDs, distributed equally across the nodes.

As part of our resiliency testing, we took down the cluster network on one node. The OSDs on that node are not marked down immediately; instead they flap: OSDs that have been marked down boot back up. It takes too long for all of the OSDs to go down, and during this entire period Ceph is unable to write anything.

We see this issue only in Luminous.

ceph.conf is attached.


Files

ceph.conf (3.01 KB) Nokia ceph-users, 03/15/2018 05:20 AM
Actions #1

Updated by Greg Farnum about 6 years ago

  • Project changed from Ceph to RADOS

You tested this on a version prior to luminous and the behavior has changed?

This must be a result of some change to heartbeating and how it handles cluster versus public network results, but I don't think any of that has changed in several years...

Actions #2

Updated by Nokia ceph-users almost 6 years ago

We have not observed this behavior in Kraken.

Whenever the cluster interface is taken down, a few of the OSDs that go down complain to the mon with 'log_channel(cluster) log [DBG] : map e88934 wrongly marked me down at e88934'. So the active mon boots that OSD back up.
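To put a number on the flapping, the "wrongly marked me down" messages can be counted per OSD log. The following is only a rough sketch: the regular expression is derived from the single message quoted above, the function name is made up for illustration, and the rest of the log line layout is not assumed.

```python
import re

# Pattern taken from the one message quoted in this report; real logs may
# carry extra prefixes (timestamps, thread ids), which search() tolerates.
WRONGLY_DOWN = re.compile(r"map e(\d+) wrongly marked me down at e(\d+)")


def wrongly_marked_down_epochs(lines):
    """Return the osdmap epoch of every 'wrongly marked me down' event.

    Hypothetical helper: each returned epoch corresponds to one occasion
    on which the OSD was marked down and then protested.
    """
    epochs = []
    for line in lines:
        match = WRONGLY_DOWN.search(line)
        if match:
            epochs.append(int(match.group(2)))
    return epochs


sample_lines = [
    "log_channel(cluster) log [DBG] : map e88934 wrongly marked me down at e88934",
    "an unrelated log line",
]
print(wrongly_marked_down_epochs(sample_lines))
```

Running this over each OSD's log on the affected node would show how many times each OSD flapped before it finally stayed down.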

cn6.chn6us1c1.cdn ~# ceph daemon /var/run/ceph/ceph-osd.9.asok config show | grep heart
"debug_heartbeatmap": "0/0",
"heartbeat_file": "",
"heartbeat_inject_failure": "0",
"heartbeat_interval": "5",
"mon_osd_adjust_heartbeat_grace": "false",
"osd_heartbeat_addr": "-",
"osd_heartbeat_grace": "25",
"osd_heartbeat_interval": "6",
"osd_heartbeat_min_healthy_ratio": "0.330000",
"osd_heartbeat_min_peers": "10",
"osd_heartbeat_min_size": "2000",
"osd_heartbeat_use_min_delay_socket": "false",
"osd_mon_heartbeat_interval": "30",
"rbd_mirror_leader_heartbeat_interval": "5",
"rbd_mirror_leader_max_missed_heartbeats": "2",
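
For comparing these values between a Kraken node and a Luminous node, it may be easier to filter the JSON that `config show` prints rather than grepping it. A minimal sketch, assuming the output is a flat JSON object of option names to string values (as the lines above suggest); the function names and the sample input are illustrative, not part of Ceph.

```python
import json


def heartbeat_settings(config_json: str) -> dict:
    """Keep only the options whose name contains 'heart'."""
    config = json.loads(config_json)
    return {name: value for name, value in config.items() if "heart" in name}


def intervals_within_grace(settings: dict) -> float:
    """How many ping intervals fit into the grace period."""
    grace = float(settings["osd_heartbeat_grace"])
    interval = float(settings["osd_heartbeat_interval"])
    return grace / interval


# Sample input: the three heartbeat values are copied from the output above;
# "some_other_option" is a made-up key to show that filtering works.
sample = json.dumps({
    "osd_heartbeat_grace": "25",
    "osd_heartbeat_interval": "6",
    "osd_mon_heartbeat_interval": "30",
    "some_other_option": "x",
})

settings = heartbeat_settings(sample)
print(sorted(settings))
print(round(intervals_within_grace(settings), 2))
```

With the values shown here, a peer has to miss roughly four consecutive 6-second pings before the 25-second grace expires. This only summarizes the configuration; it does not explain the flapping.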

Can we change any of the heartbeat parameters to get out of this issue?
