Bug #23371
OSDs flap when the cluster network is taken down
Status: Open
Description
We have a 5-node cluster with 5 mons and 120 OSDs distributed equally across the nodes.
As part of our resiliency testing we took the cluster network of one node down. The OSDs on that node do not go down immediately; they flap: OSDs that are marked down keep booting back up. It takes too long for all of the OSDs to go down, and during this entire period Ceph is unable to write anything.
We see this issue only on Luminous.
Attaching ceph.conf.
Updated by Greg Farnum about 6 years ago
- Project changed from Ceph to RADOS
You tested this on a version prior to luminous and the behavior has changed?
This must be a result of some change to heartbeating and how it handles cluster versus public network results, but I don't think any of that has changed in several years...
Updated by Nokia ceph-users almost 6 years ago
We have not observed this behavior on Kraken.
Whenever the cluster interface is taken down, some of the OSDs that go down complain to the mon with 'log_channel(cluster) log [DBG] : map e88934 wrongly marked me down at e88934', so the active mon boots those OSDs back up.
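For the planned-outage test itself, one standard workaround (these are stock Ceph CLI flags, not something specific to this tracker, and untested on this cluster) is to set the noup flag before taking the network down, so OSDs that get marked down cannot boot back up, and then to mark the affected OSDs down by hand. A sketch that only prints the commands it would run; the OSD ids 9-11 are placeholders:

```shell
# Hypothetical helper: emit (not execute) the commands to freeze OSD
# state during a planned cluster-network outage, so that OSDs marked
# down do not boot back up and restart the flap.
freeze_osds() {
    echo "ceph osd set noup"       # prevent marked-down OSDs from coming back up
    for id in "$@"; do
        echo "ceph osd down $id"   # proactively mark each affected OSD down
    done
}

freeze_osds 9 10 11
```

Remember to run `ceph osd unset noup` once the network test is finished, otherwise recovering OSDs will stay down.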
cn6.chn6us1c1.cdn ~# ceph daemon /var/run/ceph/ceph-osd.9.asok config show | grep heart
"debug_heartbeatmap": "0/0",
"heartbeat_file": "",
"heartbeat_inject_failure": "0",
"heartbeat_interval": "5",
"mon_osd_adjust_heartbeat_grace": "false",
"osd_heartbeat_addr": "-",
"osd_heartbeat_grace": "25",
"osd_heartbeat_interval": "6",
"osd_heartbeat_min_healthy_ratio": "0.330000",
"osd_heartbeat_min_peers": "10",
"osd_heartbeat_min_size": "2000",
"osd_heartbeat_use_min_delay_socket": "false",
"osd_mon_heartbeat_interval": "30",
"rbd_mirror_leader_heartbeat_interval": "5",
"rbd_mirror_leader_max_missed_heartbeats": "2",
Can we change any of the heartbeat parameters to get out of this issue?
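One pair of options that may help here (a suggestion only, not verified against this cluster): osd_max_markdown_count and osd_max_markdown_period make an OSD shut itself down if it is marked down too many times within the period, instead of rejoining the cluster, which should cut the flapping short. A ceph.conf sketch with illustrative values:

```ini
[osd]
# If an OSD is marked down more than this many times within
# osd_max_markdown_period seconds, it exits instead of booting
# back up, ending the up/down flap.
osd max markdown count = 2
osd max markdown period = 600
```

These take effect on OSD restart; they can also be injected at runtime via the admin socket for a quicker test.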