Bug #22045: OSDMonitor: osd down by monitor is delayed - RADOS - Ceph

Added by Tang Jin over 6 years ago. Updated over 6 years ago.

Status:

New

Priority:

Normal

Assignee:

Category:

Target version:

Ceph - v13.0.0

% Done:

Source:

Community (user)

Tags:

Backport:

Regression:

Severity:

2 - major

Reviewed:

Affected Versions:

ceph-qa-suite:

Component(RADOS):

Monitor

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

The cluster has 3 hosts, and each host runs a monitor, a mgr, and several OSDs. All options are at their defaults except mon_osd_min_down_reporters, which is set to 3, so 'osd check failure' cannot mark OSDs down.

Now unplug the public network cable of one host, and watch "ceph -s" to see when the OSDs on that host are marked down.

Result: some OSDs are marked down after the first mon_osd_report_timeout interval, and the rest are marked down only after a second mon_osd_report_timeout interval. They are not marked down at the same time.

Updated by Tang Jin over 6 years ago

After the new election, the leader monitor does not change. Through resend_routed_requests, the leader receives many OSDBeacons from the other peon monitor for some of the down OSDs, whereas the OSDBeacons of the remaining down OSDs are never received again. This splits the OSDs into two groups that are not marked down at the same time.
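To make the effect easier to reason about, here is a toy model in Python. It is only a sketch of one possible reading of the behaviour described above, not the OSDMonitor code: it assumes the leader remembers the last time it saw a beacon from each OSD and treats an OSD as timed out once that time is older than mon_osd_report_timeout. The function names, the timeout value of 900 seconds, and the timestamps are all hypothetical and chosen for illustration.

```python
# Toy model only: not taken from the Ceph source tree.
REPORT_TIMEOUT = 900  # illustrative value for mon_osd_report_timeout, in seconds


def timeout_deadlines(last_beacon_seen, timeout=REPORT_TIMEOUT):
    """Map each OSD to the earliest time at which it counts as timed out."""
    return {osd: seen + timeout for osd, seen in last_beacon_seen.items()}


def group_by_deadline(deadlines):
    """Group OSDs that share the same deadline, ordered by deadline."""
    groups = {}
    for osd, deadline in deadlines.items():
        groups.setdefault(deadline, []).append(osd)
    return dict(sorted(groups.items()))


# Hypothetical timeline: the cable is pulled at t=0. For osd.2 and osd.3 a
# beacon is replayed to the leader at t=20 (an assumed moment after the
# election), so the leader's "last seen" time for them is later.
last_beacon_seen = {"osd.0": 0, "osd.1": 0, "osd.2": 20, "osd.3": 20}

deadlines = timeout_deadlines(last_beacon_seen)
groups = group_by_deadline(deadlines)

for deadline, osds in groups.items():
    print(f"t={deadline}s: {', '.join(sorted(osds))} eligible to be marked down")
```

In this model the OSDs on the disconnected host end up in two groups with different deadlines, even though they all lost connectivity at the same moment. The size of the gap depends on when the replayed beacons are processed, which the sketch simply takes as an input.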
