Bug #55665


osd: osd_fast_fail_on_connection_refused will cause the mon to continuously elect

Added by jianwei zhang about 2 years ago. Updated almost 2 years ago.

Status:
Fix Under Review
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
rados
Component(RADOS):
OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The original issue is described at https://tracker.ceph.com/issues/55067

Problem description:

When a node is deliberately shut down for operation and maintenance, the osd/mon/mds processes on it exit automatically.
However, after the ceph-mon process exits, the other mons still have to wait for the lease timeout before a new election is triggered.

osd_fast_shutdown (true)
osd_fast_shutdown_notify_mon (true)
osd_mon_shutdown_timeout (5s): does this wait time need to be greater than the mon election timeout?

osd_fast_fail_on_connection_refused:
immediately mark OSDs as down once they refuse to accept connections
Is this option intended to make peer OSDs notice a crashed OSD immediately after its process exits?
commit 75074524fe15afff1374a6006628adab4f7abf7b
Author: Piotr Dałek <git@predictor.org.pl>
Date:   Sun May 22 13:08:48 2016 +0200

    OSD: Implement ms_handle_refused

    Added implementation of ms_handle_refused in OSD code, so it sends
    MOSDFailure message in case the peer connection fails with ECONNREFUSED
    *and* it is known to be up and new option "osd fast fail on connection
    refused" which enables or disables new behavior.

    Signed-off-by: Piotr Dałek <git@predictor.org.pl>
osd_fast_fail_on_connection_refused

If this option is enabled, crashed OSDs are marked down immediately by connected peers and MONs (assuming that the crashed OSD host survives). Disable it to restore old behavior, at the expense of possible long I/O stalls when OSDs crash in the middle of I/O operations.
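For illustration, here is a minimal, self-contained sketch of the behavior described above (FastFailModel and its members are hypothetical stand-ins, not the actual Ceph code): on ECONNREFUSED from a peer that the local map still shows as up, the failure is reported to the mon immediately instead of waiting for the heartbeat timeout.

#include <iostream>
#include <set>

struct FastFailModel {
    bool osd_fast_fail_on_connection_refused = true;  // the option in question
    std::set<int> up_osds;                            // OSDs the local map shows as up

    // Called when a connection attempt to peer 'osd' fails with ECONNREFUSED.
    // Returns true if an immediate failure report would be sent to the mon.
    bool on_connection_refused(int osd) {
        if (!osd_fast_fail_on_connection_refused)
            return false;                  // old behavior: rely on heartbeat timeout
        if (!up_osds.count(osd))
            return false;                  // peer not marked up, nothing to report
        std::cout << "send MOSDFailure(osd." << osd << ") to mon immediately\n";
        return true;
    }
};

int main() {
    FastFailModel m;
    m.up_osds = {1, 2, 3};
    m.on_connection_refused(2);   // peer crashed -> reported right away
    m.on_connection_refused(7);   // not marked up -> nothing sent
}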

Should we limit this?

When many osd processes exit at the same time, the peer OSDs of those exited OSDs receive ECONNREFUSED and each of them immediately reports an osd_failure for the target OSD to the mon.

The mon has only one processing thread.
When a flood of osd_failure messages occupies that thread, the mon cannot finish its election in time (collect timeout / lease timeout / accept timeout ...),
and the whole cluster stops working.
By contrast, on the osd-osd heartbeat timeout path a failed peer is only added to the failure queue,
and the OSD::tick thread sends the queued osd_failure reports to the mon every osd_mon_report_interval (5s).
In this way, it does not put huge pressure on the mon in an instant.
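As an illustration of that pacing, here is a small self-contained sketch (PacedReporter and its members are hypothetical stand-ins, not the actual OSD code): heartbeat failures are only queued, and a periodic tick flushes the queue to the mon, so reports go out at the tick interval rather than at the instant each failure is detected.

#include <chrono>
#include <iostream>
#include <map>

using Clock = std::chrono::steady_clock;

struct PacedReporter {
    std::map<int, Clock::time_point> failure_queue;  // osd id -> time first seen failed

    // Heartbeat timeout path: only record the failure, do not contact the mon yet.
    void note_failure(int osd) {
        failure_queue.emplace(osd, Clock::now());
    }

    // Called from the tick thread every osd_mon_report_interval seconds.
    void tick() {
        for (const auto& entry : failure_queue)
            std::cout << "send osd_failure(osd." << entry.first << ") to mon\n";
        failure_queue.clear();
    }
};

int main() {
    PacedReporter r;
    for (int osd = 0; osd < 100; ++osd)   // 100 peers fail at nearly the same time
        r.note_failure(osd);
    r.tick();                             // reports go out at tick time, not instantly
}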

I think the ECONNREFUSED path should also use the failure_pending queue, as send_failures() does, so that one OSD does not send osd_failure for the same target OSD to the mon multiple times.

Although this cannot fundamentally solve the problem of many OSDs sending osd_failure for the same target OSD to the mon at the same time, it can at least reduce the pressure on the mon.

void OSD::send_failures() {
    ceph_assert(ceph_mutex_is_locked(map_lock));
    ceph_assert(ceph_mutex_is_locked(mon_report_lock));
    std::lock_guard l(heartbeat_lock);
    utime_t now = ceph_clock_now();
    const auto osdmap = get_osdmap();
    // Drain failure_queue, but report each OSD to the mon only once:
    // anything already in failure_pending has been reported and is skipped.
    while (!failure_queue.empty()) {
        int osd = failure_queue.begin()->first;
        if (!failure_pending.count(osd)) {
            // How long this peer has been failing, from the queued timestamp.
            int failed_for = (int)(double)(now - failure_queue.begin()->second);
            monc->send_mon_message(
                new MOSDFailure(monc->get_fsid(), osd, osdmap->get_addrs(osd), failed_for, osdmap->get_epoch()));
            // Remember the report so it is not sent again while still pending.
            failure_pending[osd] = make_pair(failure_queue.begin()->second, osdmap->get_addrs(osd));
        }
        failure_queue.erase(osd);
    }
}
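The following is a hypothetical, self-contained sketch of the direction suggested above (DedupedReporter and its members are illustrative stand-ins, not the actual change in any pull request): let the ECONNREFUSED handler only enqueue the failed peer, and let the periodic flush use failure_pending so the same target OSD is reported to the mon only once until that report is resolved.

#include <iostream>
#include <set>

struct DedupedReporter {
    std::set<int> failure_queue;    // failed peers waiting to be reported
    std::set<int> failure_pending;  // peers already reported, not yet resolved

    // ECONNREFUSED handler: only enqueue; do not send MOSDFailure directly.
    void on_connection_refused(int osd) {
        failure_queue.insert(osd);
    }

    // Periodic flush (modeled on OSD::send_failures): skip peers already reported.
    void send_failures() {
        for (int osd : failure_queue) {
            if (failure_pending.count(osd))
                continue;                       // already reported, do not resend
            std::cout << "send osd_failure(osd." << osd << ") to mon\n";
            failure_pending.insert(osd);
        }
        failure_queue.clear();
    }
};

int main() {
    DedupedReporter r;
    r.on_connection_refused(5);   // first refusal
    r.on_connection_refused(5);   // repeated refusals of the same peer
    r.send_failures();            // exactly one report for osd.5
    r.on_connection_refused(5);
    r.send_failures();            // nothing sent: osd.5 is still pending
}

With this shape, a burst of refused connections to one crashed OSD produces a single osd_failure message per reporting OSD per interval, instead of one message per refusal.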