Bug #55665

open

osd: osd_fast_fail_on_connection_refused will cause the mon to continuously elect

Added by jianwei zhang almost 2 years ago. Updated almost 2 years ago.

Status:
Fix Under Review
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
rados
Component(RADOS):
OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The first issue is described at https://tracker.ceph.com/issues/55067

Problem Description:

When a node is deliberately shut down for operation and maintenance, the osd/mon/mds processes on it exit automatically.
However, after the ceph-mon process exits, the surviving mons still have to wait for the lease timeout before an election is triggered.

Relevant options:
osd_fast_shutdown (true)
osd_fast_shutdown_notify_mon (true)
osd_mon_shutdown_timeout (5s): does this wait time need to be longer than the mon election timeout?
osd_fast_fail_on_connection_refused:
immediately mark OSDs as down once they refuse to accept connections
Is this parameter meant to let the other OSDs recognize a crashed OSD immediately?
commit 75074524fe15afff1374a6006628adab4f7abf7b
Author: Piotr Dałek <git@predictor.org.pl>
Date:   Sun May 22 13:08:48 2016 +0200

    OSD: Implement ms_handle_refused

    Added implementation of ms_handle_refused in OSD code, so it sends
    MOSDFailure message in case the peer connection fails with ECONNREFUSED
    *and* it is known to be up and new option "osd fast fail on connection
    refused" which enables or disables new behavior.

    Signed-off-by: Piotr Dałek <git@predictor.org.pl>
osd_fast_fail_on_connection_refused

If this option is enabled, crashed OSDs are marked down immediately by connected peers and MONs (assuming that the crashed OSD host survives). Disable it to restore old behavior, at the expense of possible long I/O stalls when OSDs crash in the middle of I/O operations.

Should we limit it?

When many osd processes exit at once, the peer OSDs of the exited OSDs receive ECONNREFUSED and immediately report osd_failure for the target OSDs to the mon.

The mon has only one processing thread. When a large number of osd_failure messages occupy that thread, the mon cannot complete its election (collect timeout / lease timeout / accept timeout ...), and the whole cluster stops working.

By contrast, when an osd-osd heartbeat times out, the failed OSD is only added to the failure_queue, and the OSD::tick thread sends osd_failure to the mon every osd_mon_report_interval (5s). That path does not put huge pressure on the mon all at once.

I think the connection-refused path should also use the failure_pending queue, as send_failures() does, to avoid one OSD sending osd_failure for the same target OSD to the mon multiple times.

Although this cannot fundamentally solve the problem of many OSDs reporting osd_failure for the same target OSD at the same time, it at least reduces the pressure on the mon. For reference, the existing send_failures() path already deduplicates via failure_pending:

void OSD::send_failures() {
    ceph_assert(ceph_mutex_is_locked(map_lock));
    ceph_assert(ceph_mutex_is_locked(mon_report_lock));
    std::lock_guard l(heartbeat_lock);
    utime_t now = ceph_clock_now();
    const auto osdmap = get_osdmap();
    while (!failure_queue.empty()) {
        int osd = failure_queue.begin()->first;
        // Report each osd at most once: skip it if an MOSDFailure is already pending.
        if (!failure_pending.count(osd)) {
            int failed_for = (int)(double)(now - failure_queue.begin()->second);
            monc->send_mon_message(
                new MOSDFailure(monc->get_fsid(), osd, osdmap->get_addrs(osd), failed_for, osdmap->get_epoch()));
            failure_pending[osd] = make_pair(failure_queue.begin()->second, osdmap->get_addrs(osd));
        }
        failure_queue.erase(osd);
    }
}
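
As an illustration of this proposal (not the actual patch in the pull request), a minimal sketch of how the connection-refused path could consult the same failure_pending map before sending; the helper name should_report_refused_failure() is hypothetical and does not exist in the Ceph code:

// Sketch only: a hypothetical helper mirroring the dedupe check in send_failures(),
// so that ms_handle_refused() reports a given osd to the mon at most once.
bool OSD::should_report_refused_failure(int osd) {
    std::lock_guard l(heartbeat_lock);
    if (failure_pending.count(osd)) {
        // Already reported (by send_failures() or an earlier refused connection);
        // another MOSDFailure would only add load on the mon.
        return false;
    }
    // Remember the report so the heartbeat-timeout path also skips this osd.
    failure_pending[osd] = make_pair(ceph_clock_now(), get_osdmap()->get_addrs(osd));
    return true;
}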
Actions #1

Updated by jianwei zhang almost 2 years ago

My test results:

# cat osd_debug_osd_failure2.diff 
diff --git a/src/common/legacy_config_opts.h b/src/common/legacy_config_opts.h
index b9bf29f05b..6729263de2 100644
--- a/src/common/legacy_config_opts.h
+++ b/src/common/legacy_config_opts.h
@@ -757,6 +757,7 @@ OPTION(osd_op_log_threshold, OPT_INT) // how many op log messages to show in one
 OPTION(osd_backoff_on_unfound, OPT_BOOL)   // object unfound
 OPTION(osd_backoff_on_degraded, OPT_BOOL) // [mainly for debug?] object unreadable/writeable
 OPTION(osd_backoff_on_peering, OPT_BOOL)  // [debug] pg peering
+OPTION(osd_debug_osd_failure, OPT_BOOL) // count osd send osd_failure to mon messages
 OPTION(osd_debug_crash_on_ignored_backoff, OPT_BOOL) // crash osd if client ignores a backoff; useful for debugging
 OPTION(osd_debug_inject_dispatch_delay_probability, OPT_DOUBLE)
 OPTION(osd_debug_inject_dispatch_delay_duration, OPT_DOUBLE)
diff --git a/src/common/options.cc b/src/common/options.cc
index 1fddfc6836..4558f768fe 100644
--- a/src/common/options.cc
+++ b/src/common/options.cc
@@ -3487,6 +3487,10 @@ std::vector<Option> get_global_options() {
     .set_default(false)
     .set_description("Turn up debug levels during shutdown"),

+    Option("osd_debug_osd_failure", Option::TYPE_BOOL, Option::LEVEL_DEV)
+    .set_default(true)
+    .set_description("print log when osd send osd_failures to mon"),
+
     Option("osd_debug_crash_on_ignored_backoff", Option::TYPE_BOOL, Option::LEVEL_DEV)
     .set_default(false)
     .set_description(""),
diff --git a/src/osd/OSD.cc b/src/osd/OSD.cc
index b7216826ce..18f4bbc84b 100644
--- a/src/osd/OSD.cc
+++ b/src/osd/OSD.cc
@@ -6215,15 +6215,20 @@ bool OSD::ms_handle_refused(Connection *con)
       if (id >= 0 && osdmap->is_up(id)) {
     // I'm cheating mon heartbeat grace logic, because we know it's not going
     // to respawn alone. +1 so we won't hit any boundary case.
-    monc->send_mon_message(
-      new MOSDFailure(
+    MOSDFailure *m = new MOSDFailure(
         monc->get_fsid(),
         id,
         osdmap->get_addrs(id),
         cct->_conf->osd_heartbeat_grace + 1,
         osdmap->get_epoch(),
         MOSDFailure::FLAG_IMMEDIATE | MOSDFailure::FLAG_FAILED
-        ));
+        );
+
+    if (cct->_conf->osd_debug_osd_failure)
+      dout(0) << __func__ << " con=" << con << " session=" << session
+          << " MOSDFailure(FLAG_IMMEDIATE|FLAG_FAILED)=" << *m << dendl;
+
+    monc->send_mon_message(m);
       }
     }
   }
@@ -6697,13 +6702,19 @@ void OSD::send_failures()
     int osd = failure_queue.begin()->first;
     if (!failure_pending.count(osd)) {
       int failed_for = (int)(double)(now - failure_queue.begin()->second);
-      monc->send_mon_message(
-    new MOSDFailure(
+
+      MOSDFailure *m = new MOSDFailure(
       monc->get_fsid(),
       osd,
       osdmap->get_addrs(osd),
       failed_for,
-      osdmap->get_epoch()));
+      osdmap->get_epoch());
+
+      if (cct->_conf->osd_debug_osd_failure)
+    dout(0) << __func__ << " MOSDFailure(FLAG_FAILED)=" << *m << dendl;
+
+      monc->send_mon_message(m);
+
       failure_pending[osd] = make_pair(failure_queue.begin()->second,
                        osdmap->get_addrs(osd));
     }
@@ -6715,6 +6726,10 @@ void OSD::send_still_alive(epoch_t epoch, int osd, const entity_addrvec_t &addrs
 {
   MOSDFailure *m = new MOSDFailure(monc->get_fsid(), osd, addrs, 0, epoch,
                    MOSDFailure::FLAG_ALIVE);
+
+  if (cct->_conf->osd_debug_osd_failure)
+    dout(0) << __func__ << " MOSDFailure(FLAG_ALIVE)=" << *m << dendl;
+
   monc->send_mon_message(m);
 }
osd_fast_fail_on_connection_refused = true
reboot node2
[root@node1 bak-osd-zjw]# grep 2022-05-13T14:2 ceph-osd.*.log|grep MOSDFailure|wc -l
109150
[root@node3 bak-osd-zjw]# grep 2022-05-13T14:2 ceph-osd.*.log|grep MOSDFailure|wc -l
94852

reboot node2
[root@node1 bak-osd-zjw-2]# grep 2022-05-13T14:4 ceph-osd.*.log|grep MOSDFailure|wc -l
89249
[root@node3 bak-osd-zjw-2]# grep 2022-05-13T14:4 ceph-osd.*.log|grep MOSDFailure|wc -l
78628

osd_fast_fail_on_connection_refused = false
reboot node1
[root@node2 bak-osd-zjw-3]# grep 2022-05-13T15:2. ceph-osd.*.log|grep MOSDFailure|wc -l
1386
[root@node3 bak-osd-zjw-3]# grep 2022-05-13T15:2. ceph-osd.*.log|grep MOSDFailure|wc -l
808
Comparing the test results:
osd_fast_fail_on_connection_refused = true: roughly 160,000 to 200,000 osd_failure reports sent to the mon per node reboot (109,150 + 94,852 and 89,249 + 78,628 above)
osd_fast_fail_on_connection_refused = false: only about 2,000 reports sent (1,386 + 808)
That is about a 100x difference.
Actions #2

Updated by jianwei zhang almost 2 years ago

This is especially severe in scenarios where a single node runs both a mon and OSDs, where the number of OSDs per node is large, or where the mon election timeout has been shortened.

Actions #4

Updated by jianwei zhang almost 2 years ago

Problem:
Hardware environment: 3 nodes, 98 * 3 = 294 OSDs, 3 mons, 3 mgrs.
Comparison of the number of osd_failure reports sent with osd_fast_fail_on_connection_refused = true vs. false:

osd_fast_fail_on_connection_refused = true
reboot node2
[node1]# grep 2022-05-13T14:2 ceph-osd.*.log|grep osd_failure|wc -l
109150
[node3]# grep 2022-05-13T14:2 ceph-osd.*.log|grep osd_failure|wc -l
94852

osd_fast_fail_on_connection_refused = false
reboot node1
[node2]# grep 2022-05-13T15:2. ceph-osd.*.log|grep osd_failure|wc -l
1386
[node3]# grep 2022-05-13T15:2. ceph-osd.*.log|grep osd_failure|wc -l
808

Comparing the test results:
osd_fast_fail_on_connection_refused = true: roughly 160,000 to 200,000 osd_failure reports sent to the mon
osd_fast_fail_on_connection_refused = false: about 2,000 osd_failure reports sent to the mon
That is about a 100x difference.

With osd_fast_fail_on_connection_refused = true, this can lead to the mon holding elections continuously.

Solution:
When osd_fast_fail_on_connection_refused is true, we need to limit the number of osd_failure reports sent for the same osd.

Add an osd_debug_osd_failure option to log osd_failure messages for debugging.

Reuse the existing failure_pending queue:

Once a connection-refused osd_failure has been sent, it is recorded in failure_pending, and an osd-osd heartbeat-timeout osd_failure for that osd is no longer sent.

If an osd-osd heartbeat-timeout osd_failure has already been sent but a connection-refused osd_failure has not, the connection-refused osd_failure is still sent and overwrites the existing heartbeat-timeout entry in failure_pending. A sketch of this rule follows below.
Fix : https://tracker.ceph.com/issues/55665
Actions #5

Updated by Radoslaw Zarzynski almost 2 years ago

  • Backport set to quincy,pacific
  • Pull request ID set to 46273
Actions #6

Updated by Radoslaw Zarzynski almost 2 years ago

  • Backport deleted (quincy,pacific)
Actions #7

Updated by Ilya Dryomov almost 2 years ago

  • Target version deleted (v15.2.16)
Actions #8

Updated by Radoslaw Zarzynski almost 2 years ago

  • Status changed from New to Fix Under Review