Bug #63636: Catastrophic failure when ECONNREFUSED is triggered

Added by Denis Krienbühl 5 months ago. Updated 5 months ago.

Status: New
Priority: Normal
Assignee: -
Category: OSD
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We recently had a serious outage at work after a host had a network problem. It was seemingly caused by the osd_fast_fail_on_connection_refused implementation introduced in the following PR, merged back in 2016:

https://github.com/ceph/ceph/pull/8558

As our cluster was fully offline for 15 minutes, we investigated this in detail. We think this is an unintended side effect of the change introduced above and should be addressed by the Ceph project. We are confident that we have mitigated the issue sufficiently on our end, but if we understand this correctly, it has the potential to cause havoc elsewhere.

Environment

We run a Ceph Pacific cluster with 9 NVMe-based and 6 SSD-based nodes, distributed across three racks.

Each host is attached to a management network (with a default route) and to a north/south bond pair that connects to the other OSDs and mons.

Outage

After a hardware issue on one of our SSD nodes, we replaced the NIC on the host and rebooted it. Shortly after the boot completed, OSDs all across the cluster were killed repeatedly until their systemd units gave up.

Suspecting some connection to the rebooted host, we shut it down after 5 minutes, at which point we were able to manually start the OSDs again, bringing the cluster back online:

2023-11-09 13:22:46
prod-bulk1-c-rma1 finished shutdown.

2023-11-09 13:34:05
prod-bulk1-c-rma1 starts sending logs again.

2023-11-09 13:34:07
OSDs on prod-bulk1-c-rma1 are starting.

2023-11-09 13:35:00
Other OSDs are force failed by the mons.

2023-11-09 13:39:14
prod-bulk1-c-rma1 is powered off.

2023-11-09 13:45:58
OSDs are started manually.

2023-11-09 13:50:41
Cluster is online again.

Cause

The OSDs on the prod-bulk1-c-rma1 host repeatedly reported OSDs from other hosts as immediately failed. The mon acted on these reports without taking any context into consideration.

Only by shutting down prod-bulk1-c-rma1 was it possible to start the other OSDs again, which otherwise would have been shut down in a loop indefinitely.

Protections

Protections on the mon are meant to analyze reports before acting on them, applying rules that prevent a single host or rack from causing all other OSDs to be stopped when, for example, a switch is broken.

See https://docs.ceph.com/en/latest/rados/configuration/mon-osd-interaction/#osds-report-down-osds

We use the default settings of osd_reporter_subtree_level=host and osd_min_down_reporters=2, which means that two different hosts are required to report an OSD as down, not just one.
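
For illustration, here is a minimal sketch of how these two settings are supposed to gate a failure report. This is a toy model, not the actual OSDMonitor code, and all names in it are made up:

// Toy model of the reporter check, NOT the real OSDMonitor implementation.
// It only illustrates the idea: a failure is acted on once enough distinct
// reporter subtrees (here: hosts) have reported the same target OSD.
#include <iostream>
#include <set>
#include <string>

struct FailureReports {
  // Distinct reporter subtrees (osd_reporter_subtree_level=host -> host names).
  std::set<std::string> reporter_hosts;

  // osd_min_down_reporters: how many distinct subtrees must agree.
  int min_down_reporters = 2;

  void add_report(const std::string& reporter_host) {
    reporter_hosts.insert(reporter_host);
  }

  bool should_mark_down() const {
    return static_cast<int>(reporter_hosts.size()) >= min_down_reporters;
  }
};

int main() {
  FailureReports reports;

  // Two OSDs on the same host report the target: still only one subtree.
  reports.add_report("prod-bulk1-c-rma1");
  reports.add_report("prod-bulk1-c-rma1");
  std::cout << "after one host: " << reports.should_mark_down() << "\n";  // 0

  // A second host confirms the failure: now the OSD may be marked down.
  reports.add_report("prod-nvme1-a-rma1");
  std::cout << "after two hosts: " << reports.should_mark_down() << "\n"; // 1
}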

This protection did not work in this outage.

A hint of why can be found in the log. Here, osd.437 on prod-bulk1-c-rma1 reports osd.433 on prod-bulk2-a-rma1 as immediately failed. This output is from the mon that was primary at the time, and it repeats like this over and over for numerous OSDs:

2023-11-09 13:34:55.502288449 +0100 CET prod-nvme1-a-rma1 ...
    osd.433 reported immediately failed by osd.437
2023-11-09 13:34:55.502335559 +0100 CET prod-nvme1-a-rma1 ...
    osd.433 failed (root=bulk,rack=bulk-a,host=prod-bulk2-a-rma1) (connection refused reported by osd.437)
2023-11-09 13:34:55.504156497 +0100 CET prod-nvme1-a-rma1 ...
    osd.433 reported immediately failed by osd.437

Usually, the mon collects failure reports and waits for confirmation from multiple OSDs and hosts before marking an OSD down. Here, OSDs were killed immediately.

Code

A search for "reported immediately" in the pacific branch of the Ceph source code yields the following code:

if (m->if_osd_failed()) {
    // calculate failure time
    utime_t now = ceph_clock_now();
    utime_t failed_since =
      m->get_recv_stamp() - utime_t(m->failed_for, 0);

    // add a report
    if (m->is_immediate()) {
      mon.clog->debug() << "osd." << m->get_target_osd()
             << " reported immediately failed by " 
             << m->get_orig_source();
      force_failure(target_osd, reporter);
      return true;
    }
    ...
}

https://github.com/ceph/ceph/blob/febfdd83a7838338033486826ef1fc9a5e8d588e/src/mon/OSDMonitor.cc#L3360

is_immediate is suspicious, and is defined as follows:

  bool is_immediate() const { 
    return flags & FLAG_IMMEDIATE; 
  }

https://github.com/ceph/ceph/blob/febfdd83a7838338033486826ef1fc9a5e8d588e/src/messages/MOSDFailure.h#L68

FLAG_IMMEDIATE is set exactly once in the whole Ceph codebase:

bool OSD::ms_handle_refused(Connection *con)
{
  if (!cct->_conf->osd_fast_fail_on_connection_refused)
    return false;

  auto session = ceph::ref_cast<Session>(con->get_priv());
  dout(2) << "ms_handle_refused con " << con << " session " << session.get() << dendl;
  if (!session)
    return false;
  int type = con->get_peer_type();
  // handle only OSD failures here
  if (monc && (type == CEPH_ENTITY_TYPE_OSD)) {
    OSDMapRef osdmap = get_osdmap();
    if (osdmap) {
      int id = osdmap->identify_osd_on_all_channels(con->get_peer_addr());
      if (id >= 0 && osdmap->is_up(id)) {
        // I'm cheating mon heartbeat grace logic, because we know it's not going
        // to respawn alone. +1 so we won't hit any boundary case.
        monc->send_mon_message(
          new MOSDFailure(
            monc->get_fsid(),
            id,
            osdmap->get_addrs(id),
            cct->_conf->osd_heartbeat_grace + 1,
            osdmap->get_epoch(),
            MOSDFailure::FLAG_IMMEDIATE | MOSDFailure::FLAG_FAILED
            ));
      }
    }
  }
  return true;
}

https://github.com/ceph/ceph/blob/febfdd83a7838338033486826ef1fc9a5e8d588e/src/osd/OSD.cc#L6434

Unless osd_fast_fail_on_connection_refused is set to false (the default is true), this connection refused handler tells the mon to throw caution to the wind and shut down the OSD right away.

By following the ms_handle_refused function up the call chain, it becomes apparent that this is triggered by ECONNREFUSED on a connection:

  void ms_deliver_handle_refused(Connection *con) {
    for (const auto& dispatcher : dispatchers) {
      if (dispatcher->ms_handle_refused(con))
        return;
    }
  }

https://github.com/ceph/ceph/blob/7325a8d317d1cef199306833e7829b171378266b/src/msg/Messenger.h#L808

    case D_CONN_REFUSED:
      msgr->ms_deliver_handle_refused(qitem.get_connection());
      break;

https://github.com/ceph/ceph/blob/7325a8d317d1cef199306833e7829b171378266b/src/msg/DispatchQueue.cc#L189

  void queue_refused(Connection *con) {
    std::lock_guard l{lock};
    if (stop)
      return;
    mqueue.enqueue_strict(
      0,
      CEPH_MSG_PRIO_HIGHEST,
      QueueItem(D_CONN_REFUSED, con));
    cond.notify_all();
  }

https://github.com/ceph/ceph/blob/7325a8d317d1cef199306833e7829b171378266b/src/msg/DispatchQueue.h#L189

        if (r == -ECONNREFUSED) {
          ldout(async_msgr->cct, 2)
              << __func__ << " connection refused!" << dendl;
          dispatch_queue->queue_refused(this);
        }

https://github.com/ceph/ceph/blob/7325a8d317d1cef199306833e7829b171378266b/src/msg/async/AsyncConnection.cc#L432

This means that the OSDs on prod-bulk1-c-rma1 were unable to reach other OSDs in a way that caused ECONNREFUSED at the socket level.
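
For context, the difference between a refused and a dropped connection is visible directly at the socket level. The following standalone demo is not Ceph code, and the address and port are arbitrary placeholders: a peer (or a gateway answering on its behalf) that actively rejects a connection attempt produces ECONNREFUSED immediately, whereas silently dropped packets only surface as a timeout much later:

// Standalone demo, not Ceph code. Address and port are placeholders.
// A rejected connection fails immediately with ECONNREFUSED; a silently
// dropped one hangs in connect() until the kernel gives up (ETIMEDOUT).
#include <arpa/inet.h>
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main() {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) {
    perror("socket");
    return 1;
  }

  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_port = htons(6800);                      // placeholder OSD port
  inet_pton(AF_INET, "192.0.2.10", &addr.sin_addr); // placeholder peer address

  if (connect(fd, (sockaddr*)&addr, sizeof(addr)) < 0) {
    if (errno == ECONNREFUSED) {
      // The case that ends up in OSD::ms_handle_refused and FLAG_IMMEDIATE.
      printf("connection refused\n");
    } else {
      // A silent drop typically shows up here as ETIMEDOUT and is handled
      // by the normal heartbeat grace logic instead.
      printf("connect failed: %s\n", strerror(errno));
    }
    close(fd);
    return 1;
  }

  printf("connected\n");
  close(fd);
  return 0;
}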

Connection Refused

Unbeknownst to us, the faulty host booted with a mistake in the network config that caused the south bond to be down. Once the OSDs on the host started, they were able to connect to the mons and OSDs via north, but not via south.

The host also carried a default route to a gateway that had recently started refusing traffic instead of dropping it. With the network configuration for south gone, packets meant for that network were sent via the default route to the gateway instead.

This triggered the code path above and forced the mons to take other OSDs offline.

Reproducing the Error

In our case, we were able to test this relatively easily in our lab environment: if we take south down with our default route in place, OSDs across the cluster start failing predictably. Once south returns, the other OSDs can be started again.

This could presumably be reproduced in other environments like this:

1. Have three hosts with OSDs.
2. Add an iptables rule to reject outgoing heartbeat traffic from one host to the others (instead of dropping it).
3. Wait, or maybe restart an OSD on the host.

Mitigation

The best way to ensure this does not happen is to drop traffic, not reject it.

We were able to solve this in our tests with iptables rules, but ended up changing the network configuration to no longer use a default route. If a network interface is down, the traffic now cannot go anywhere and is effectively dropped.

With this mitigation in place, the mon correctly detects that the problem is isolated to a single host and refuses to take the other OSDs offline.

One could also set osd_fast_fail_on_connection_refused to false, but in our tests crashing OSDs were then kept in the cluster for much longer, causing slow ops.

#1 - Updated by Konstantin Shalygin 5 months ago

  • Source set to Community (user)