Project

General

Profile

Actions

Bug #55851

closed

Assert in Ceph messenger

Added by Stefan Kooman almost 2 years ago. Updated over 1 year ago.

Status:
Resolved
Priority:
Normal
Category:
-
Target version:
% Done:

100%

Source:
Community (user)
Tags:
Backport:
pacific,quincy
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):

Description

Context:

Ceph balancer was busy balancing: PG remaps

{
    "assert_condition": "!m_descs.empty()",
    "assert_file": "/build/ceph-15.2.16/src/msg/async/frames_v2.cc",
    "assert_func": "uint64_t ceph::msgr::v2::FrameAssembler::get_frame_onwire_len() const",
    "assert_line": 90,
    "assert_msg": "/build/ceph-15.2.16/src/msg/async/frames_v2.cc: In function 'uint64_t ceph::msgr::v2::FrameAssembler::get_frame_onwire_len() const' thread 7f44787c5700 time 2022-06-03T11:35:09.256910+0200\n/build/ceph-15.2.16/src/msg/async/frames_v2.cc: 90: FAILED ceph_assert(!m_descs.empty())\n",
    "assert_thread_name": "msgr-worker-2",
    "backtrace": [
        "(()+0x143c0) [0x7f447a6be3c0]",
        "(gsignal()+0xcb) [0x7f447a14003b]",
        "(abort()+0x12b) [0x7f447a11f859]",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1b0) [0x556d6548ca2a]",
        "(()+0x9e9bd7) [0x556d6548cbd7]",
        "(ceph::msgr::v2::FrameAssembler::get_frame_onwire_len() const+0x19c) [0x556d6600d77c]",
        "(bool ProtocolV2::append_frame<ceph::msgr::v2::MessageFrame>(ceph::msgr::v2::MessageFrame&)+0x88) [0x556d660047f8]",
        "(ProtocolV2::write_message(Message*, bool)+0x3a5) [0x556d65fe44a5]",
        "(ProtocolV2::write_event()+0x283) [0x556d65ffb363]",
        "(AsyncConnection::handle_write()+0x47) [0x556d65fb50f7]",
        "(EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xd67) [0x556d65de5657]",
        "(()+0x1348162) [0x556d65deb162]",
        "(()+0xd6de4) [0x7f447a531de4]",
        "(()+0x8609) [0x7f447a6b2609]",
        "(clone()+0x43) [0x7f447a21c163]" 
    ],
    "ceph_version": "15.2.16",
    "crash_id": "2022-06-03T09:35:09.281444Z_204a2de6-6f76-42bb-9ad5-e990e371658f",
    "entity_name": "osd.117",
    "os_id": "ubuntu",
    "os_name": "Ubuntu",
    "os_version": "20.04.4 LTS (Focal Fossa)",
    "os_version_id": "20.04",
    "process_name": "ceph-osd",
    "stack_sig": "9174ebc9cd5add86feec2fcbfd333033d1913695cf484150e2ed5fb674b0acd5",
    "timestamp": "2022-06-03T09:35:09.281444Z",
    "utsname_hostname": "storage12",
    "utsname_machine": "x86_64",
    "utsname_release": "5.4.0-107-generic",
    "utsname_sysname": "Linux",
    "utsname_version": "#121-Ubuntu SMP Thu Mar 24 16:04:27 UTC 2022" 
}

Related issues 2 (0 open2 closed)

Copied to RADOS - Backport #57257: quincy: Assert in Ceph messengerResolvedRadoslaw ZarzynskiActions
Copied to RADOS - Backport #57258: pacific: Assert in Ceph messengerResolvedKonstantin ShalyginActions
Actions #1

Updated by Venky Shankar almost 2 years ago

  • Project changed from CephFS to RADOS
Actions #2

Updated by Patrick Donnelly almost 2 years ago

  • Description updated (diff)
Actions #3

Updated by Radoslaw Zarzynski almost 2 years ago

  • Assignee set to Radoslaw Zarzynski
Actions #5

Updated by Yaarit Hatuka almost 2 years ago

  • Crash signature (v2) updated (diff)
Actions #6

Updated by Yaarit Hatuka almost 2 years ago

  • Crash signature (v2) updated (diff)
Actions #7

Updated by Ilya Dryomov almost 2 years ago

  • Target version deleted (v15.2.16)
Actions #8

Updated by Radoslaw Zarzynski over 1 year ago

The number of elements in FrameAssembler::m_desc can be altered only by:

1.

bufferlist FrameAssembler::assemble_frame(Tag tag, bufferlist segment_bls[],
                                          const uint16_t segment_aligns[],
                                          size_t segment_count) {
  m_flags = 0;
  m_descs.resize(calc_num_segments(segment_bls, segment_count));
  // ...
}

2.

void set_is_rev1(bool is_rev1) {
  m_descs.clear();
  m_flags = 0;
  m_is_rev1 = is_rev1;
}

set_is_rev1() can be called from:
1. ProtocolV2::_handle_peer_banner_payload() (the hello frame sequence) or
2. ProtocolV2::reuse_connection().

Guess which one is the culprit! ;-)

Actions #9

Updated by Radoslaw Zarzynski over 1 year ago

It looks reuse_connection() holds the

CtPtr ProtocolV2::reuse_connection(const AsyncConnectionRef& existing,
                                   ProtocolV2 *exproto) {
  // ...
  std::lock_guard<std::mutex> l(existing->write_lock);

  // ...
  if (!reconnecting) {
    exproto->peer_supported_features = peer_supported_features;
    exproto->tx_frame_asm.set_is_rev1(tx_frame_asm.get_is_rev1());
    exproto->rx_frame_asm.set_is_rev1(rx_frame_asm.get_is_rev1());

    // ...
  }
  // ...
}

Let's see the caller of write_message():

void ProtocolV2::write_event() {
      // ...
      connection->write_lock.unlock();

      // send_message or requeue messages may not encode message
      if (!out_entry.is_prepared) {
        prepare_send_message(connection->get_features(), out_entry.m);
      }

      if (out_entry.m->queue_start != ceph::mono_time()) {
        connection->logger->tinc(l_msgr_send_messages_queue_lat,
                                 ceph::mono_clock::now() -
                                 out_entry.m->queue_start);
      }

      r = write_message(out_entry.m, more);

      connection->write_lock.lock();

write_message() only takes the AsyncConnection::lock which is not enough. Oops!

ssize_t ProtocolV2::write_message(Message *m, bool more) {
  FUNCTRACE(cct);
  ceph_assert(connection->center->in_thread());
  m->set_seq(++out_seq);

  connection->lock.lock();
  uint64_t ack_seq = in_seq;
  ack_left = 0;
  connection->lock.unlock();

  ceph_msg_header &header = m->get_header();
  ceph_msg_footer &footer = m->get_footer();

  ceph_msg_header2 header2{header.seq,        header.tid,
                           header.type,       header.priority,
                           header.version,
                           ceph_le32(0),      header.data_off,
                           ceph_le64(ack_seq),
                           footer.flags,      header.compat_version,
                           header.reserved};

  auto message = MessageFrame::Encode(
                             header2,
                             m->get_payload(),
                             m->get_middle(),
                             m->get_data());
  if (!append_frame(message)) {
    m->put();
    return -EILSEQ;
  }
Actions #10

Updated by Radoslaw Zarzynski over 1 year ago

  • Status changed from New to In Progress
Actions #11

Updated by Radoslaw Zarzynski over 1 year ago

Perhaps we should move into deactivate_existing part of reuse_connection() where we hold both locks the same time.

Actions #12

Updated by Radoslaw Zarzynski over 1 year ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 47301
Actions #13

Updated by Radoslaw Zarzynski over 1 year ago

  • Backport set to pacific,quincy
Actions #14

Updated by Kefu Chai over 1 year ago

  • Status changed from Fix Under Review to Pending Backport
Actions #15

Updated by Backport Bot over 1 year ago

Actions #16

Updated by Backport Bot over 1 year ago

Actions #17

Updated by Backport Bot over 1 year ago

  • Tags set to backport_processed
Actions #18

Updated by Konstantin Shalygin over 1 year ago

  • Status changed from Pending Backport to Resolved
  • Target version set to v18.0.0
  • % Done changed from 0 to 100
  • Tags deleted (backport_processed)
Actions

Also available in: Atom PDF