Bug #55851
Assert in Ceph messenger
% Done:
100%
Source:
Community (user)
Tags:
Backport:
pacific,quincy
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Description
Context:
Ceph balancer was busy balancing: PG remaps
{ "assert_condition": "!m_descs.empty()", "assert_file": "/build/ceph-15.2.16/src/msg/async/frames_v2.cc", "assert_func": "uint64_t ceph::msgr::v2::FrameAssembler::get_frame_onwire_len() const", "assert_line": 90, "assert_msg": "/build/ceph-15.2.16/src/msg/async/frames_v2.cc: In function 'uint64_t ceph::msgr::v2::FrameAssembler::get_frame_onwire_len() const' thread 7f44787c5700 time 2022-06-03T11:35:09.256910+0200\n/build/ceph-15.2.16/src/msg/async/frames_v2.cc: 90: FAILED ceph_assert(!m_descs.empty())\n", "assert_thread_name": "msgr-worker-2", "backtrace": [ "(()+0x143c0) [0x7f447a6be3c0]", "(gsignal()+0xcb) [0x7f447a14003b]", "(abort()+0x12b) [0x7f447a11f859]", "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1b0) [0x556d6548ca2a]", "(()+0x9e9bd7) [0x556d6548cbd7]", "(ceph::msgr::v2::FrameAssembler::get_frame_onwire_len() const+0x19c) [0x556d6600d77c]", "(bool ProtocolV2::append_frame<ceph::msgr::v2::MessageFrame>(ceph::msgr::v2::MessageFrame&)+0x88) [0x556d660047f8]", "(ProtocolV2::write_message(Message*, bool)+0x3a5) [0x556d65fe44a5]", "(ProtocolV2::write_event()+0x283) [0x556d65ffb363]", "(AsyncConnection::handle_write()+0x47) [0x556d65fb50f7]", "(EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xd67) [0x556d65de5657]", "(()+0x1348162) [0x556d65deb162]", "(()+0xd6de4) [0x7f447a531de4]", "(()+0x8609) [0x7f447a6b2609]", "(clone()+0x43) [0x7f447a21c163]" ], "ceph_version": "15.2.16", "crash_id": "2022-06-03T09:35:09.281444Z_204a2de6-6f76-42bb-9ad5-e990e371658f", "entity_name": "osd.117", "os_id": "ubuntu", "os_name": "Ubuntu", "os_version": "20.04.4 LTS (Focal Fossa)", "os_version_id": "20.04", "process_name": "ceph-osd", "stack_sig": "9174ebc9cd5add86feec2fcbfd333033d1913695cf484150e2ed5fb674b0acd5", "timestamp": "2022-06-03T09:35:09.281444Z", "utsname_hostname": "storage12", "utsname_machine": "x86_64", "utsname_release": "5.4.0-107-generic", "utsname_sysname": "Linux", "utsname_version": "#121-Ubuntu SMP Thu Mar 24 16:04:27 UTC 2022" }
Related issues
History
#1 Updated by Venky Shankar over 1 year ago
- Project changed from CephFS to RADOS
#2 Updated by Patrick Donnelly over 1 year ago
- Description updated (diff)
#3 Updated by Radoslaw Zarzynski over 1 year ago
- Assignee set to Radoslaw Zarzynski
#4 Updated by Radoslaw Zarzynski over 1 year ago
From Neha:
- http://telemetry.front.sepia.ceph.com:4000/d/jByk5HaMz/crash-spec-x-ray?var-sig_v2=12eed3bdd041d05365ec4192181b37d22054f1fec42e589c2e1fc8e6993ed467&orgId=1
- http://telemetry.front.sepia.ceph.com:4000/d/jByk5HaMz/crash-spec-x-ray?var-sig_v2=5532f5c395e705ebfcbe87dae6ae4269e074636a7dae01956a69d44d02f5f401&orgId=1
- http://telemetry.front.sepia.ceph.com:4000/d/Nvj6XTaMk/spec-search?orgId=1&var-substr_1=&var-substr_2=&var-substr_3=&var-majors_affected=&var-minors_affected=&var-assert_function=&var-assert_condition=m_descs.empty()&var-total_results=2&var-sig_v1=&var-sig_v2=&var-daemons=&var-only_new_fingerprints=false&var-status_description=All&var-only_open=false
#5 Updated by Yaarit Hatuka over 1 year ago
- Crash signature (v2) updated (diff)
#6 Updated by Yaarit Hatuka over 1 year ago
- Crash signature (v2) updated (diff)
#7 Updated by Ilya Dryomov about 1 year ago
- Target version deleted (
v15.2.16)
#8 Updated by Radoslaw Zarzynski about 1 year ago
The number of elements in FrameAssembler::m_desc
can be altered only by:
1.
bufferlist FrameAssembler::assemble_frame(Tag tag, bufferlist segment_bls[], const uint16_t segment_aligns[], size_t segment_count) { m_flags = 0; m_descs.resize(calc_num_segments(segment_bls, segment_count)); // ... }
2.
void set_is_rev1(bool is_rev1) { m_descs.clear(); m_flags = 0; m_is_rev1 = is_rev1; }
set_is_rev1()
can be called from:
1. ProtocolV2::_handle_peer_banner_payload()
(the hello frame sequence) or
2. ProtocolV2::reuse_connection()
.
Guess which one is the culprit! ;-)
#9 Updated by Radoslaw Zarzynski about 1 year ago
It looks reuse_connection()
holds the
CtPtr ProtocolV2::reuse_connection(const AsyncConnectionRef& existing, ProtocolV2 *exproto) { // ... std::lock_guard<std::mutex> l(existing->write_lock); // ... if (!reconnecting) { exproto->peer_supported_features = peer_supported_features; exproto->tx_frame_asm.set_is_rev1(tx_frame_asm.get_is_rev1()); exproto->rx_frame_asm.set_is_rev1(rx_frame_asm.get_is_rev1()); // ... } // ... }
Let's see the caller of write_message()
:
void ProtocolV2::write_event() { // ... connection->write_lock.unlock(); // send_message or requeue messages may not encode message if (!out_entry.is_prepared) { prepare_send_message(connection->get_features(), out_entry.m); } if (out_entry.m->queue_start != ceph::mono_time()) { connection->logger->tinc(l_msgr_send_messages_queue_lat, ceph::mono_clock::now() - out_entry.m->queue_start); } r = write_message(out_entry.m, more); connection->write_lock.lock();
write_message()
only takes the AsyncConnection::lock
which is not enough. Oops!
ssize_t ProtocolV2::write_message(Message *m, bool more) { FUNCTRACE(cct); ceph_assert(connection->center->in_thread()); m->set_seq(++out_seq); connection->lock.lock(); uint64_t ack_seq = in_seq; ack_left = 0; connection->lock.unlock(); ceph_msg_header &header = m->get_header(); ceph_msg_footer &footer = m->get_footer(); ceph_msg_header2 header2{header.seq, header.tid, header.type, header.priority, header.version, ceph_le32(0), header.data_off, ceph_le64(ack_seq), footer.flags, header.compat_version, header.reserved}; auto message = MessageFrame::Encode( header2, m->get_payload(), m->get_middle(), m->get_data()); if (!append_frame(message)) { m->put(); return -EILSEQ; }
#10 Updated by Radoslaw Zarzynski about 1 year ago
- Status changed from New to In Progress
#11 Updated by Radoslaw Zarzynski about 1 year ago
Perhaps we should move into deactivate_existing
part of reuse_connection()
where we hold both locks the same time.
#12 Updated by Radoslaw Zarzynski about 1 year ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 47301
#13 Updated by Radoslaw Zarzynski about 1 year ago
- Backport set to pacific,quincy
#14 Updated by Kefu Chai about 1 year ago
- Status changed from Fix Under Review to Pending Backport
#15 Updated by Backport Bot about 1 year ago
- Copied to Backport #57257: quincy: Assert in Ceph messenger added
#16 Updated by Backport Bot about 1 year ago
- Copied to Backport #57258: pacific: Assert in Ceph messenger added
#17 Updated by Backport Bot about 1 year ago
- Tags set to backport_processed
#18 Updated by Konstantin Shalygin 10 months ago
- Status changed from Pending Backport to Resolved
- Target version set to v18.0.0
- % Done changed from 0 to 100
- Tags deleted (
backport_processed)