Bug #48821

closed

osd crash in OSD::heartbeat when dereferencing null session

Added by Mykola Golub over 3 years ago. Updated about 3 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: pacific,octopus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

On an unhealthy (unstable) cluster with flapping OSDs, we observed crashes like this:

 ceph version 15.2.5-667-g1a579d5bf2 (1a579d5bf275b4ab4e62bd1094ba0e11bc672d01) octopus (stable)
 1: (()+0x132d0) [0x7fd0a6c282d0]
 2: (OSD::heartbeat()+0x514) [0x56448b7c44f4]
 3: (OSD::heartbeat_entry()+0x83) [0x56448b7c51d3]
 4: (OSD::T_Heartbeat::entry()+0xd) [0x56448b83fcad]
 5: (()+0x84f9) [0x7fd0a6c1d4f9]
 6: (clone()+0x3f) [0x7fd0a59c9fbf]

Some details from the debugger:

#bt
#0  0x00007f571a1c5170 in raise () from ./lib64/libpthread.so.0
#1  0x00005641e03cf450 in reraise_fatal (signum=11) at /usr/src/debug/ceph-15.2.5.667+g1a579d5bf2-3.3.1.x86_64/src/global/signal_handler.cc:81
#2  handle_fatal_signal (signum=11) at /usr/src/debug/ceph-15.2.5.667+g1a579d5bf2-3.3.1.x86_64/src/global/signal_handler.cc:326
#3  <signal handler called>
#4  boost::intrusive_ptr<HeartbeatStamps>::operator-> (this=0x1c0) at /usr/src/debug/ceph-15.2.5.667+g1a579d5bf2-3.3.1.x86_64/build/boost/include/boost/smart_ptr/intrusive_ptr.hpp:200
#5  OSD::heartbeat (this=this@entry=0x5641eb90c000) at /usr/src/debug/ceph-15.2.5.667+g1a579d5bf2-3.3.1.x86_64/src/osd/OSD.cc:5695
#6  0x00005641dfdd61d3 in OSD::heartbeat_entry (this=0x5641eb90c000) at /usr/src/debug/ceph-15.2.5.667+g1a579d5bf2-3.3.1.x86_64/src/osd/OSD.cc:5568
#7  0x00005641dfe50cad in OSD::T_Heartbeat::entry (this=<optimized out>) at /usr/src/debug/ceph-15.2.5.667+g1a579d5bf2-3.3.1.x86_64/src/osd/OSD.h:1483
#8  0x00007f571a1ba4f9 in start_thread () from ./lib64/libpthread.so.0
#9  0x00007f5718f66fbf in clone () from ./lib64/libc.so.6
#fr 5
#5  OSD::heartbeat (this=this@entry=0x5641eb90c000) at /usr/src/debug/ceph-15.2.5.667+g1a579d5bf2-3.3.1.x86_64/src/osd/OSD.cc:5695
5695        s->stamps->sent_ping(&delta_ub);
#l
5690        if (i->second.hb_interval_start == utime_t())
5691          i->second.hb_interval_start = now;
5692    
5693        Session *s = static_cast<Session*>(i->second.con_back->get_priv().get());
5694        std::optional<ceph::signedspan> delta_ub;
5695        s->stamps->sent_ping(&delta_ub);
5696    
5697        i->second.con_back->send_message(
5698          new MOSDPing(monc->get_fsid(),
5699               service.get_osdmap_epoch(),
#fr 4
#4  boost::intrusive_ptr<HeartbeatStamps>::operator-> (this=0x1c0) at /usr/src/debug/ceph-15.2.5.667+g1a579d5bf2-3.3.1.x86_64/build/boost/include/boost/smart_ptr/intrusive_ptr.hpp:200
200            return px;
#p this
$8 = (const boost::intrusive_ptr<HeartbeatStamps> * const) 0x1c0

So it crashes trying to dereference a session pointer that is null (probably reset by ms_handle_reset?).
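
One way to read the `this=0x1c0` in frame 4: if `s` is null, the address of its `stamps` member is just the member's offset within `Session`, so the smart pointer's `this` comes out as a small non-zero value instead of 0. The sketch below illustrates that arithmetic only. `FakeSession`, `FakeStamps`, and the padding size are hypothetical stand-ins chosen so the offset comes out as 0x1c0; they are not the real Ceph layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for illustration; the real Session layout differs.
struct FakeStamps { int dummy; };
struct FakeSession {
  char other_state[0x1c0];  // padding chosen so that stamps lands at offset 0x1c0
  FakeStamps* stamps;       // the real member is a boost::intrusive_ptr<HeartbeatStamps>
};

// Address a debugger would show as `this` for the stamps member when the
// enclosing object lives at `base`. Pure integer arithmetic, nothing is
// dereferenced.
std::uintptr_t stamps_member_address(std::uintptr_t base) {
  return base + offsetof(FakeSession, stamps);
}
```

With a null base, `stamps_member_address(0)` equals `0x1c0` under this assumed layout, which matches the shape of the value seen in the backtrace.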


Related issues 2 (0 open, 2 closed)

Copied to RADOS - Backport #49008: pacific: osd crash in OSD::heartbeat when dereferencing null session (Resolved, singuliere)
Copied to RADOS - Backport #49009: octopus: osd crash in OSD::heartbeat when dereferencing null session (Resolved, singuliere)
Actions #1

Updated by Mykola Golub over 3 years ago

The fix seems to be simply to check that the session pointer is not null before trying to use it, unless the problem is deeper...
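
A minimal sketch of that kind of guard, using simplified stand-in types rather than the real OSD code: `Stamps`, `Session`, `Connection`, and `ping_peers` are all hypothetical names, and `std::shared_ptr` stands in for Ceph's own reference-counted pointer. It is not the change from the actual pull request, only an illustration of skipping a peer whose session has been cleared.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical, simplified stand-ins for the heartbeat types.
struct Stamps {
  int pings_sent = 0;
  void sent_ping() { ++pings_sent; }
};
struct Session {
  Stamps stamps;
};
struct Connection {
  std::shared_ptr<Session> priv;  // may be reset to null, e.g. on connection reset
  std::shared_ptr<Session> get_priv() const { return priv; }
};

// Records a ping for every peer that still has a session and returns how
// many peers were pinged. Peers with a null session are skipped instead of
// being dereferenced.
int ping_peers(const std::vector<Connection>& peers) {
  int pinged = 0;
  for (const auto& con : peers) {
    auto s = con.get_priv();  // keep the reference alive while the session is used
    if (!s) {
      continue;  // session already gone, nothing to stamp
    }
    s->stamps.sent_ping();
    ++pinged;
  }
  return pinged;
}
```

Holding the returned smart pointer in a local, rather than taking a raw pointer from a temporary, is a design choice in this sketch: it keeps the session alive for the duration of the loop body even if another thread drops the connection's reference.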

Actions #2

Updated by Neha Ojha over 3 years ago

Sounds right. Would you like to create a quick PR for this?

Actions #3

Updated by Mykola Golub over 3 years ago

  • Status changed from New to In Progress
  • Backport set to octopus
Actions #4

Updated by Mykola Golub over 3 years ago

  • Backport changed from octopus to pacific,octopus
Actions #5

Updated by Mykola Golub over 3 years ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 38931
Actions #6

Updated by Kefu Chai about 3 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #7

Updated by Backport Bot about 3 years ago

  • Copied to Backport #49008: pacific: osd crash in OSD::heartbeat when dereferencing null session added
Actions #8

Updated by Backport Bot about 3 years ago

  • Copied to Backport #49009: octopus: osd crash in OSD::heartbeat when dereferencing null session added
Actions #9

Updated by Loïc Dachary about 3 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
