Bug #56808

crash: LogSegment* MDLog::get_current_segment(): assert(!segments.empty())

Added by Telemetry Bot 4 months ago. Updated 2 months ago.

Status: In Progress
Priority: Low
Category: Correctness/Safety
Target version:
% Done: 0%
Source: Telemetry
Tags:
Backport: pacific,quincy
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS): crash
Pull request ID:
Crash signature (v1): 519f0510e677d0f076d7e4dd916d86cea61d0049a67f803e64ed6ef1f8814409


Description

http://telemetry.front.sepia.ceph.com:4000/d/jByk5HaMz/crash-spec-x-ray?orgId=1&var-sig_v2=70db1b6eecab75317a1e77bd7fedf48eb1293aa77704f087488b1a54f15022a6

Assert condition: !segments.empty()
Assert function: LogSegment* MDLog::get_current_segment()

Sanitized backtrace:

    Server::journal_close_session(Session*, int, Context*)
    Server::kill_session(Session*, Context*)
    Server::apply_blocklist()
    MDSRank::apply_blocklist(std::set<entity_addr_t, std::less<entity_addr_t>, std::allocator<entity_addr_t> > const&, unsigned int)
    MDSRankDispatcher::handle_osd_map()
    MDSDaemon::handle_core_message(boost::intrusive_ptr<Message const> const&)
    MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)
    DispatchQueue::entry()
    DispatchQueue::DispatchThread::entry()

Crash dump sample:
{
    "assert_condition": "!segments.empty()",
    "assert_file": "mds/MDLog.h",
    "assert_func": "LogSegment* MDLog::get_current_segment()",
    "assert_line": 99,
    "assert_msg": "mds/MDLog.h: In function 'LogSegment* MDLog::get_current_segment()' thread 7fc18059e700 time 2022-07-21T14:40:03.922236+0000\nmds/MDLog.h: 99: FAILED ceph_assert(!segments.empty())",
    "assert_thread_name": "ms_dispatch",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12ce0) [0x7fc187fb7ce0]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1b0) [0x7fc188fd6c32]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x283df5) [0x7fc188fd6df5]",
        "(Server::journal_close_session(Session*, int, Context*)+0x8d5) [0x558826e22a05]",
        "(Server::kill_session(Session*, Context*)+0x212) [0x558826e23012]",
        "(Server::apply_blocklist()+0x10d) [0x558826e232cd]",
        "(MDSRank::apply_blocklist(std::set<entity_addr_t, std::less<entity_addr_t>, std::allocator<entity_addr_t> > const&, unsigned int)+0x34) [0x558826ddf024]",
        "(MDSRankDispatcher::handle_osd_map()+0xf6) [0x558826ddf366]",
        "(MDSDaemon::handle_core_message(boost::intrusive_ptr<Message const> const&)+0x39b) [0x558826dc85bb]",
        "(MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0xc3) [0x558826dc8f73]",
        "(DispatchQueue::entry()+0x14fa) [0x7fc18925d43a]",
        "(DispatchQueue::DispatchThread::entry()+0x11) [0x7fc189314581]",
        "/lib64/libpthread.so.0(+0x81ca) [0x7fc187fad1ca]",
        "clone()" 
    ],
    "ceph_version": "17.2.1",
    "crash_id": "2022-07-21T14:40:03.929752Z_e2ff8687-eb24-49a0-9f80-caba2d348573",
    "entity_name": "mds.022ecb20699777119741767351c32be151715ba3",
    "os_id": "centos",
    "os_name": "CentOS Stream",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "ceph-mds",
    "stack_sig": "519f0510e677d0f076d7e4dd916d86cea61d0049a67f803e64ed6ef1f8814409",
    "timestamp": "2022-07-21T14:40:03.929752Z",
    "utsname_machine": "x86_64",
    "utsname_release": "5.15.52-1-lts",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Sat, 02 Jul 2022 20:04:03 +0000" 
}

History

#1 Updated by Telemetry Bot 4 months ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
  • Affected Versions v17.2.1 added

#2 Updated by Venky Shankar 4 months ago

  • Category set to Correctness/Safety
  • Assignee set to Kotresh Hiremath Ravishankar
  • Target version set to v18.0.0
  • Backport set to pacific,quincy
  • Severity changed from 3 - minor to 2 - major
  • Component(FS) MDS added
  • Labels (FS) crash added

Looks similar to https://tracker.ceph.com/issues/51589 which was fixed a while ago.

Kotresh, please RCA this.

#3 Updated by Kotresh Hiremath Ravishankar 2 months ago

  • Status changed from New to In Progress
  • Priority changed from Normal to Low

This seems to be fixed by the PR https://github.com/ceph/ceph/pull/46833
Unfortunately, we don't have mds logs associated with this crash to know more about the mds state during which the crash happened.
After looking into the code, I think the mds was in 'standby_replay' mode when it crashed. The rationale is that the replay_thread
initializes the journal segment when the mds moves from 'standby' to 'standby_replay'. There could be a race here: an osd_map that
triggers journaling of client blocklisting might have arrived before this journal segment was initialized, so get_current_segment()
hit the assert on an empty segment list. I tried reproducing this scenario by instrumenting the code but couldn't; I might be missing
something in the reproduction steps. The PR https://github.com/ceph/ceph/pull/46833 doesn't allow journaling of blocklisting events
if the mds is in 'standby_replay' or any replay state. This should solve the problem.

So I'm lowering the priority of this tracker and waiting to see whether it happens in 17.2.4, which has the above-mentioned fix.
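
To make the suspected race easier to follow, here is a minimal, standalone sketch of it. This is not Ceph code: ToyMDLog, MdsState, is_replaying() and the two apply_blocklist_* helpers are hypothetical stand-ins, and the toy log throws an exception where the real MDLog::get_current_segment() calls ceph_assert(). The guard only models the behaviour described above (skip journaling of blocklisting while in 'standby_replay' or a replay state); it is an assumption about the shape of the fix, not a copy of the PR.

```cpp
#include <cassert>
#include <deque>
#include <stdexcept>

// Hypothetical, simplified stand-ins for MDS types; none of this is Ceph code.
enum class MdsState { Standby, StandbyReplay, Replay, Active };

struct LogSegment { unsigned long seq; };

class ToyMDLog {
public:
  // Models the failing check. The real code asserts !segments.empty();
  // the toy throws instead so the failure can be observed without aborting.
  LogSegment* get_current_segment() {
    if (segments.empty())
      throw std::logic_error("!segments.empty()");
    return &segments.back();
  }

  // Models the replay thread creating the first journal segment.
  void start_new_segment() { segments.push_back(LogSegment{next_seq++}); }

private:
  std::deque<LogSegment> segments;
  unsigned long next_seq = 1;
};

// Assumption: these are the states in which journaling must be skipped.
inline bool is_replaying(MdsState s) {
  return s == MdsState::StandbyReplay || s == MdsState::Replay;
}

// Unguarded path: journals the session close regardless of state.
// If the osd_map arrives before the first segment exists, this fails.
bool apply_blocklist_unguarded(ToyMDLog& log) {
  log.get_current_segment();
  return true;
}

// Guarded path: returns false (nothing journaled) while replaying,
// otherwise journals into the current segment and returns true.
bool apply_blocklist_guarded(ToyMDLog& log, MdsState state) {
  if (is_replaying(state))
    return false;
  log.get_current_segment();
  return true;
}
```

Calling apply_blocklist_unguarded() on a freshly constructed ToyMDLog fails the same way the crash report does, while apply_blocklist_guarded() with MdsState::StandbyReplay returns false without touching the segment list.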
