Bug #56808: crash: LogSegment* MDLog::get_current_segment(): assert(!segments.empty()) - CephFS - Ceph

Bug #56808

open

crash: LogSegment* MDLog::get_current_segment(): assert(!segments.empty())

Added by Telemetry Bot over 1 year ago. Updated 7 months ago.

Status:

In Progress

Priority:

Low

Assignee:

Kotresh Hiremath Ravishankar

Category:

Correctness/Safety

Target version:

% Done:

Source:

Telemetry

Tags:

Backport:

pacific,quincy

Regression:

Severity:

2 - major

Reviewed:

Affected Versions:

Ceph - v17.2.1

ceph-qa-suite:

Component(FS):

MDS

Labels (FS):

crash

Pull request ID:

Crash signature (v1):

519f0510e677d0f076d7e4dd916d86cea61d0049a67f803e64ed6ef1f8814409

Crash signature (v2):

70db1b6eecab75317a1e77bd7fedf48eb1293aa77704f087488b1a54f15022a6

Description

http://telemetry.front.sepia.ceph.com:4000/d/jByk5HaMz/crash-spec-x-ray?orgId=1&var-sig_v2=70db1b6eecab75317a1e77bd7fedf48eb1293aa77704f087488b1a54f15022a6

Assert condition: !segments.empty()
Assert function: LogSegment* MDLog::get_current_segment()

Sanitized backtrace:

    Server::journal_close_session(Session*, int, Context*)
    Server::kill_session(Session*, Context*)
    Server::apply_blocklist()
    MDSRank::apply_blocklist(std::set<entity_addr_t, std::less<entity_addr_t>, std::allocator<entity_addr_t> > const&, unsigned int)
    MDSRankDispatcher::handle_osd_map()
    MDSDaemon::handle_core_message(boost::intrusive_ptr<Message const> const&)
    MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)
    DispatchQueue::entry()
    DispatchQueue::DispatchThread::entry()

Crash dump sample:

{
    "assert_condition": "!segments.empty()",
    "assert_file": "mds/MDLog.h",
    "assert_func": "LogSegment* MDLog::get_current_segment()",
    "assert_line": 99,
    "assert_msg": "mds/MDLog.h: In function 'LogSegment* MDLog::get_current_segment()' thread 7fc18059e700 time 2022-07-21T14:40:03.922236+0000\nmds/MDLog.h: 99: FAILED ceph_assert(!segments.empty())",
    "assert_thread_name": "ms_dispatch",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12ce0) [0x7fc187fb7ce0]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1b0) [0x7fc188fd6c32]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x283df5) [0x7fc188fd6df5]",
        "(Server::journal_close_session(Session*, int, Context*)+0x8d5) [0x558826e22a05]",
        "(Server::kill_session(Session*, Context*)+0x212) [0x558826e23012]",
        "(Server::apply_blocklist()+0x10d) [0x558826e232cd]",
        "(MDSRank::apply_blocklist(std::set<entity_addr_t, std::less<entity_addr_t>, std::allocator<entity_addr_t> > const&, unsigned int)+0x34) [0x558826ddf024]",
        "(MDSRankDispatcher::handle_osd_map()+0xf6) [0x558826ddf366]",
        "(MDSDaemon::handle_core_message(boost::intrusive_ptr<Message const> const&)+0x39b) [0x558826dc85bb]",
        "(MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0xc3) [0x558826dc8f73]",
        "(DispatchQueue::entry()+0x14fa) [0x7fc18925d43a]",
        "(DispatchQueue::DispatchThread::entry()+0x11) [0x7fc189314581]",
        "/lib64/libpthread.so.0(+0x81ca) [0x7fc187fad1ca]",
        "clone()" 
    ],
    "ceph_version": "17.2.1",
    "crash_id": "2022-07-21T14:40:03.929752Z_e2ff8687-eb24-49a0-9f80-caba2d348573",
    "entity_name": "mds.022ecb20699777119741767351c32be151715ba3",
    "os_id": "centos",
    "os_name": "CentOS Stream",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "ceph-mds",
    "stack_sig": "519f0510e677d0f076d7e4dd916d86cea61d0049a67f803e64ed6ef1f8814409",
    "timestamp": "2022-07-21T14:40:03.929752Z",
    "utsname_machine": "x86_64",
    "utsname_release": "5.15.52-1-lts",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Sat, 02 Jul 2022 20:04:03 +0000" 
}

Updated by Telemetry Bot over 1 year ago

Crash signature (v1) updated (diff)
Crash signature (v2) updated (diff)
Affected Versions v17.2.1 added

Updated by Venky Shankar over 1 year ago

Category set to Correctness/Safety
Assignee set to Kotresh Hiremath Ravishankar
Target version set to v18.0.0
Backport set to pacific,quincy
Severity changed from 3 - minor to 2 - major
Component(FS) MDS added
Labels (FS) crash added

Looks similar to https://tracker.ceph.com/issues/51589 which was fixed a while ago.

Kotresh, please RCA this.

Updated by Kotresh Hiremath Ravishankar over 1 year ago

Status changed from New to In Progress
Priority changed from Normal to Low

This seems to be fixed by the PR https://github.com/ceph/ceph/pull/46833
Unfortunately, we don't have the MDS logs associated with this crash, so we can't tell what state the MDS was in when it crashed.
After looking into the code, I think the MDS was most likely in 'standby_replay' mode. The rationale is that the replay_thread
initializes the journal segment when the MDS moves from 'standby' to 'standby_replay'. There could be a race: an osd_map that triggers
journaling of client blocklisting might have arrived before this journal segment was initialized. I tried reproducing this scenario by
instrumenting the code but couldn't; I might be missing something in the reproduction steps. The PR https://github.com/ceph/ceph/pull/46833
doesn't allow journaling of blocklisting events if the MDS is in 'standby_replay' or any replay state. This should solve the problem.

So I'm lowering the priority of the tracker and waiting to see if it happens in 17.2.4, which has the above-mentioned fix.
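
To make the suspected race easier to picture, here is a minimal, self-contained sketch. It is not Ceph code: `ToyLog`, `ToyServer`, `State`, and every function below are hypothetical stand-ins that only borrow names from the backtrace, and the toy throws an exception where the real MDS hits `ceph_assert`. It assumes the fix works as described above, i.e. blocklist journaling is skipped while the daemon is in 'standby_replay' or another replay state.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <stdexcept>

// Hypothetical daemon states; only the replay/non-replay split matters here.
enum class State { Standby, StandbyReplay, Replay, Active };

// Toy journal: maps a sequence number to a per-segment event counter.
struct ToyLog {
    std::map<uint64_t, int> segments;

    // Models the failed check: a current segment must already exist.
    // The toy throws instead of aborting so the failure can be observed.
    int& get_current_segment() {
        if (segments.empty())
            throw std::logic_error("FAILED assert(!segments.empty())");
        return segments.rbegin()->second;
    }
};

struct ToyServer {
    State state = State::Standby;
    ToyLog log;
    int journaled = 0;  // number of close-session events written

    bool is_replaying() const {
        return state == State::StandbyReplay || state == State::Replay;
    }

    // Unguarded path: journals a session close unconditionally,
    // which needs a current segment.
    void kill_session_unguarded() {
        log.get_current_segment() += 1;
        ++journaled;
    }

    // Guarded path (assumed shape of the fix): do not journal
    // blocklisting while replaying. Returns true if an event was journaled.
    bool apply_blocklist_guarded() {
        if (is_replaying())
            return false;
        kill_session_unguarded();
        return true;
    }
};

// Reproduces the suspected race in the toy model: a blocklist arrives in
// 'standby_replay' before any segment has been initialized.
bool unguarded_path_fails_during_replay() {
    ToyServer s;
    s.state = State::StandbyReplay;
    try {
        s.kill_session_unguarded();
    } catch (const std::logic_error&) {
        return true;
    }
    return false;
}
```

In this model the unguarded call fails exactly when the segment map is still empty, while the guarded call returns early in any replay state and journals only once the daemon is active and a segment exists.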

Updated by Patrick Donnelly 7 months ago

Target version deleted (~~v18.0.0~~)
