Bug #56808
opencrash: LogSegment* MDLog::get_current_segment(): assert(!segments.empty())
519f0510e677d0f076d7e4dd916d86cea61d0049a67f803e64ed6ef1f8814409
Description
Assert condition: !segments.empty()
Assert function: LogSegment* MDLog::get_current_segment()
Sanitized backtrace:
Server::journal_close_session(Session*, int, Context*)
Server::kill_session(Session*, Context*)
Server::apply_blocklist()
MDSRank::apply_blocklist(std::set<entity_addr_t, std::less<entity_addr_t>, std::allocator<entity_addr_t> > const&, unsigned int)
MDSRankDispatcher::handle_osd_map()
MDSDaemon::handle_core_message(boost::intrusive_ptr<Message const> const&)
MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)
DispatchQueue::entry()
DispatchQueue::DispatchThread::entry()
Crash dump sample:
{
  "assert_condition": "!segments.empty()",
  "assert_file": "mds/MDLog.h",
  "assert_func": "LogSegment* MDLog::get_current_segment()",
  "assert_line": 99,
  "assert_msg": "mds/MDLog.h: In function 'LogSegment* MDLog::get_current_segment()' thread 7fc18059e700 time 2022-07-21T14:40:03.922236+0000\nmds/MDLog.h: 99: FAILED ceph_assert(!segments.empty())",
  "assert_thread_name": "ms_dispatch",
  "backtrace": [
    "/lib64/libpthread.so.0(+0x12ce0) [0x7fc187fb7ce0]",
    "gsignal()",
    "abort()",
    "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1b0) [0x7fc188fd6c32]",
    "/usr/lib64/ceph/libceph-common.so.2(+0x283df5) [0x7fc188fd6df5]",
    "(Server::journal_close_session(Session*, int, Context*)+0x8d5) [0x558826e22a05]",
    "(Server::kill_session(Session*, Context*)+0x212) [0x558826e23012]",
    "(Server::apply_blocklist()+0x10d) [0x558826e232cd]",
    "(MDSRank::apply_blocklist(std::set<entity_addr_t, std::less<entity_addr_t>, std::allocator<entity_addr_t> > const&, unsigned int)+0x34) [0x558826ddf024]",
    "(MDSRankDispatcher::handle_osd_map()+0xf6) [0x558826ddf366]",
    "(MDSDaemon::handle_core_message(boost::intrusive_ptr<Message const> const&)+0x39b) [0x558826dc85bb]",
    "(MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0xc3) [0x558826dc8f73]",
    "(DispatchQueue::entry()+0x14fa) [0x7fc18925d43a]",
    "(DispatchQueue::DispatchThread::entry()+0x11) [0x7fc189314581]",
    "/lib64/libpthread.so.0(+0x81ca) [0x7fc187fad1ca]",
    "clone()"
  ],
  "ceph_version": "17.2.1",
  "crash_id": "2022-07-21T14:40:03.929752Z_e2ff8687-eb24-49a0-9f80-caba2d348573",
  "entity_name": "mds.022ecb20699777119741767351c32be151715ba3",
  "os_id": "centos",
  "os_name": "CentOS Stream",
  "os_version": "8",
  "os_version_id": "8",
  "process_name": "ceph-mds",
  "stack_sig": "519f0510e677d0f076d7e4dd916d86cea61d0049a67f803e64ed6ef1f8814409",
  "timestamp": "2022-07-21T14:40:03.929752Z",
  "utsname_machine": "x86_64",
  "utsname_release": "5.15.52-1-lts",
  "utsname_sysname": "Linux",
  "utsname_version": "#1 SMP Sat, 02 Jul 2022 20:04:03 +0000"
}
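For context, the failing assertion sits in the accessor for the newest in-memory journal segment. The following is a simplified sketch of that accessor (member and type names follow `mds/MDLog.h` and the backtrace, but this is not the verbatim Ceph source): if no segment has been created yet, for example because the journal has not been initialized, the assert aborts the daemon.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

struct LogSegment {};

// Simplified sketch of MDLog around the assert that fired
// (mds/MDLog.h:99). `segments` maps journal offsets to in-memory
// LogSegment objects; the highest offset is the "current" segment.
struct MDLog {
  std::map<uint64_t, LogSegment*> segments;  // offset -> segment

  LogSegment* get_current_segment() {
    assert(!segments.empty());         // ceph_assert in the real code
    return segments.rbegin()->second;  // newest (highest-offset) segment
  }
};
```

Any code path that journals an event while `segments` is still empty, such as the session-kill path in the backtrace above, trips this assert.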
Updated by Telemetry Bot over 1 year ago
Updated by Venky Shankar over 1 year ago
- Category set to Correctness/Safety
- Assignee set to Kotresh Hiremath Ravishankar
- Target version set to v18.0.0
- Backport set to pacific,quincy
- Severity changed from 3 - minor to 2 - major
- Component(FS) MDS added
- Labels (FS) crash added
Looks similar to https://tracker.ceph.com/issues/51589 which was fixed a while ago.
Kotresh, please RCA this.
Updated by Kotresh Hiremath Ravishankar over 1 year ago
- Status changed from New to In Progress
- Priority changed from Normal to Low
This seems to be fixed by PR https://github.com/ceph/ceph/pull/46833.

Unfortunately, we don't have the MDS logs associated with this crash, so we can't tell exactly what state the MDS was in when it happened.

After looking into the code, I think the MDS must have been in 'standby_replay' mode. The rationale: the replay_thread initializes the journal segments when the MDS transitions from 'standby' to 'standby_replay'. There could be a race in which an osd_map requiring the blocklisting of clients to be journaled arrives before the journal segments are initialized. I tried to reproduce this scenario by instrumenting the code but couldn't; I might be missing something in the reproduction steps. PR https://github.com/ceph/ceph/pull/46833 disallows journaling blocklist events while the MDS is in 'standby_replay' or any replay state, which should solve the problem.

So, lowering the priority of the tracker and waiting to see whether the crash recurs in 17.2.4, which includes the above-mentioned fix.
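In spirit, the fix gates blocklist journaling on the MDS state. A minimal sketch of that guard follows; the state enum and function name here are hypothetical illustrations, not the actual Ceph code from the PR:

```cpp
// Hypothetical, simplified MDS states for illustration only.
enum class MDSState { STANDBY, STANDBY_REPLAY, REPLAY, ACTIVE };

// Returns true when it is safe to journal a blocklist-driven session
// kill. While the MDS is standby or (standby-)replaying, its journal
// segments may not be initialized yet, so the event must be skipped;
// replay will reconstruct session state from the journal anyway.
bool may_journal_blocklist_event(MDSState state) {
  switch (state) {
    case MDSState::STANDBY:
    case MDSState::STANDBY_REPLAY:
    case MDSState::REPLAY:
      return false;  // no writable journal segment yet
    default:
      return true;
  }
}
```

With such a check in `Server::apply_blocklist()` / `Server::kill_session()`, the `get_current_segment()` call is never reached on an empty segment map.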