mds: assertion in MDSRank::validate_sessions
This function is meant to make the MDS more resilient by killing any client sessions whose prealloc_inos are inconsistent with the inotable. We only take this path (and see this crash) if the metadata is already inconsistent.
However, it is called in the MDS_BOOT_REPLAY_DONE path, which runs while the MDS is still in the replay state. When it tries to kill a session (which involves writing to mdlog), we hit an assertion. The trace below is from the ceph-users thread "[ceph-users] MDS Bug/Problem" today:
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x555879e88942]
> 2: (MDLog::_submit_entry(LogEvent*, MDSLogContextBase*)+0x567) [0x555879ddd4e7]
> 3: (Server::journal_close_session(Session*, int, Context*)+0x963) [0x555879b8d603]
> 4: (Server::kill_session(Session*, Context*)+0x1fd) [0x555879b8e56d]
> 5: (MDSRank::validate_sessions()+0x2dc) [0x555879b4c2dc]
> 6: (MDSRank::boot_start(MDSRank::BootStep, int)+0xc08) [0x555879b4d228]
> 7: (MDSInternalContextBase::complete(int)+0x18b) [0x555879dc56db]
> 8: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::sub_finish(MDSInternalContextBase*, int)+0x127) [0x555879b66627]
> 9: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::C_GatherSub::complete(int)+0x21) [0x555879b668d1]
> 10: (MDLog::_replay_thread()+0x43c) [0x555879ddac8c]
> 11: (MDLog::ReplayThread::entry()+0xd) [0x555879b56fcd]
> 12: (()+0x76ba) [0x7fd57f3f06ba]
> 13: (clone()+0x6d) [0x7fd57e45c41d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
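To make the failure mode concrete, here is a minimal standalone sketch of the ordering problem described above. It does not use the real Ceph classes: `ToyLog`, `ToySession`, `find_inconsistent_sessions` and `kill_sessions` are hypothetical names invented for illustration. It also assumes that "inconsistent" means a session holds a preallocated ino that the table still lists as free, and it throws an exception where the real MDS asserts. The sketch separates detection (read-only, safe during replay) from the kill (which journals), which is one possible way to avoid the crash and not necessarily how a fix would be implemented.

```cpp
#include <cstdint>
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration only; not the real Ceph types.
enum class ToyState { Replay, Active };

struct ToyLog {
    ToyState state = ToyState::Replay;
    std::vector<std::string> entries;

    // Models the constraint from the trace: the journal cannot be written
    // while the daemon is still replaying. The real MDS asserts here; the
    // sketch throws so the behaviour can be observed without aborting.
    void submit(const std::string& entry) {
        if (state == ToyState::Replay) {
            throw std::logic_error("journal write attempted during replay");
        }
        entries.push_back(entry);
    }
};

struct ToySession {
    int id;
    std::set<std::uint64_t> prealloc_inos;
};

// Read-only check, safe in any state. Assumption: a session is inconsistent
// if any of its preallocated inos is still marked free in the table.
std::vector<int> find_inconsistent_sessions(
        const std::vector<ToySession>& sessions,
        const std::set<std::uint64_t>& free_inos) {
    std::vector<int> bad;
    for (const auto& s : sessions) {
        for (std::uint64_t ino : s.prealloc_inos) {
            if (free_inos.count(ino) != 0) {
                bad.push_back(s.id);
                break;
            }
        }
    }
    return bad;
}

// Killing a session journals a close event, so this must only run once the
// log is writeable (that is, after the replay state has been left).
void kill_sessions(ToyLog& log, const std::vector<int>& ids) {
    for (int id : ids) {
        log.submit("close session " + std::to_string(id));
    }
}
```

In this model, calling `kill_sessions` while `log.state` is `ToyState::Replay` fails in the same place the trace does (inside the log submit), whereas collecting the ids during replay and calling `kill_sessions` after the state changes to `ToyState::Active` succeeds.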
#2 Updated by Patrick Donnelly 9 months ago
- Subject changed from Assertion in MDSRank::validate_sessions to mds: assertion in MDSRank::validate_sessions
- Assignee set to Zheng Yan
- Target version set to v13.0.0
- Source set to Community (user)
- Tags set to crash
- Backport set to luminous
- Component(FS) MDS added