Bug #51589
Closed
mds: crash when journaling during replay
% Done:
0%
Source:
Community (user)
Tags:
backport_processed
Backport:
pacific,octopus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
crash
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
MDS version: ceph version 14.2.20 (36274af6eb7f2a5055f2d53ad448f2694e9046a0) nautilus (stable)
With 200 clients, the MDS crashed after writing for many days.
I don't know what caused the MDS to crash.
[twj@xxxxxxxxx-MN-001.sn.cn ~]$ sudo ceph fs status
cephfs - 200 clients
======
+------+----------------+------------------------+----------+-------+-------+
| Rank | State          | MDS                    | Activity | dns   | inos  |
+------+----------------+------------------------+----------+-------+-------+
|  0   | resolve        | xxxxxxxxxxMN-002.sn.cn |          |     0 |     3 |
|  1   | resolve(laggy) | xxxxxxxxxxMN-003.sn.cn |          |     0 |     0 |
+------+----------------+------------------------+----------+-------+-------+
+----------------------+----------+-------+-------+
| Pool                 | type     | used  | avail |
+----------------------+----------+-------+-------+
| cephfs.metadata.pool | metadata | 70.5G |  793G |
| cephfs.data.pool1    | data     |  183T | 1115T |
| cephfs.data.pool2    | data     |  299T | 1042T |
+----------------------+----------+-------+-------+
+-------------+
| Standby MDS |
+-------------+
+-------------+
MDS version: ceph version 14.2.20 (36274af6eb7f2a5055f2d53ad448f2694e9046a0) nautilus (stable)
All MDS daemons crashed with the same assertion failure:
-1> 2021-07-08 15:14:13.283 7f3804255700 -1 /builddir/build/BUILD/ceph-14.2.20/src/mds/MDLog.cc: In function 'void MDLog::_submit_entry(LogEvent*, MDSLogContextBase*)' thread 7f3804255700 time 2021-07-08 15:14:13.283719
/builddir/build/BUILD/ceph-14.2.20/src/mds/MDLog.cc: 288: FAILED ceph_assert(!segments.empty())

 ceph version 14.2.20 (36274af6eb7f2a5055f2d53ad448f2694e9046a0) nautilus (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x7f380d72cfe7]
 2: (()+0x25d1af) [0x7f380d72d1af]
 3: (MDLog::_submit_entry(LogEvent*, MDSLogContextBase*)+0x599) [0x557471ec5959]
 4: (Server::journal_close_session(Session*, int, Context*)+0x9ed) [0x557471c7e02d]
 5: (Server::kill_session(Session*, Context*)+0x234) [0x557471c81914]
 6: (Server::apply_blacklist(std::set<entity_addr_t, std::less<entity_addr_t>, std::allocator<entity_addr_t> > const&)+0x14d) [0x557471c8449d]
 7: (MDSRank::reconnect_start()+0xcf) [0x557471c49c5f]
 8: (MDSRankDispatcher::handle_mds_map(boost::intrusive_ptr<MMDSMap const> const&, MDSMap const&)+0x1c29) [0x557471c57979]
 9: (MDSDaemon::handle_mds_map(boost::intrusive_ptr<MMDSMap const> const&)+0xa9b) [0x557471c3091b]
 10: (MDSDaemon::handle_core_message(boost::intrusive_ptr<Message const> const&)+0xed) [0x557471c3216d]
 11: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0xc3) [0x557471c32983]
 12: (DispatchQueue::entry()+0x1699) [0x7f380d952b79]
 13: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f380da008ed]
 14: (()+0x7ea5) [0x7f380b5eeea5]
 15: (clone()+0x6d) [0x7f380a29e96d]