Bug #51589

Updated by Patrick Donnelly almost 3 years ago

MDS version: ceph version 14.2.20 (36274af6eb7f2a5055f2d53ad448f2694e9046a0) nautilus (stable) 

 With 200 clients connected, the MDS crashed after many days of sustained writes. 

 I don’t know what caused the MDS to crash. 

 <pre> 
 [twj@xxxxxxxxx-MN-001.sn.cn ~]$ sudo ceph fs status 
 cephfs - 200 clients 
 ====== 
 +------+----------------+------------------------+----------+-------+-------+ 
 | Rank |     State      |          MDS           | Activity |  dns  |  inos | 
 +------+----------------+------------------------+----------+-------+-------+ 
 |  0   |    resolve     | xxxxxxxxxxMN-002.sn.cn |          |   0   |   3   | 
 |  1   | resolve(laggy) | xxxxxxxxxxMN-003.sn.cn |          |   0   |   0   | 
 +------+----------------+------------------------+----------+-------+-------+ 
 +----------------------+----------+-------+-------+ 
 |         Pool         |   type   |  used | avail | 
 +----------------------+----------+-------+-------+ 
 | cephfs.metadata.pool | metadata | 70.5G |  793G | 
 |  cephfs.data.pool1   |   data   |  183T | 1115T | 
 |  cephfs.data.pool2   |   data   |  299T | 1042T | 
 +----------------------+----------+-------+-------+ 
 +-------------+ 
 | Standby MDS | 
 +-------------+ 
 +-------------+ 
 MDS version: ceph version 14.2.20 (36274af6eb7f2a5055f2d53ad448f2694e9046a0) nautilus (stable) 
 </pre> 

 


 All MDS daemons crashed with the same assertion failure: 

 <pre> 
     

     -1> 2021-07-08 15:14:13.283 7f3804255700 -1 /builddir/build/BUILD/ceph-14.2.20/src/mds/MDLog.cc: In function 'void MDLog::_submit_entry(LogEvent*, MDSLogContextBase*)' thread 7f3804255700 time 2021-07-08 15:14:13.283719 
 /builddir/build/BUILD/ceph-14.2.20/src/mds/MDLog.cc: 288: FAILED ceph_assert(!segments.empty()) 

  ceph version 14.2.20 (36274af6eb7f2a5055f2d53ad448f2694e9046a0) nautilus (stable) 
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x7f380d72cfe7] 
  2: (()+0x25d1af) [0x7f380d72d1af] 
  3: (MDLog::_submit_entry(LogEvent*, MDSLogContextBase*)+0x599) [0x557471ec5959] 
  4: (Server::journal_close_session(Session*, int, Context*)+0x9ed) [0x557471c7e02d] 
  5: (Server::kill_session(Session*, Context*)+0x234) [0x557471c81914] 
  6: (Server::apply_blacklist(std::set<entity_addr_t, std::less<entity_addr_t>, std::allocator<entity_addr_t> > const&)+0x14d) [0x557471c8449d] 
  7: (MDSRank::reconnect_start()+0xcf) [0x557471c49c5f] 
  8: (MDSRankDispatcher::handle_mds_map(boost::intrusive_ptr<MMDSMap const> const&, MDSMap const&)+0x1c29) [0x557471c57979] 
  9: (MDSDaemon::handle_mds_map(boost::intrusive_ptr<MMDSMap const> const&)+0xa9b) [0x557471c3091b] 
  10: (MDSDaemon::handle_core_message(boost::intrusive_ptr<Message const> const&)+0xed) [0x557471c3216d] 
  11: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0xc3) [0x557471c32983] 
  12: (DispatchQueue::entry()+0x1699) [0x7f380d952b79] 
  13: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f380da008ed] 
  14: (()+0x7ea5) [0x7f380b5eeea5] 
  15: (clone()+0x6d) [0x7f380a29e96d] 
 </pre>
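
Reading the backtrace from the bottom up, the call chain is `handle_mds_map` → `reconnect_start` → `apply_blacklist` → `kill_session` → `journal_close_session` → `MDLog::_submit_entry`, and the failed check is `ceph_assert(!segments.empty())`. That is, a session-close event was submitted to the journal while the log had no segment to hold it. The sketch below is a toy model of that ordering hazard only. It is not Ceph code: `ToyLog`, `start_segment`, `submit_entry`, and this `kill_session` are hypothetical names, and why the segment list was empty on this cluster is not something the trace tells us.

```python
class ToyLog:
    """Toy stand-in for a segmented journal; NOT the Ceph MDLog API."""

    def __init__(self):
        # Each segment is modelled as a plain list of entries.
        self.segments = []

    def start_segment(self):
        # Open a new, empty segment at the tail of the log.
        self.segments.append([])

    def submit_entry(self, entry):
        # Models the check that failed in the trace:
        #   ceph_assert(!segments.empty())
        if not self.segments:
            raise AssertionError("FAILED ceph_assert(!segments.empty())")
        self.segments[-1].append(entry)


def kill_session(log, session_id):
    """Hypothetical session teardown that journals a close event."""
    log.submit_entry(("session_close", session_id))


def crashes_without_segment():
    """Return True if journaling before any segment exists trips the check."""
    log = ToyLog()
    try:
        kill_session(log, 1)
    except AssertionError:
        return True
    return False
```

In this model, calling `start_segment()` before `kill_session()` succeeds, while the reverse order raises. The real question for this bug is which path allowed the entry to be submitted before the journal had a segment.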
