Bug #52280


Hi all,
We have a Nautilus 14.2.7 cluster with 3 MDSs.
Sometimes, under heavy load from Kubernetes pods, the MDSs keep restarting and fail on MDCache::add_inode.
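For readers not familiar with this assert: `ceph_assert(!p)` in `MDCache::add_inode` fails when the inode number of the inode being added is already present in the MDS's in-memory inode map, i.e. a duplicate ino. Below is a minimal standalone sketch of that duplicate-ino check, not the actual Ceph implementation; `SketchCache` and `SketchInode` are hypothetical names used only for illustration.

<pre>
// Minimal illustrative sketch (assumption: this is NOT the real Ceph code).
// It models a cache keyed by inode number, where adding an inode whose ino is
// already present trips an assert -- analogous to FAILED ceph_assert(!p).
#include <cassert>
#include <cstdint>
#include <unordered_map>

struct SketchInode {          // hypothetical stand-in for CInode
    uint64_t ino;             // inode number assigned to the new file
};

class SketchCache {           // hypothetical stand-in for MDCache
    std::unordered_map<uint64_t, SketchInode*> inode_map;  // ino -> cached inode
public:
    void add_inode(SketchInode* in) {
        auto& p = inode_map[in->ino];  // default-constructs nullptr if ino is absent
        assert(!p);                    // duplicate ino => assertion failure (the reported crash)
        p = in;
    }
};

int main() {
    SketchCache cache;
    SketchInode a{0x10000000001};
    SketchInode b{0x10000000001};  // same ino handed out twice, e.g. after ino reuse
    cache.add_inode(&a);
    cache.add_inode(&b);           // aborts here, mirroring the MDS crash
    return 0;
}
</pre>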

On one of the setups where this crash happened, we also noticed that the cephfs_metadata pool had grown large, to 1.3 TB.

Stack trace from the MDS log file:

<pre>
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.7/rpm/el7/BUILD/ceph-14.2.7/src/mds/MDCache.cc: In function 'void MDCache::add_inode(CInode*)' thread 7f657ec45700 time 2021-08-16 15:14:11.438857
 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.7/rpm/el7/BUILD/ceph-14.2.7/src/mds/MDCache.cc: 268: FAILED ceph_assert(!p) 

  ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable) 
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x7f658816b031] 
  2: (()+0x2661f9) [0x7f658816b1f9] 
  3: (()+0x20aeee) [0x5588cc076eee] 
  4: (Server::prepare_new_inode(boost::intrusive_ptr<MDRequestImpl>&, CDir*, inodeno_t, unsigned int, file_layout_t*)+0x2a4) [0x5588cc00a054] 
  5: (Server::handle_client_openc(boost::intrusive_ptr<MDRequestImpl>&)+0xcf1) [0x5588cc019da1] 
  6: (Server::dispatch_client_request(boost::intrusive_ptr<MDRequestImpl>&)+0xb5b) [0x5588cc040bbb] 
  7: (Server::handle_client_request(boost::intrusive_ptr<MClientRequest const> const&)+0x308) [0x5588cc041048] 
  8: (Server::dispatch(boost::intrusive_ptr<Message const> const&)+0x122) [0x5588cc04cb02] 
  9: (MDSRank::handle_deferrable_message(boost::intrusive_ptr<Message const> const&)+0x6dc) [0x5588cbfc315c] 
  10: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x7fa) [0x5588cbfc55ca] 
  11: (MDSRank::retry_dispatch(boost::intrusive_ptr<Message const> const&)+0x12) [0x5588cbfc5c12] 
  12: (MDSContext::complete(int)+0x74) [0x5588cc232b14] 
  13: (MDSRank::_advance_queues()+0xa4) [0x5588cbfc4634] 
  14: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x1d8) [0x5588cbfc4fa8] 
  15: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x40) [0x5588cbfc5b50] 
  16: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x108) [0x5588cbfb3078] 
  17: (DispatchQueue::entry()+0x1709) [0x7f65883819d9] 
  18: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f658842e9cd] 
  19: (()+0x7e65) [0x7f6586018e65] 
  20: (clone()+0x6d) [0x7f6584cc688d] 

      0> 2021-08-16 15:14:11.441 7f657ec45700 -1 *** Caught signal (Aborted) ** 
  in thread 7f657ec45700 thread_name:ms_dispatch 


</pre>

ceph df output:

<pre>
 RAW STORAGE: 
     CLASS       SIZE          AVAIL         USED          RAW USED       %RAW USED 
     ssd         8.7 TiB       5.0 TiB       3.8 TiB        3.8 TiB           43.29 
     TOTAL       8.7 TiB       5.0 TiB       3.8 TiB        3.8 TiB           43.29 

 POOLS: 
     POOL                            ID       STORED        OBJECTS       USED          %USED       MAX AVAIL 
     cephfs_data                      1       246 GiB       591.31k       499 GiB       13.36         1.6 TiB 
     cephfs_metadata                  2       1.5 TiB       561.84k       3.0 TiB       48.69         1.6 TiB 
     default.rgw.meta                 3           0 B             0           0 B           0         1.6 TiB 
     .rgw.root                        4       3.5 KiB             8       256 KiB           0         1.6 TiB 
     default.rgw.buckets.index        5           0 B             0           0 B           0         1.6 TiB 
     default.rgw.control              6           0 B             8           0 B           0         1.6 TiB 
     default.rgw.buckets.data         7           0 B             0           0 B           0         1.6 TiB 
     default.rgw.log                  8           0 B           207           0 B           0         1.6 TiB 
     volumes                          9       141 GiB        57.69k       282 GiB        8.01         1.6 TiB 
     backups                         10           0 B             0           0 B           0         1.6 TiB 
     metrics                         11           0 B             0           0 B           0         1.6 TiB 

 </pre>
