Project

General

Profile

Bug #65225

Updated by Abhishek Lekshmanan about 1 month ago

In a workload that is heavily hardlinking and moving files, we see ceph-mds assert like the following  

 <pre> 
 16.2.9/src/mds/Locker.cc: In function 'bool Locker::acquire_locks(MDRequestRef&, Mutati> /var/log/ceph/ceph-mds.minilevinson-983db664a3.log-20240329.gz 
 ceph-mds[2693077]: /builddir/build/BUILD/ceph-16.2.9/src/mds/Locker.cc: 404: FAILED ceph_assert(dn->get_projected_linkage()->is_remot> 
 ceph-mds[2693077]:    ceph version 16.2.9-2 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific (stable) 
 ceph-mds[2693077]:    1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f40e912ed98] 
 ceph-mds[2693077]:    2: /usr/lib64/ceph/libceph-common.so.2(+0x276fb2) [0x7f40e912efb2] 
 ceph-mds[2693077]:    3: (Locker::acquire_locks(boost::intrusive_ptr<MDRequestImpl>&, MutationImpl::LockOpVec&, CInode*, bool)+0x131e) > 
 ceph-mds[2693077]:    4: (Server::handle_client_unlink(boost::intrusive_ptr<MDRequestImpl>&)+0xb43) [0x55eeadce4f73] 
 ceph-mds[2693077]:    5: (Server::dispatch_client_request(boost::intrusive_ptr<MDRequestImpl>&)+0xf1b) [0x55eeadd12d3b] 
 ceph-mds[2693077]:    6: (Server::handle_client_request(boost::intrusive_ptr<MClientRequest const> const&)+0x3fc) [0x55eeadd1348c] 
 ceph-mds[2693077]:    7: (Server::dispatch(boost::intrusive_ptr<Message const> const&)+0x12b) [0x55eeadd179fb] 
 ceph-mds[2693077]:    8: (MDSRank::handle_message(boost::intrusive_ptr<Message const> const&)+0xbb4) [0x55eeadc6e224] 
 ceph-mds[2693077]:    9: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x7bb) [0x55eeadc70bdb] 
 ceph-mds[2693077]:    10: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x55) [0x55eeadc711d5] 
 ceph-mds[2693077]:    11: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x108) [0x55eeadc60dc8] 
 ceph-mds[2693077]:    12: (DispatchQueue::entry()+0x126a) [0x7f40e9374b5a] 
 ceph-mds[2693077]:    13: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f40e9426841] 
 ceph-mds[2693077]:    14: /lib64/libpthread.so.0(+0x81ca) [0x7f40e810f1ca] 
 </pre> 

 The full logs look like the following 
 <pre> 
 -1 /builddir/build/BUILD/ceph-16.2.9/src/mds/Locker.cc: In function 'bool Locker::acquire_locks(MDRequestRef&, MutationImpl::LockOpV 
 ec&, CInode*, bool)' thread 7f40e06ef700 time 2024-03-28T19:20:43.701971+0100 
 /builddir/build/BUILD/ceph-16.2.9/src/mds/Locker.cc: 404: FAILED ceph_assert(dn->get_projected_linkage()->is_remote()) 

  ceph version 16.2.9-2 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific (stable) 
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f40e912ed98] 
  2: /usr/lib64/ceph/libceph-common.so.2(+0x276fb2) [0x7f40e912efb2] 
  3: (Locker::acquire_locks(boost::intrusive_ptr<MDRequestImpl>&, MutationImpl::LockOpVec&, CInode*, bool)+0x131e) [0x55eeade5669e] 
  4: (Server::handle_client_unlink(boost::intrusive_ptr<MDRequestImpl>&)+0xb43) [0x55eeadce4f73] 
  5: (Server::dispatch_client_request(boost::intrusive_ptr<MDRequestImpl>&)+0xf1b) [0x55eeadd12d3b] 
  6: (Server::handle_client_request(boost::intrusive_ptr<MClientRequest const> const&)+0x3fc) [0x55eeadd1348c] 
  7: (Server::dispatch(boost::intrusive_ptr<Message const> const&)+0x12b) [0x55eeadd179fb] 
  8: (MDSRank::handle_message(boost::intrusive_ptr<Message const> const&)+0xbb4) [0x55eeadc6e224] 
  9: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x7bb) [0x55eeadc70bdb] 
  10: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x55) [0x55eeadc711d5] 
  11: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x108) [0x55eeadc60dc8] 
  12: (DispatchQueue::entry()+0x126a) [0x7f40e9374b5a] 
  13: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f40e9426841] 
  14: /lib64/libpthread.so.0(+0x81ca) [0x7f40e810f1ca] 
  15: clone() 

      0> 2024-03-28T19:20:43.709+0100 7f40e06ef700 -1 *** Caught signal (Aborted) ** 
  in thread 7f40e06ef700 thread_name:ms_dispatch 

  ceph version 16.2.9-2 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific (stable) 
  1: /lib64/libpthread.so.0(+0x12cf0) [0x7f40e8119cf0] 
  2: gsignal() 
  3: abort() 
  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f40e912ede9] 
  5: /usr/lib64/ceph/libceph-common.so.2(+0x276fb2) [0x7f40e912efb2] 
  6: (Locker::acquire_locks(boost::intrusive_ptr<MDRequestImpl>&, MutationImpl::LockOpVec&, CInode*, bool)+0x131e) [0x55eeade5669e] 
  7: (Server::handle_client_unlink(boost::intrusive_ptr<MDRequestImpl>&)+0xb43) [0x55eeadce4f73] 
  8: (Server::dispatch_client_request(boost::intrusive_ptr<MDRequestImpl>&)+0xf1b) [0x55eeadd12d3b] 
  9: (Server::handle_client_request(boost::intrusive_ptr<MClientRequest const> const&)+0x3fc) [0x55eeadd1348c] 
  10: (Server::dispatch(boost::intrusive_ptr<Message const> const&)+0x12b) [0x55eeadd179fb] 
  11: (MDSRank::handle_message(boost::intrusive_ptr<Message const> const&)+0xbb4) [0x55eeadc6e224] 
  12: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x7bb) [0x55eeadc70bdb] 
  13: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x55) [0x55eeadc711d5] 
  14: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x108) [0x55eeadc60dc8] 
  15: (DispatchQueue::entry()+0x126a) [0x7f40e9374b5a] 
  16: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f40e9426841] 
  17: /lib64/libpthread.so.0(+0x81ca) [0x7f40e810f1ca] 
  18: clone() 
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. 
 </pre> 

 Some interesting log snippets from before  
 <pre> 
   -434> 2024-03-28T19:18:33.482+0100 7f40df6ed700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2024-03-28T19:18:03.483427+0100) 
   -433> 2024-03-28T19:18:33.482+0100 7f40df6ed700 10 log_client    log_queue is 6 last_log 1380 sent 1374 num 6 unsent 6 sending 6 
   -432> 2024-03-28T19:18:33.482+0100 7f40df6ed700 10 log_client    will send 2024-03-28T19:18:32.701758+0100 mds.minilevinson-983db664a3 (mds.0) 1375 : cluster [WRN] 5 slow requests, 5 
  included below; oldest blocked for > 13.152129 secs 
   -431> 2024-03-28T19:18:33.482+0100 7f40df6ed700 10 log_client    will send 2024-03-28T19:18:32.701765+0100 mds.minilevinson-983db664a3 (mds.0) 1376 : cluster [WRN] slow request 7.203 
 507 seconds old, received at 2024-03-28T19:18:25.497610+0100: client_request(mds.0:181210 rename #0x100026f338d/sh15637 #0x606/1000295a11a caller_uid=0, caller_gid=0{}) currently sub 
 mit entry: journal_and_reply 
   -430> 2024-03-28T19:18:33.482+0100 7f40df6ed700 10 log_client    will send 2024-03-28T19:18:32.701771+0100 mds.minilevinson-983db664a3 (mds.0) 1377 : cluster [WRN] slow request 6.519 
 057 seconds old, received at 2024-03-28T19:18:26.182061+0100: client_request(mds.0:181239 rename #0x100026f338d/sh15811 #0x608/1000295a58f caller_uid=0, caller_gid=0{}) currently sub 
 mit entry: journal_and_reply 
   -429> 2024-03-28T19:18:33.482+0100 7f40df6ed700 10 log_client    will send 2024-03-28T19:18:32.701774+0100 mds.minilevinson-983db664a3 (mds.0) 1378 : cluster [WRN] slow request 6.372 
 089 seconds old, received at 2024-03-28T19:18:26.329029+0100: client_request(mds.0:181300 rename #0x1000185ba3c/file117_hardmv_hard #0x603/10002567d82 caller_uid=0, caller_gid=0{}) c 
 urrently failed to wrlock, waiting 
   -428> 2024-03-28T19:18:33.482+0100 7f40df6ed700 10 log_client    will send 2024-03-28T19:18:32.701776+0100 mds.minilevinson-983db664a3 (mds.0) 1379 : cluster [WRN] slow request 6.371 
 098 seconds old, received at 2024-03-28T19:18:26.330020+0100: client_request(mds.0:181320 rename #0x1000185ba3c/file373_hardmv_hard #0x608/1000256807c caller_uid=0, caller_gid=0{}) c 
 urrently failed to wrlock, waiting 
   -427> 2024-03-28T19:18:33.482+0100 7f40df6ed700 10 log_client    will send 2024-03-28T19:18:32.701781+0100 mds.minilevinson-983db664a3 (mds.0) 1380 : cluster [WRN] slow request 5.565 
 985 seconds old, received at 2024-03-28T19:18:27.135132+0100: client_request(mds.0:181349 rename #0x100026f338d/sh15708 #0x606/1000295a111 caller_uid=0, caller_gid=0{}) currently fai 
 led to wrlock, waiting 
   -426> 2024-03-28T19:18:33.482+0100 7f40df6ed700 10 monclient: _send_mon_message to mon.minilevinson-f52cbe8096 at v2:188.184.88.255:3300/0 
   -425> 2024-03-28T19:18:33.765+0100 7f40e06ef700 10 log_client handle_log_ack log(last 1380) v1 
   -424> 2024-03-28T19:18:33.765+0100 7f40e06ef700 10 log_client    logged 2024-03-28T19:18:32.701758+0100 mds.minilevinson-983db664a3 (mds.0) 1375 : cluster [WRN] 5 slow requests, 5 in 
 cluded below; oldest blocked for > 13.152129 secs 
   -423> 2024-03-28T19:18:33.765+0100 7f40e06ef700 10 log_client    logged 2024-03-28T19:18:32.701765+0100 mds.minilevinson-983db664a3 (mds.0) 1376 : cluster [WRN] slow request 7.203507 
  seconds old, received at 2024-03-28T19:18:25.497610+0100: client_request(mds.0:181210 rename #0x100026f338d/sh15637 #0x606/1000295a11a caller_uid=0, caller_gid=0{}) currently submit 
  entry: journal_and_reply 
   -422> 2024-03-28T19:18:33.765+0100 7f40e06ef700 10 log_client    logged 2024-03-28T19:18:32.701771+0100 mds.minilevinson-983db664a3 (mds.0) 1377 : cluster [WRN] slow request 6.519057 
  seconds old, received at 2024-03-28T19:18:26.182061+0100: client_request(mds.0:181239 rename #0x100026f338d/sh15811 #0x608/1000295a58f caller_uid=0, caller_gid=0{}) currently submit 
  entry: journal_and_reply 
   -421> 2024-03-28T19:18:33.765+0100 7f40e06ef700 10 log_client    logged 2024-03-28T19:18:32.701774+0100 mds.minilevinson-983db664a3 (mds.0) 1378 : cluster [WRN] slow request 6.372089 
  seconds old, received at 2024-03-28T19:18:26.329029+0100: client_request(mds.0:181300 rename #0x1000185ba3c/file117_hardmv_hard #0x603/10002567d82 caller_uid=0, caller_gid=0{}) curr 
 ently failed to wrlock, waiting 
   -420> 2024-03-28T19:18:33.765+0100 7f40e06ef700 10 log_client    logged 2024-03-28T19:18:32.701776+0100 mds.minilevinson-983db664a3 (mds.0) 1379 : cluster [WRN] slow request 6.371098 
  seconds old, received at 2024-03-28T19:18:26.330020+0100: client_request(mds.0:181320 rename #0x1000185ba3c/file373_hardmv_hard #0x608/1000256807c caller_uid=0, caller_gid=0{}) curr 
 ently failed to wrlock, waiting 
   -419> 2024-03-28T19:18:33.765+0100 7f40e06ef700 10 log_client    logged 2024-03-28T19:18:32.701781+0100 mds.minilevinson-983db664a3 (mds.0) 1380 : cluster [WRN] slow request 5.565985 
  seconds old, received at 2024-03-28T19:18:27.135132+0100: client_request(mds.0:181349 rename #0x100026f338d/sh15708 #0x606/1000295a111 caller_uid=0, caller_gid=0{}) currently failed 
  to wrlock, waiting 
 </pre> 

 We've configured the mds `mds_op_complaint_time` to 5s in order to track some potential slow locks

Back