Project

General

Profile

Actions

Bug #65225

open

ceph_assert on dn->get_projected_linkage()->is_remote

Added by Abhishek Lekshmanan 30 days ago. Updated 27 days ago.

Status:
Triaged
Priority:
Normal
Assignee:
Category:
Correctness/Safety
Target version:
% Done:

0%

Source:
Tags:
Backport:
quincy,reef,squid
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
crash
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

In a workload that is heavily hardlinking and moving files, we see ceph-mds assert like the following

16.2.9/src/mds/Locker.cc: In function 'bool Locker::acquire_locks(MDRequestRef&, Mutati> /var/log/ceph/ceph-mds.minilevinson-983db664a3.log-20240329.gz
ceph-mds[2693077]: /builddir/build/BUILD/ceph-16.2.9/src/mds/Locker.cc: 404: FAILED ceph_assert(dn->get_projected_linkage()->is_remot>
ceph-mds[2693077]:  ceph version 16.2.9-2 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific (stable)
ceph-mds[2693077]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f40e912ed98]
ceph-mds[2693077]:  2: /usr/lib64/ceph/libceph-common.so.2(+0x276fb2) [0x7f40e912efb2]
ceph-mds[2693077]:  3: (Locker::acquire_locks(boost::intrusive_ptr<MDRequestImpl>&, MutationImpl::LockOpVec&, CInode*, bool)+0x131e) >
ceph-mds[2693077]:  4: (Server::handle_client_unlink(boost::intrusive_ptr<MDRequestImpl>&)+0xb43) [0x55eeadce4f73]
ceph-mds[2693077]:  5: (Server::dispatch_client_request(boost::intrusive_ptr<MDRequestImpl>&)+0xf1b) [0x55eeadd12d3b]
ceph-mds[2693077]:  6: (Server::handle_client_request(boost::intrusive_ptr<MClientRequest const> const&)+0x3fc) [0x55eeadd1348c]
ceph-mds[2693077]:  7: (Server::dispatch(boost::intrusive_ptr<Message const> const&)+0x12b) [0x55eeadd179fb]
ceph-mds[2693077]:  8: (MDSRank::handle_message(boost::intrusive_ptr<Message const> const&)+0xbb4) [0x55eeadc6e224]
ceph-mds[2693077]:  9: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x7bb) [0x55eeadc70bdb]
ceph-mds[2693077]:  10: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x55) [0x55eeadc711d5]
ceph-mds[2693077]:  11: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x108) [0x55eeadc60dc8]
ceph-mds[2693077]:  12: (DispatchQueue::entry()+0x126a) [0x7f40e9374b5a]
ceph-mds[2693077]:  13: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f40e9426841]
ceph-mds[2693077]:  14: /lib64/libpthread.so.0(+0x81ca) [0x7f40e810f1ca]

The full logs look like the following

-1 /builddir/build/BUILD/ceph-16.2.9/src/mds/Locker.cc: In function 'bool Locker::acquire_locks(MDRequestRef&, MutationImpl::LockOpV
ec&, CInode*, bool)' thread 7f40e06ef700 time 2024-03-28T19:20:43.701971+0100
/builddir/build/BUILD/ceph-16.2.9/src/mds/Locker.cc: 404: FAILED ceph_assert(dn->get_projected_linkage()->is_remote())

 ceph version 16.2.9-2 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f40e912ed98]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x276fb2) [0x7f40e912efb2]
 3: (Locker::acquire_locks(boost::intrusive_ptr<MDRequestImpl>&, MutationImpl::LockOpVec&, CInode*, bool)+0x131e) [0x55eeade5669e]
 4: (Server::handle_client_unlink(boost::intrusive_ptr<MDRequestImpl>&)+0xb43) [0x55eeadce4f73]
 5: (Server::dispatch_client_request(boost::intrusive_ptr<MDRequestImpl>&)+0xf1b) [0x55eeadd12d3b]
 6: (Server::handle_client_request(boost::intrusive_ptr<MClientRequest const> const&)+0x3fc) [0x55eeadd1348c]
 7: (Server::dispatch(boost::intrusive_ptr<Message const> const&)+0x12b) [0x55eeadd179fb]
 8: (MDSRank::handle_message(boost::intrusive_ptr<Message const> const&)+0xbb4) [0x55eeadc6e224]
 9: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x7bb) [0x55eeadc70bdb]
 10: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x55) [0x55eeadc711d5]
 11: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x108) [0x55eeadc60dc8]
 12: (DispatchQueue::entry()+0x126a) [0x7f40e9374b5a]
 13: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f40e9426841]
 14: /lib64/libpthread.so.0(+0x81ca) [0x7f40e810f1ca]
 15: clone()

     0> 2024-03-28T19:20:43.709+0100 7f40e06ef700 -1 *** Caught signal (Aborted) **
 in thread 7f40e06ef700 thread_name:ms_dispatch

 ceph version 16.2.9-2 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific (stable)
 1: /lib64/libpthread.so.0(+0x12cf0) [0x7f40e8119cf0]
 2: gsignal()
 3: abort()
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f40e912ede9]
 5: /usr/lib64/ceph/libceph-common.so.2(+0x276fb2) [0x7f40e912efb2]
 6: (Locker::acquire_locks(boost::intrusive_ptr<MDRequestImpl>&, MutationImpl::LockOpVec&, CInode*, bool)+0x131e) [0x55eeade5669e]
 7: (Server::handle_client_unlink(boost::intrusive_ptr<MDRequestImpl>&)+0xb43) [0x55eeadce4f73]
 8: (Server::dispatch_client_request(boost::intrusive_ptr<MDRequestImpl>&)+0xf1b) [0x55eeadd12d3b]
 9: (Server::handle_client_request(boost::intrusive_ptr<MClientRequest const> const&)+0x3fc) [0x55eeadd1348c]
 10: (Server::dispatch(boost::intrusive_ptr<Message const> const&)+0x12b) [0x55eeadd179fb]
 11: (MDSRank::handle_message(boost::intrusive_ptr<Message const> const&)+0xbb4) [0x55eeadc6e224]
 12: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x7bb) [0x55eeadc70bdb]
 13: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x55) [0x55eeadc711d5]
 14: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x108) [0x55eeadc60dc8]
 15: (DispatchQueue::entry()+0x126a) [0x7f40e9374b5a]
 16: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f40e9426841]
 17: /lib64/libpthread.so.0(+0x81ca) [0x7f40e810f1ca]
 18: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Some interesting log snippets from before

  -434> 2024-03-28T19:18:33.482+0100 7f40df6ed700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2024-03-28T19:18:03.483427+0100)
  -433> 2024-03-28T19:18:33.482+0100 7f40df6ed700 10 log_client  log_queue is 6 last_log 1380 sent 1374 num 6 unsent 6 sending 6
  -432> 2024-03-28T19:18:33.482+0100 7f40df6ed700 10 log_client  will send 2024-03-28T19:18:32.701758+0100 mds.minilevinson-983db664a3 (mds.0) 1375 : cluster [WRN] 5 slow requests, 5
 included below; oldest blocked for > 13.152129 secs
  -431> 2024-03-28T19:18:33.482+0100 7f40df6ed700 10 log_client  will send 2024-03-28T19:18:32.701765+0100 mds.minilevinson-983db664a3 (mds.0) 1376 : cluster [WRN] slow request 7.203
507 seconds old, received at 2024-03-28T19:18:25.497610+0100: client_request(mds.0:181210 rename #0x100026f338d/sh15637 #0x606/1000295a11a caller_uid=0, caller_gid=0{}) currently sub
mit entry: journal_and_reply
  -430> 2024-03-28T19:18:33.482+0100 7f40df6ed700 10 log_client  will send 2024-03-28T19:18:32.701771+0100 mds.minilevinson-983db664a3 (mds.0) 1377 : cluster [WRN] slow request 6.519
057 seconds old, received at 2024-03-28T19:18:26.182061+0100: client_request(mds.0:181239 rename #0x100026f338d/sh15811 #0x608/1000295a58f caller_uid=0, caller_gid=0{}) currently sub
mit entry: journal_and_reply
  -429> 2024-03-28T19:18:33.482+0100 7f40df6ed700 10 log_client  will send 2024-03-28T19:18:32.701774+0100 mds.minilevinson-983db664a3 (mds.0) 1378 : cluster [WRN] slow request 6.372
089 seconds old, received at 2024-03-28T19:18:26.329029+0100: client_request(mds.0:181300 rename #0x1000185ba3c/file117_hardmv_hard #0x603/10002567d82 caller_uid=0, caller_gid=0{}) c
urrently failed to wrlock, waiting
  -428> 2024-03-28T19:18:33.482+0100 7f40df6ed700 10 log_client  will send 2024-03-28T19:18:32.701776+0100 mds.minilevinson-983db664a3 (mds.0) 1379 : cluster [WRN] slow request 6.371
098 seconds old, received at 2024-03-28T19:18:26.330020+0100: client_request(mds.0:181320 rename #0x1000185ba3c/file373_hardmv_hard #0x608/1000256807c caller_uid=0, caller_gid=0{}) c
urrently failed to wrlock, waiting
  -427> 2024-03-28T19:18:33.482+0100 7f40df6ed700 10 log_client  will send 2024-03-28T19:18:32.701781+0100 mds.minilevinson-983db664a3 (mds.0) 1380 : cluster [WRN] slow request 5.565
985 seconds old, received at 2024-03-28T19:18:27.135132+0100: client_request(mds.0:181349 rename #0x100026f338d/sh15708 #0x606/1000295a111 caller_uid=0, caller_gid=0{}) currently fai
led to wrlock, waiting
  -426> 2024-03-28T19:18:33.482+0100 7f40df6ed700 10 monclient: _send_mon_message to mon.minilevinson-f52cbe8096 at v2:188.184.88.255:3300/0
  -425> 2024-03-28T19:18:33.765+0100 7f40e06ef700 10 log_client handle_log_ack log(last 1380) v1
  -424> 2024-03-28T19:18:33.765+0100 7f40e06ef700 10 log_client  logged 2024-03-28T19:18:32.701758+0100 mds.minilevinson-983db664a3 (mds.0) 1375 : cluster [WRN] 5 slow requests, 5 in
cluded below; oldest blocked for > 13.152129 secs
  -423> 2024-03-28T19:18:33.765+0100 7f40e06ef700 10 log_client  logged 2024-03-28T19:18:32.701765+0100 mds.minilevinson-983db664a3 (mds.0) 1376 : cluster [WRN] slow request 7.203507
 seconds old, received at 2024-03-28T19:18:25.497610+0100: client_request(mds.0:181210 rename #0x100026f338d/sh15637 #0x606/1000295a11a caller_uid=0, caller_gid=0{}) currently submit
 entry: journal_and_reply
  -422> 2024-03-28T19:18:33.765+0100 7f40e06ef700 10 log_client  logged 2024-03-28T19:18:32.701771+0100 mds.minilevinson-983db664a3 (mds.0) 1377 : cluster [WRN] slow request 6.519057
 seconds old, received at 2024-03-28T19:18:26.182061+0100: client_request(mds.0:181239 rename #0x100026f338d/sh15811 #0x608/1000295a58f caller_uid=0, caller_gid=0{}) currently submit
 entry: journal_and_reply
  -421> 2024-03-28T19:18:33.765+0100 7f40e06ef700 10 log_client  logged 2024-03-28T19:18:32.701774+0100 mds.minilevinson-983db664a3 (mds.0) 1378 : cluster [WRN] slow request 6.372089
 seconds old, received at 2024-03-28T19:18:26.329029+0100: client_request(mds.0:181300 rename #0x1000185ba3c/file117_hardmv_hard #0x603/10002567d82 caller_uid=0, caller_gid=0{}) curr
ently failed to wrlock, waiting
  -420> 2024-03-28T19:18:33.765+0100 7f40e06ef700 10 log_client  logged 2024-03-28T19:18:32.701776+0100 mds.minilevinson-983db664a3 (mds.0) 1379 : cluster [WRN] slow request 6.371098
 seconds old, received at 2024-03-28T19:18:26.330020+0100: client_request(mds.0:181320 rename #0x1000185ba3c/file373_hardmv_hard #0x608/1000256807c caller_uid=0, caller_gid=0{}) curr
ently failed to wrlock, waiting
  -419> 2024-03-28T19:18:33.765+0100 7f40e06ef700 10 log_client  logged 2024-03-28T19:18:32.701781+0100 mds.minilevinson-983db664a3 (mds.0) 1380 : cluster [WRN] slow request 5.565985
 seconds old, received at 2024-03-28T19:18:27.135132+0100: client_request(mds.0:181349 rename #0x100026f338d/sh15708 #0x606/1000295a111 caller_uid=0, caller_gid=0{}) currently failed
 to wrlock, waiting

We've configured the mds `mds_op_complaint_time` to 5s in order to track some potential slow locks

Actions #1

Updated by Abhishek Lekshmanan 30 days ago

  • Description updated (diff)
  • Component(FS) MDS added
  • Labels (FS) crash added
Actions #2

Updated by Venky Shankar 27 days ago

  • Category set to Correctness/Safety
  • Assignee set to Venky Shankar
  • Target version set to v20.0.0
  • Backport set to quincy,reef,squid
Actions #3

Updated by Venky Shankar 27 days ago

  • Status changed from New to Triaged
Actions

Also available in: Atom PDF