Actions
Bug #65225
openceph_assert on dn->get_projected_linkage()->is_remote
Status:
Triaged
Priority:
Normal
Assignee:
Category:
Correctness/Safety
Target version:
% Done:
0%
Source:
Tags:
Backport:
quincy,reef,squid
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
crash
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
In a workload that is heavily hardlinking and moving files, we see ceph-mds assert like the following
16.2.9/src/mds/Locker.cc: In function 'bool Locker::acquire_locks(MDRequestRef&, Mutati> /var/log/ceph/ceph-mds.minilevinson-983db664a3.log-20240329.gz ceph-mds[2693077]: /builddir/build/BUILD/ceph-16.2.9/src/mds/Locker.cc: 404: FAILED ceph_assert(dn->get_projected_linkage()->is_remot> ceph-mds[2693077]: ceph version 16.2.9-2 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific (stable) ceph-mds[2693077]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f40e912ed98] ceph-mds[2693077]: 2: /usr/lib64/ceph/libceph-common.so.2(+0x276fb2) [0x7f40e912efb2] ceph-mds[2693077]: 3: (Locker::acquire_locks(boost::intrusive_ptr<MDRequestImpl>&, MutationImpl::LockOpVec&, CInode*, bool)+0x131e) > ceph-mds[2693077]: 4: (Server::handle_client_unlink(boost::intrusive_ptr<MDRequestImpl>&)+0xb43) [0x55eeadce4f73] ceph-mds[2693077]: 5: (Server::dispatch_client_request(boost::intrusive_ptr<MDRequestImpl>&)+0xf1b) [0x55eeadd12d3b] ceph-mds[2693077]: 6: (Server::handle_client_request(boost::intrusive_ptr<MClientRequest const> const&)+0x3fc) [0x55eeadd1348c] ceph-mds[2693077]: 7: (Server::dispatch(boost::intrusive_ptr<Message const> const&)+0x12b) [0x55eeadd179fb] ceph-mds[2693077]: 8: (MDSRank::handle_message(boost::intrusive_ptr<Message const> const&)+0xbb4) [0x55eeadc6e224] ceph-mds[2693077]: 9: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x7bb) [0x55eeadc70bdb] ceph-mds[2693077]: 10: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x55) [0x55eeadc711d5] ceph-mds[2693077]: 11: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x108) [0x55eeadc60dc8] ceph-mds[2693077]: 12: (DispatchQueue::entry()+0x126a) [0x7f40e9374b5a] ceph-mds[2693077]: 13: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f40e9426841] ceph-mds[2693077]: 14: /lib64/libpthread.so.0(+0x81ca) [0x7f40e810f1ca]
The full logs look like the following
-1 /builddir/build/BUILD/ceph-16.2.9/src/mds/Locker.cc: In function 'bool Locker::acquire_locks(MDRequestRef&, MutationImpl::LockOpV ec&, CInode*, bool)' thread 7f40e06ef700 time 2024-03-28T19:20:43.701971+0100 /builddir/build/BUILD/ceph-16.2.9/src/mds/Locker.cc: 404: FAILED ceph_assert(dn->get_projected_linkage()->is_remote()) ceph version 16.2.9-2 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific (stable) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f40e912ed98] 2: /usr/lib64/ceph/libceph-common.so.2(+0x276fb2) [0x7f40e912efb2] 3: (Locker::acquire_locks(boost::intrusive_ptr<MDRequestImpl>&, MutationImpl::LockOpVec&, CInode*, bool)+0x131e) [0x55eeade5669e] 4: (Server::handle_client_unlink(boost::intrusive_ptr<MDRequestImpl>&)+0xb43) [0x55eeadce4f73] 5: (Server::dispatch_client_request(boost::intrusive_ptr<MDRequestImpl>&)+0xf1b) [0x55eeadd12d3b] 6: (Server::handle_client_request(boost::intrusive_ptr<MClientRequest const> const&)+0x3fc) [0x55eeadd1348c] 7: (Server::dispatch(boost::intrusive_ptr<Message const> const&)+0x12b) [0x55eeadd179fb] 8: (MDSRank::handle_message(boost::intrusive_ptr<Message const> const&)+0xbb4) [0x55eeadc6e224] 9: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x7bb) [0x55eeadc70bdb] 10: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x55) [0x55eeadc711d5] 11: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x108) [0x55eeadc60dc8] 12: (DispatchQueue::entry()+0x126a) [0x7f40e9374b5a] 13: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f40e9426841] 14: /lib64/libpthread.so.0(+0x81ca) [0x7f40e810f1ca] 15: clone() 0> 2024-03-28T19:20:43.709+0100 7f40e06ef700 -1 *** Caught signal (Aborted) ** in thread 7f40e06ef700 thread_name:ms_dispatch ceph version 16.2.9-2 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific (stable) 1: /lib64/libpthread.so.0(+0x12cf0) [0x7f40e8119cf0] 2: gsignal() 3: abort() 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f40e912ede9] 5: /usr/lib64/ceph/libceph-common.so.2(+0x276fb2) [0x7f40e912efb2] 6: (Locker::acquire_locks(boost::intrusive_ptr<MDRequestImpl>&, MutationImpl::LockOpVec&, CInode*, bool)+0x131e) [0x55eeade5669e] 7: (Server::handle_client_unlink(boost::intrusive_ptr<MDRequestImpl>&)+0xb43) [0x55eeadce4f73] 8: (Server::dispatch_client_request(boost::intrusive_ptr<MDRequestImpl>&)+0xf1b) [0x55eeadd12d3b] 9: (Server::handle_client_request(boost::intrusive_ptr<MClientRequest const> const&)+0x3fc) [0x55eeadd1348c] 10: (Server::dispatch(boost::intrusive_ptr<Message const> const&)+0x12b) [0x55eeadd179fb] 11: (MDSRank::handle_message(boost::intrusive_ptr<Message const> const&)+0xbb4) [0x55eeadc6e224] 12: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x7bb) [0x55eeadc70bdb] 13: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x55) [0x55eeadc711d5] 14: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x108) [0x55eeadc60dc8] 15: (DispatchQueue::entry()+0x126a) [0x7f40e9374b5a] 16: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f40e9426841] 17: /lib64/libpthread.so.0(+0x81ca) [0x7f40e810f1ca] 18: clone() NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Some interesting log snippets from before
-434> 2024-03-28T19:18:33.482+0100 7f40df6ed700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2024-03-28T19:18:03.483427+0100) -433> 2024-03-28T19:18:33.482+0100 7f40df6ed700 10 log_client log_queue is 6 last_log 1380 sent 1374 num 6 unsent 6 sending 6 -432> 2024-03-28T19:18:33.482+0100 7f40df6ed700 10 log_client will send 2024-03-28T19:18:32.701758+0100 mds.minilevinson-983db664a3 (mds.0) 1375 : cluster [WRN] 5 slow requests, 5 included below; oldest blocked for > 13.152129 secs -431> 2024-03-28T19:18:33.482+0100 7f40df6ed700 10 log_client will send 2024-03-28T19:18:32.701765+0100 mds.minilevinson-983db664a3 (mds.0) 1376 : cluster [WRN] slow request 7.203 507 seconds old, received at 2024-03-28T19:18:25.497610+0100: client_request(mds.0:181210 rename #0x100026f338d/sh15637 #0x606/1000295a11a caller_uid=0, caller_gid=0{}) currently sub mit entry: journal_and_reply -430> 2024-03-28T19:18:33.482+0100 7f40df6ed700 10 log_client will send 2024-03-28T19:18:32.701771+0100 mds.minilevinson-983db664a3 (mds.0) 1377 : cluster [WRN] slow request 6.519 057 seconds old, received at 2024-03-28T19:18:26.182061+0100: client_request(mds.0:181239 rename #0x100026f338d/sh15811 #0x608/1000295a58f caller_uid=0, caller_gid=0{}) currently sub mit entry: journal_and_reply -429> 2024-03-28T19:18:33.482+0100 7f40df6ed700 10 log_client will send 2024-03-28T19:18:32.701774+0100 mds.minilevinson-983db664a3 (mds.0) 1378 : cluster [WRN] slow request 6.372 089 seconds old, received at 2024-03-28T19:18:26.329029+0100: client_request(mds.0:181300 rename #0x1000185ba3c/file117_hardmv_hard #0x603/10002567d82 caller_uid=0, caller_gid=0{}) c urrently failed to wrlock, waiting -428> 2024-03-28T19:18:33.482+0100 7f40df6ed700 10 log_client will send 2024-03-28T19:18:32.701776+0100 mds.minilevinson-983db664a3 (mds.0) 1379 : cluster [WRN] slow request 6.371 098 seconds old, received at 2024-03-28T19:18:26.330020+0100: client_request(mds.0:181320 rename #0x1000185ba3c/file373_hardmv_hard #0x608/1000256807c caller_uid=0, caller_gid=0{}) c urrently failed to wrlock, waiting -427> 2024-03-28T19:18:33.482+0100 7f40df6ed700 10 log_client will send 2024-03-28T19:18:32.701781+0100 mds.minilevinson-983db664a3 (mds.0) 1380 : cluster [WRN] slow request 5.565 985 seconds old, received at 2024-03-28T19:18:27.135132+0100: client_request(mds.0:181349 rename #0x100026f338d/sh15708 #0x606/1000295a111 caller_uid=0, caller_gid=0{}) currently fai led to wrlock, waiting -426> 2024-03-28T19:18:33.482+0100 7f40df6ed700 10 monclient: _send_mon_message to mon.minilevinson-f52cbe8096 at v2:188.184.88.255:3300/0 -425> 2024-03-28T19:18:33.765+0100 7f40e06ef700 10 log_client handle_log_ack log(last 1380) v1 -424> 2024-03-28T19:18:33.765+0100 7f40e06ef700 10 log_client logged 2024-03-28T19:18:32.701758+0100 mds.minilevinson-983db664a3 (mds.0) 1375 : cluster [WRN] 5 slow requests, 5 in cluded below; oldest blocked for > 13.152129 secs -423> 2024-03-28T19:18:33.765+0100 7f40e06ef700 10 log_client logged 2024-03-28T19:18:32.701765+0100 mds.minilevinson-983db664a3 (mds.0) 1376 : cluster [WRN] slow request 7.203507 seconds old, received at 2024-03-28T19:18:25.497610+0100: client_request(mds.0:181210 rename #0x100026f338d/sh15637 #0x606/1000295a11a caller_uid=0, caller_gid=0{}) currently submit entry: journal_and_reply -422> 2024-03-28T19:18:33.765+0100 7f40e06ef700 10 log_client logged 2024-03-28T19:18:32.701771+0100 mds.minilevinson-983db664a3 (mds.0) 1377 : cluster [WRN] slow request 6.519057 seconds old, received at 2024-03-28T19:18:26.182061+0100: client_request(mds.0:181239 rename #0x100026f338d/sh15811 #0x608/1000295a58f caller_uid=0, caller_gid=0{}) currently submit entry: journal_and_reply -421> 2024-03-28T19:18:33.765+0100 7f40e06ef700 10 log_client logged 2024-03-28T19:18:32.701774+0100 mds.minilevinson-983db664a3 (mds.0) 1378 : cluster [WRN] slow request 6.372089 seconds old, received at 2024-03-28T19:18:26.329029+0100: client_request(mds.0:181300 rename #0x1000185ba3c/file117_hardmv_hard #0x603/10002567d82 caller_uid=0, caller_gid=0{}) curr ently failed to wrlock, waiting -420> 2024-03-28T19:18:33.765+0100 7f40e06ef700 10 log_client logged 2024-03-28T19:18:32.701776+0100 mds.minilevinson-983db664a3 (mds.0) 1379 : cluster [WRN] slow request 6.371098 seconds old, received at 2024-03-28T19:18:26.330020+0100: client_request(mds.0:181320 rename #0x1000185ba3c/file373_hardmv_hard #0x608/1000256807c caller_uid=0, caller_gid=0{}) curr ently failed to wrlock, waiting -419> 2024-03-28T19:18:33.765+0100 7f40e06ef700 10 log_client logged 2024-03-28T19:18:32.701781+0100 mds.minilevinson-983db664a3 (mds.0) 1380 : cluster [WRN] slow request 5.565985 seconds old, received at 2024-03-28T19:18:27.135132+0100: client_request(mds.0:181349 rename #0x100026f338d/sh15708 #0x606/1000295a111 caller_uid=0, caller_gid=0{}) currently failed to wrlock, waiting
We've configured the mds `mds_op_complaint_time` to 5s in order to track some potential slow locks
Updated by Abhishek Lekshmanan about 1 month ago
- Description updated (diff)
- Component(FS) MDS added
- Labels (FS) crash added
Updated by Venky Shankar about 1 month ago
- Category set to Correctness/Safety
- Assignee set to Venky Shankar
- Target version set to v20.0.0
- Backport set to quincy,reef,squid
Actions