Bug #58041

mds: src/mds/Server.cc: 3231: FAILED ceph_assert(straydn->get_name() == straydname)

Added by Venky Shankar over 1 year ago. Updated about 1 year ago.

Status: Duplicate
Priority: Normal
Category: Correctness/Safety
Target version:
% Done: 0%
Source:
Tags:
Backport: pacific,quincy
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS): crash, multimds
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Nov  6 07:26:27 host0 /builddir/build/BUILD/ceph-16.2.8/src/mds/Server.cc: In function 'CDentry* Server::prepare_stray_dentry(MDRequestRef&, CInode*)' thread 7feb58dcd700 time 2022-11-06T13:26:27.233738+0000
Nov  6 07:26:27 host0 : /builddir/build/BUILD/ceph-16.2.8/src/mds/Server.cc: 3231: FAILED ceph_assert(straydn->get_name() == straydname)
Nov  6 07:26:27 host0 : ceph version 16.2.8-84.el8cp (c2980f2fd700e979d41b4bad2939bb90f0fe435c) pacific (stable)
Nov  6 07:26:27 host0 : 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7feb617faaa8]
Nov  6 07:26:27 host0 : 2: /usr/lib64/ceph/libceph-common.so.2(+0x277cc2) [0x7feb617facc2]
Nov  6 07:26:27 host0 : 3: (Server::prepare_stray_dentry(boost::intrusive_ptr<MDRequestImpl>&, CInode*)+0x95) [0x55aee13049e5]
Nov  6 07:26:27 host0 : 4: (Server::handle_client_rename(boost::intrusive_ptr<MDRequestImpl>&)+0x1091) [0x55aee132bff1]
Nov  6 07:26:27 host0 : 5: (Server::dispatch_client_request(boost::intrusive_ptr<MDRequestImpl>&)+0xe9a) [0x55aee135360a]
Nov  6 07:26:27 host0 : 6: (MDCache::dispatch_request(boost::intrusive_ptr<MDRequestImpl>&)+0x33) [0x55aee13fb9b3]
Nov  6 07:26:27 host0 : 7: (MDSContext::complete(int)+0x203) [0x55aee15b7ca3]
Nov  6 07:26:27 host0 : 8: (MDSCacheObject::finish_waiting(unsigned long, int)+0xce) [0x55aee15d9b4e]
Nov  6 07:26:27 host0 : 9: (Locker::eval_gather(SimpleLock*, bool, bool*, std::vector<MDSContext*, std::allocator<MDSContext*> >*)+0x13d6) [0x55aee148ca86]
Nov  6 07:26:27 host0 : 10: (CDentry::remove_client_lease(ClientLease*, Locker*)+0x466) [0x55aee14f1a06]
Nov  6 07:26:27 host0 : 11: (Locker::handle_client_lease(boost::intrusive_ptr<MClientLease const> const&)+0xc6a) [0x55aee147d2ea]
Nov  6 07:26:27 host0 : 12: (Locker::dispatch(boost::intrusive_ptr<Message const> const&)+0x134) [0x55aee149f944]
Nov  6 07:26:27 host0 : 13: (MDSRank::handle_message(boost::intrusive_ptr<Message const> const&)+0xbcc) [0x55aee12aeb6c]
Nov  6 07:26:27 host0 : 14: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x7bb) [0x55aee12b150b]
Nov  6 07:26:27 host0 : 15: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x55) [0x55aee12b1b05]
Nov  6 07:26:27 host0 : 16: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x108) [0x55aee12a16f8]
Nov  6 07:26:27 host0 : 17: (DispatchQueue::entry()+0x126a) [0x7feb61a428ba]
Nov  6 07:26:27 host0 : 18: (DispatchQueue::DispatchThread::entry()+0x11) [0x7feb61af4b81]
Nov  6 07:26:27 host0 : 19: /lib64/libpthread.so.0(+0x81cf) [0x7feb607dd1cf]
Nov  6 07:26:27 host0 : 20: clone()
Nov  6 07:26:27 host0 : *** Caught signal (Aborted) **
Nov  6 07:26:27 host0 : in thread 7feb58dcd700 thread_name:ms_dispatch
Nov  6 07:26:27 host0 : debug 2022-11-06T13:26:27.300+0000 7feb58dcd700 -1 /builddir/build/BUILD/ceph-16.2.8/src/mds/Server.cc: In function 'CDentry* Server::prepare_stray_dentry(MDRequestRef&, CInode*)' thread 7feb58dcd700 time 2022-11-06T13:26:27.233738+0000

This looks like something raced with a rename. Going by the backtrace, `handle_client_rename` was put on wait (maybe to revoke caps from another client); when woken up, it ran into the assert that verifies that the stray dentry name (from the MDRequest) matches the name generated from the inode number.
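
To make the failed check concrete, here is a minimal Python model of it. This is a sketch under stated assumptions, not the Ceph source: `Request`, `stray_dentry_name` and the Python `prepare_stray_dentry` are hypothetical stand-ins, and the rule that the stray name is the inode number in lowercase hex is an assumption (the report only says the name is generated from the inode number).

```python
from dataclasses import dataclass
from typing import Optional


def stray_dentry_name(ino: int) -> str:
    """Assumed naming rule: the inode number rendered in lowercase hex."""
    return format(ino, "x")


@dataclass
class Request:
    """Hypothetical stand-in for an MDRequest that caches a stray dentry name."""
    straydn_name: Optional[str] = None


def prepare_stray_dentry(req: Request, ino: int) -> str:
    """Return the stray dentry name for `ino`, reusing the one cached on `req`."""
    expected = stray_dentry_name(ino)
    if req.straydn_name is not None:
        # Models the failed check: a stray dentry cached on the request
        # must carry the name derived from the inode being handled now.
        assert req.straydn_name == expected, "straydn->get_name() == straydname"
        return req.straydn_name
    # First pass through the request: remember the name for any retry.
    req.straydn_name = expected
    return expected
```

In this model, a request that is retried after a wait passes the check as long as it still refers to the same inode. If the inode behind the request changed while it waited (for example because another operation raced with it), the cached name no longer matches and the assertion fires.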


Related issues

Related to CephFS - Bug #55332: Failure in snaptest-git-ceph.sh (it's an async unlink/create bug) Resolved

History

#1 Updated by Venky Shankar over 1 year ago

oh, and btw this was seen in ceph-16.2.8.

#2 Updated by Venky Shankar over 1 year ago

  • Labels (FS) multimds added

Another side note: the crash was seen when a directory pin was removed from the rank-0 MDS. Pinning the directory back again stops the crash.

#3 Updated by Milind Changire about 1 year ago

Due to unavailability of debug logs, there has been some speculation about the issue during discussion with Venky.
The issue here is most likely due to a file create op racing with a lagging async unlink op.

This specific issue has been addressed by Xiubo in his PR: https://github.com/ceph/ceph/pull/47399

#4 Updated by Venky Shankar about 1 year ago

  • Status changed from New to Duplicate

Milind Changire wrote:

> Due to unavailability of debug logs, there has been some speculation about the issue during discussion with Venky.
> The issue here is most likely due to a file create op racing with a lagging async unlink op.

It does look related to async unlink.

> This specific issue has been addressed by Xiubo in his PR: https://github.com/ceph/ceph/pull/47399

I think this is the correct fix - https://github.com/ceph/ceph/pull/46331

Closing this as the pacific backport (https://github.com/ceph/ceph/pull/48453) is pending merge. Please reopen if it's seen again.

#5 Updated by Venky Shankar about 1 year ago

  • Related to Bug #55332: Failure in snaptest-git-ceph.sh (it's an async unlink/create bug) added
