Bug #45261
mds: FAILED assert(locking == lock) in MutationImpl::finish_locking
Status: Closed
Description
Hi,
We got two identical crashes a few minutes apart on two different active MDS's:
2020-04-24 12:57:38.253616 7fb5a2485700 -1 /builddir/build/BUILD/ceph-12.2.12/src/mds/Mutation.cc: In function 'void MutationImpl::finish_locking(SimpleLock*)' thread 7fb5a2485700 time 2020-04-24 12:57:38.246037
/builddir/build/BUILD/ceph-12.2.12/src/mds/Mutation.cc: 67: FAILED assert(locking == lock)

 ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x55d0955c75f0]
 2: (()+0x38963f) [0x55d09533263f]
 3: (Locker::xlock_start(SimpleLock*, boost::intrusive_ptr<MDRequestImpl>&)+0x403) [0x55d09540e2b3]
 4: (Locker::acquire_locks(boost::intrusive_ptr<MDRequestImpl>&, std::set<SimpleLock*, std::less<SimpleLock*>, std::allocator<SimpleLock*> >&, std::set<SimpleLock*, std::less<SimpleLock*>, std::allocator<SimpleLock*> >&, std::set<SimpleLock*, std::less<SimpleLock*>, std::allocator<SimpleLock*> >&, std::map<SimpleLock*, int, std::less<SimpleLock*>, std::allocator<std::pair<SimpleLock* const, int> > >*, CInode*, bool)+0x1faa) [0x55d09541ce7a]
 5: (Server::handle_client_setattr(boost::intrusive_ptr<MDRequestImpl>&)+0x23c) [0x55d0952d548c]
 6: (Server::dispatch_client_request(boost::intrusive_ptr<MDRequestImpl>&)+0xceb) [0x55d09530d7db]
 7: (MDSInternalContextBase::complete(int)+0x1eb) [0x55d09550e5ab]
 8: (void finish_contexts<MDSInternalContextBase>(CephContext*, std::list<MDSInternalContextBase*, std::allocator<MDSInternalContextBase*> >&, int)+0xac) [0x55d09527824c]
 9: (Locker::eval(CInode*, int, bool)+0x127) [0x55d095415f37]
 10: (Locker::handle_client_caps(MClientCaps*)+0x144f) [0x55d09542c1ff]
 11: (Locker::dispatch(Message*)+0xa5) [0x55d09542db95]
 12: (MDSRank::handle_deferrable_message(Message*)+0xbb4) [0x55d09527e484]
 13: (MDSRank::_dispatch(Message*, bool)+0x1e3) [0x55d095295de3]
 14: (MDSRankDispatcher::ms_dispatch(Message*)+0xa8) [0x55d095296db8]
 15: (MDSDaemon::ms_dispatch(Message*)+0xf3) [0x55d095274ef3]
 16: (DispatchQueue::entry()+0x792) [0x55d0958cba42]
 17: (DispatchQueue::DispatchThread::entry()+0xd) [0x55d09564f3ed]
 18: (()+0x7e65) [0x7fb5a74d6e65]
 19: (clone()+0x6d) [0x7fb5a65b188d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
The crashes were identical, but they came from two different clients, on different inodes in different directories.
We have coredumps for further debugging.
Here we see locking is 0x0:
(gdb) up
#8  0x000055d09540e2b3 in Locker::xlock_start (this=this@entry=0x55d0a078a1b0, lock=0x55d282174590, mut=...) at /usr/src/debug/ceph-12.2.12/src/mds/Locker.cc:1661
1661        mut->finish_locking(lock);
(gdb) p lock
$2 = (SimpleLock *) 0x55d282174590
(gdb) p ((MutationImpl *)mut).locking
$10 = (SimpleLock *) 0x0
(gdb)
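For context, the assertion fires because finish_locking() expects to be handed the same lock that a matching start_locking() recorded in the mutation's locking field. The following is a self-contained toy model of that invariant, not Ceph code; the class and member names only loosely mirror the real MutationImpl/SimpleLock:

// Toy model (not Ceph code) of the start_locking()/finish_locking()
// invariant that the MDS asserts on.
#include <cassert>
#include <cstdio>

struct SimpleLock {};

struct MutationImpl {
  SimpleLock *locking = nullptr;   // the lock we are currently acquiring

  void start_locking(SimpleLock *lock) { locking = lock; }

  void finish_locking(SimpleLock *lock) {
    assert(locking == lock);       // the check that fails in this bug
    locking = nullptr;
  }
};

int main() {
  SimpleLock a;
  MutationImpl mut;

  mut.start_locking(&a);
  mut.finish_locking(&a);          // fine: locking == &a

  // In the crash, xlock_start() reaches finish_locking() while
  // mut.locking is still nullptr (no matching start_locking() for this
  // request), so the assert would trip:
  // mut.finish_locking(&a);       // would abort: locking == nullptr
  printf("invariant held\n");
  return 0;
}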
Updated by Dan van der Ster about 4 years ago
I got in touch with the user who triggered this. It seems they (accidentally) had two identical jobs running on two different nodes at the same time. The jobs do this:
- The first thing the job does is create a temp dir (deleting it first if it already exists); it then populates the temp dir, deletes the final destination if it exists, and finally renames the temp dir to the final destination.
- If two such jobs run at once, one will at some point delete files the other one expects, making the other one fail (see the sketch after this list).
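To make the race concrete, here is a hypothetical sketch of the job pattern described above. The paths, the populate() helper, and the use of std::filesystem are invented for illustration; the real jobs presumably did the equivalent through shell commands or their own tooling:

// Hypothetical reconstruction of the user's job pattern (build with -std=c++17).
#include <filesystem>
#include <fstream>
namespace fs = std::filesystem;

// Stand-in for whatever output the real job produces.
static void populate(const fs::path &dir) {
  std::ofstream(dir / "result.txt") << "job output\n";
}

static void run_job(const fs::path &tmp, const fs::path &dest) {
  fs::remove_all(tmp);            // delete the temp dir if it already exists
  fs::create_directories(tmp);
  populate(tmp);                  // fill the temp dir
  fs::remove_all(dest);           // delete the final destination if it exists
  fs::rename(tmp, dest);          // rename the temp dir to the final destination
}

int main() {
  // Imagine both paths live on a shared CephFS mount and two nodes run
  // this concurrently: each remove_all() can delete files the other job
  // still expects, so one job fails, while the MDS handles the competing
  // unlink/rename requests racing for locks on the same dentries/inodes.
  run_job("job.tmp", "job_final");
  return 0;
}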
Updated by Zheng Yan about 4 years ago
diff --git a/src/mds/Locker.cc b/src/mds/Locker.cc
index b2b4c21dc4..a92bbde879 100644
--- a/src/mds/Locker.cc
+++ b/src/mds/Locker.cc
@@ -1651,7 +1651,8 @@ bool Locker::xlock_start(SimpleLock *lock, MDRequestRef& mut)
   if (lock->get_parent()->is_auth()) {
     // auth
     while (1) {
-      if (lock->can_xlock(client) &&
+      if (mut->locking &&
+          lock->can_xlock(client) &&
           !(lock->get_state() == LOCK_LOCK_XLOCK &&     // client is not xlocker or
             in && in->issued_caps_need_gather(lock))) { // xlocker does not hold shared cap
         lock->set_state(LOCK_XLOCK);
or
diff --git a/src/mds/Locker.cc b/src/mds/Locker.cc
index b2b4c21dc4..1f96dca3b0 100644
--- a/src/mds/Locker.cc
+++ b/src/mds/Locker.cc
@@ -1658,7 +1658,8 @@ bool Locker::xlock_start(SimpleLock *lock, MDRequestRef& mut)
         lock->get_xlock(mut, client);
         mut->xlocks.insert(lock);
         mut->locks.insert(lock);
-        mut->finish_locking(lock);
+        if (mut->locking)
+          mut->finish_locking(lock);
         return true;
       }
should fix this issue. I prefer the first one because it keeps the ordering of xlockers. I will create a PR later.
Updated by Zheng Yan about 4 years ago
- Status changed from New to Fix Under Review
- Pull request ID set to 34757
Updated by Patrick Donnelly almost 4 years ago
- Subject changed from mds FAILED assert(locking == lock) in MutationImpl::finish_locking to mds: FAILED assert(locking == lock) in MutationImpl::finish_locking
- Assignee set to Zheng Yan
- Target version set to v16.0.0
- Source set to Community (dev)
- Backport set to octopus,nautilus,luminous
- Component(FS) MDS added
- Labels (FS) crash added
Updated by Patrick Donnelly almost 4 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Nathan Cutler almost 4 years ago
- Copied to Backport #45685: octopus: mds: FAILED assert(locking == lock) in MutationImpl::finish_locking added
Updated by Nathan Cutler almost 4 years ago
- Copied to Backport #45686: nautilus: mds: FAILED assert(locking == lock) in MutationImpl::finish_locking added
Updated by Nathan Cutler almost 4 years ago
- Copied to Backport #45687: luminous: mds: FAILED assert(locking == lock) in MutationImpl::finish_locking added
Updated by Nathan Cutler almost 4 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".