Bug #17738: Deadlock when shutdown() is called while still in init() - mgr - Ceph

Actions

Copy link

Bug #17738

closed

Deadlock when shutdown() is called while still in init()

Added by John Spray over 7 years ago. Updated almost 7 years ago.

Status:

Duplicate

Priority:

High

Assignee:

Category:

ceph-mgr

Target version:

% Done:

Source:

other

Tags:

Backport:

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

ceph-qa-suite:

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

Thread 2 (Thread 0x7fd7ed1ed700 (LWP 5473)):
#0  0x00007fd7f7b246d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fd7fadb2f40 in C_SaferCond::wait() ()
#2  0x00007fd7fadc8cd5 in Mgr::init() ()
#3  0x00007fd7fadb2749 in Context::complete(int) ()
#4  0x00007fd7fae56c26 in Finisher::finisher_thread_entry() ()
#5  0x00007fd7f7b20dc5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007fd7f6c0bced in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x7fd7f01f3700 (LWP 14694)):
#0  0x00007fd7f7b21ef7 in pthread_join () from /lib64/libpthread.so.0
#1  0x00007fd7faf67db0 in Thread::join(void**) ()
#2  0x00007fd7fae561e0 in Finisher::stop() ()
#3  0x00007fd7fadc3e2f in Mgr::shutdown() ()
#4  0x00007fd7fadbd8e8 in MgrStandby::handle_mgr_map(MMgrMap*) ()
#5  0x00007fd7fadbe25d in MgrStandby::ms_dispatch(Message*) ()
#6  0x00007fd7fb03329a in DispatchQueue::entry() ()
#7  0x00007fd7faedffed in DispatchQueue::DispatchThread::entry() ()
#8  0x00007fd7f7b20dc5 in start_thread () from /lib64/libpthread.so.0
#9  0x00007fd7f6c0bced in clone () from /lib64/libc.so.6

init drops the lock while it waits for maps. shutdown blocks on all finishers since this commit:

commit ce6b1909dd5579326dc331937c002078c3a1e55a
Author: xie xingguo <xie.xingguo@zte.com.cn>
Date:   Sun Oct 9 10:55:12 2016 +0800

    mgr: shutdown finisher in a graceful way

    Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>

The shutdown function probably needs to either not wait for the finisher to drain, or needs to drop the Mgr::lock while it drains. Separately to that, init() probably needs to be refactored to avoid blocking, as if the mons are offline and we can't get maps, we should still be able to shutdown cleanly.

Related issues 1 (0 open — 1 closed)

Actions

Copy link

Updated by Chang Liu over 7 years ago

i'm working on this. how about checking waiting_for_fs_map in Mgr::shutdown? if waiting_for_fs_map exists, we could call “waiting_for_fs_map->complete(-1);" to release C_SaferCond.

Actions

Copy link

Updated by Chang Liu over 7 years ago

I'm testing it, modifies as following:

diff --git a/src/mgr/Mgr.cc b/src/mgr/Mgr.cc
index 8faf6bd..e788f73 100644
--- a/src/mgr/Mgr.cc
+++ b/src/mgr/Mgr.cc
@@ -192,9 +192,13 @@ void Mgr::init()
   // Wait for FSMap
   dout(4) << "waiting for FSMap..." << dendl;
   lock.Unlock();
-  cond.wait();
+  int r = cond.wait();
   lock.Lock();
   waiting_for_fs_map = nullptr;
+  if (r == -1) {
+    dout(4) << "Get FSMap failed, probablely called shutdown." << dendl;
+    return;
+  }
   dout(4) << "Got FSMap." << dendl;

   // Wait for MgrDigest...?
@@ -341,6 +345,13 @@ void Mgr::shutdown()
   // First stop the server so that we're not taking any more incoming requests
   server.shutdown();

+  
+  // interupte waiting for fs_map
+  if (waiting_for_fs_map) {
+    waiting_for_fs_map->complete(-1);
+    waiting_for_fs_map = NULL;
+  }
+
   // Then stop the finisher to ensure its enqueued contexts aren't going
   // to touch references to the things we're about to tear down
   finisher.wait_for_empty();