Project

General

Profile

Actions

Bug #17738

closed

Deadlock when shutdown() is called while still in init()

Added by John Spray over 7 years ago. Updated almost 7 years ago.

Status:
Duplicate
Priority:
High
Assignee:
-
Category:
ceph-mgr
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Thread 2 (Thread 0x7fd7ed1ed700 (LWP 5473)):
#0  0x00007fd7f7b246d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fd7fadb2f40 in C_SaferCond::wait() ()
#2  0x00007fd7fadc8cd5 in Mgr::init() ()
#3  0x00007fd7fadb2749 in Context::complete(int) ()
#4  0x00007fd7fae56c26 in Finisher::finisher_thread_entry() ()
#5  0x00007fd7f7b20dc5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007fd7f6c0bced in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x7fd7f01f3700 (LWP 14694)):
#0  0x00007fd7f7b21ef7 in pthread_join () from /lib64/libpthread.so.0
#1  0x00007fd7faf67db0 in Thread::join(void**) ()
#2  0x00007fd7fae561e0 in Finisher::stop() ()
#3  0x00007fd7fadc3e2f in Mgr::shutdown() ()
#4  0x00007fd7fadbd8e8 in MgrStandby::handle_mgr_map(MMgrMap*) ()
#5  0x00007fd7fadbe25d in MgrStandby::ms_dispatch(Message*) ()
#6  0x00007fd7fb03329a in DispatchQueue::entry() ()
#7  0x00007fd7faedffed in DispatchQueue::DispatchThread::entry() ()
#8  0x00007fd7f7b20dc5 in start_thread () from /lib64/libpthread.so.0
#9  0x00007fd7f6c0bced in clone () from /lib64/libc.so.6

init drops the lock while it waits for maps. shutdown blocks on all finishers since this commit:

commit ce6b1909dd5579326dc331937c002078c3a1e55a
Author: xie xingguo <xie.xingguo@zte.com.cn>
Date:   Sun Oct 9 10:55:12 2016 +0800

    mgr: shutdown finisher in a graceful way

    Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>

The shutdown function probably needs to either not wait for the finisher to drain, or needs to drop the Mgr::lock while it drains. Separately to that, init() probably needs to be refactored to avoid blocking, as if the mons are offline and we can't get maps, we should still be able to shutdown cleanly.


Related issues 1 (0 open1 closed)

Is duplicate of mgr - Bug #19743: mgr: segv in Mgr::shutdown() in PyThread_acquire_lock()ResolvedKefu Chai04/21/2017

Actions
Actions #1

Updated by Chang Liu over 7 years ago

i'm working on this. how about checking waiting_for_fs_map in Mgr::shutdown? if waiting_for_fs_map exists, we could call “waiting_for_fs_map->complete(-1);" to release C_SaferCond.

Actions #2

Updated by Chang Liu over 7 years ago

I'm testing it, modifies as following:

diff --git a/src/mgr/Mgr.cc b/src/mgr/Mgr.cc
index 8faf6bd..e788f73 100644
--- a/src/mgr/Mgr.cc
+++ b/src/mgr/Mgr.cc
@@ -192,9 +192,13 @@ void Mgr::init()
   // Wait for FSMap
   dout(4) << "waiting for FSMap..." << dendl;
   lock.Unlock();
-  cond.wait();
+  int r = cond.wait();
   lock.Lock();
   waiting_for_fs_map = nullptr;
+  if (r == -1) {
+    dout(4) << "Get FSMap failed, probablely called shutdown." << dendl;
+    return;
+  }
   dout(4) << "Got FSMap." << dendl;

   // Wait for MgrDigest...?
@@ -341,6 +345,13 @@ void Mgr::shutdown()
   // First stop the server so that we're not taking any more incoming requests
   server.shutdown();

+  
+  // interupte waiting for fs_map
+  if (waiting_for_fs_map) {
+    waiting_for_fs_map->complete(-1);
+    waiting_for_fs_map = NULL;
+  }
+
   // Then stop the finisher to ensure its enqueued contexts aren't going
   // to touch references to the things we're about to tear down
   finisher.wait_for_empty();

Actions #3

Updated by Chang Liu over 7 years ago

By the way, John, How could I reproduce this bug easily? very thanks

Actions #4

Updated by John Spray about 7 years ago

Chang: hi, I think I wasn't receiving emails for mgr tickets, so I didn't notice your comments. Are you still working on this?

Actions #5

Updated by John Spray about 7 years ago

  • Priority changed from Normal to High
Actions #6

Updated by Kefu Chai almost 7 years ago

  • Is duplicate of Bug #19743: mgr: segv in Mgr::shutdown() in PyThread_acquire_lock() added
Actions #7

Updated by Kefu Chai almost 7 years ago

  • Status changed from New to Duplicate
Actions

Also available in: Atom PDF