Bug #17738
closedDeadlock when shutdown() is called while still in init()
0%
Description
Thread 2 (Thread 0x7fd7ed1ed700 (LWP 5473)): #0 0x00007fd7f7b246d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00007fd7fadb2f40 in C_SaferCond::wait() () #2 0x00007fd7fadc8cd5 in Mgr::init() () #3 0x00007fd7fadb2749 in Context::complete(int) () #4 0x00007fd7fae56c26 in Finisher::finisher_thread_entry() () #5 0x00007fd7f7b20dc5 in start_thread () from /lib64/libpthread.so.0 #6 0x00007fd7f6c0bced in clone () from /lib64/libc.so.6 Thread 8 (Thread 0x7fd7f01f3700 (LWP 14694)): #0 0x00007fd7f7b21ef7 in pthread_join () from /lib64/libpthread.so.0 #1 0x00007fd7faf67db0 in Thread::join(void**) () #2 0x00007fd7fae561e0 in Finisher::stop() () #3 0x00007fd7fadc3e2f in Mgr::shutdown() () #4 0x00007fd7fadbd8e8 in MgrStandby::handle_mgr_map(MMgrMap*) () #5 0x00007fd7fadbe25d in MgrStandby::ms_dispatch(Message*) () #6 0x00007fd7fb03329a in DispatchQueue::entry() () #7 0x00007fd7faedffed in DispatchQueue::DispatchThread::entry() () #8 0x00007fd7f7b20dc5 in start_thread () from /lib64/libpthread.so.0 #9 0x00007fd7f6c0bced in clone () from /lib64/libc.so.6
init drops the lock while it waits for maps. shutdown blocks on all finishers since this commit:
commit ce6b1909dd5579326dc331937c002078c3a1e55a Author: xie xingguo <xie.xingguo@zte.com.cn> Date: Sun Oct 9 10:55:12 2016 +0800 mgr: shutdown finisher in a graceful way Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
The shutdown function probably needs to either not wait for the finisher to drain, or needs to drop the Mgr::lock while it drains. Separately to that, init() probably needs to be refactored to avoid blocking, as if the mons are offline and we can't get maps, we should still be able to shutdown cleanly.
Updated by Chang Liu over 7 years ago
i'm working on this. how about checking waiting_for_fs_map in Mgr::shutdown? if waiting_for_fs_map exists, we could call “waiting_for_fs_map->complete(-1);" to release C_SaferCond.
Updated by Chang Liu over 7 years ago
I'm testing it, modifies as following:
diff --git a/src/mgr/Mgr.cc b/src/mgr/Mgr.cc
index 8faf6bd..e788f73 100644
--- a/src/mgr/Mgr.cc
+++ b/src/mgr/Mgr.cc
@@ -192,9 +192,13 @@ void Mgr::init()
// Wait for FSMap
dout(4) << "waiting for FSMap..." << dendl;
lock.Unlock();
- cond.wait();
+ int r = cond.wait();
lock.Lock();
waiting_for_fs_map = nullptr;
+ if (r == -1) {
+ dout(4) << "Get FSMap failed, probablely called shutdown." << dendl;
+ return;
+ }
dout(4) << "Got FSMap." << dendl;
// Wait for MgrDigest...?
@@ -341,6 +345,13 @@ void Mgr::shutdown()
// First stop the server so that we're not taking any more incoming requests
server.shutdown();
+
+ // interupte waiting for fs_map
+ if (waiting_for_fs_map) {
+ waiting_for_fs_map->complete(-1);
+ waiting_for_fs_map = NULL;
+ }
+
// Then stop the finisher to ensure its enqueued contexts aren't going
// to touch references to the things we're about to tear down
finisher.wait_for_empty();
Updated by Chang Liu over 7 years ago
By the way, John, How could I reproduce this bug easily? very thanks
Updated by John Spray about 7 years ago
Chang: hi, I think I wasn't receiving emails for mgr tickets, so I didn't notice your comments. Are you still working on this?
Updated by Kefu Chai almost 7 years ago
- Is duplicate of Bug #19743: mgr: segv in Mgr::shutdown() in PyThread_acquire_lock() added