https://tracker.ceph.com/https://tracker.ceph.com/favicon.ico2017-05-01T02:30:43ZCeph Ceph - Bug #19813: mgr init deadlockhttps://tracker.ceph.com/issues/19813?journal_id=905672017-05-01T02:30:43ZSage Weilsage@newdream.net
<ul></ul><p>The problem is init is done in the background (background_init) but shutdown is called synchronously from Mgr::handle_mgr_map. Shutdown should probably also be async, but i think we need some explicit states to manage this. Like activating[-needs-stop], active, stopping[-needs-start], stopped (or similar).</p> Ceph - Bug #19813: mgr init deadlockhttps://tracker.ceph.com/issues/19813?journal_id=905952017-05-01T21:24:45ZSage Weilsage@newdream.net
<ul></ul><p>I looked at this one a bit more. I started trying to create states in Mgr and then MgrStandby for starting, active, stopping, standby, and then a target state, but it's a bit tedious to restructure it all, and even if we <strong>do</strong> do that, we can still get in a state where init() is blocked in some background async thread and so even though we want to shut down we are stuck in the 'starting' state until init completes (which might never happen if, say, the mons are down).</p>
<p>I think a better path would be to make Mgr state-driven instead of imperitive, so that init() does not block and background_init() can go away. Instead, it would have internal states for the various parts of its initialization, and at any point we can tell it to shut down and it will throw it all away and become deleteable.</p>
<p>Is that feasible, given that the python modules may also be written in a blocking fashion, or are we stuck in a world where mgr is doing blocking things and we have to wait for them?</p> Ceph - Bug #19813: mgr init deadlockhttps://tracker.ceph.com/issues/19813?journal_id=906252017-05-02T15:19:25ZSage Weilsage@newdream.net
<ul></ul><p>/a/sage-2017-05-02_10:48:00-rados-wip-sage-testing---basic-smithi/1092427</p> Ceph - Bug #19813: mgr init deadlockhttps://tracker.ceph.com/issues/19813?journal_id=907192017-05-04T02:32:19ZSage Weilsage@newdream.net
<ul></ul><p>/a/sage-2017-05-03_20:20:09-rados-wip-sage-testing---basic-smithi/1098337</p> Ceph - Bug #19813: mgr init deadlockhttps://tracker.ceph.com/issues/19813?journal_id=907282017-05-04T06:50:04ZKefu Chaitchaikov@gmail.com
<ul></ul><pre>
<code class="cpp syntaxhl"><span class="CodeRay"><span class="directive">void</span> MgrStandby::shutdown()
{
<span class="comment">// Expect already to be locked as we're called from signal handler</span>
assert(lock.is_locked_by_me());
<span class="keyword">if</span> (active_mgr) {
active_mgr->shutdown();
}
objecter->shutdown();
timer.shutdown();
monc->shutdown();
client_messenger->shutdown();
}
</span></code><br /></pre><br />in the outer <code>MgrStandby::shutdown()</code>, the client_messenger is the last one to be stopped. but it should not be, as we are handling MMgrMap sent from mon, and trying to init()/shutdown() Mgr per requested by it:<br /><pre>
2017-05-03 22:03:27.793623 7fe4aa7a7700 1 -- 172.21.15.164:0/781609493 <== mon.1 172.21.15.166:6789/0 6 ==== mgrmap(e 6) v1 ==== 74+0+0 (455897865 0 0) 0x7fe4b9992d00 con 0x7fe4b9e5e700
2017-05-03 22:02:33.789129 7fe4aa7a7700 4 mgr handle_mgr_map received map epoch 4
2017-05-03 22:02:33.789130 7fe4aa7a7700 4 mgr handle_mgr_map active in map: 0 active is 0
2017-05-03 22:02:33.789132 7fe4aa7a7700 -1 mgr handle_mgr_map I was active but no longer am
2017-05-03 22:02:33.789152 7fe4a649c700 10 mgr.server shutdown shutdown
2017-05-03 22:02:33.789160 7fe4a649c700 1 -- 172.21.15.164:6812/23102 mark_down_all
2017-05-03 22:02:33.789574 7fe4a649c700 1 -- 172.21.15.164:6812/23102 shutdown complete.
2017-05-03 22:02:33.789592 7fe4a649c700 10 mgr.server shutdown shutdown done
2017-05-03 22:02:34.480112 7fe4aa7a7700 1 -- 172.21.15.164:0/781609493 <== mon.7 172.21.15.166:6792/0 9 ==== mgrmap(e 5) v1 ==== 75+0+0 (455823694 0 0) 0x7fe4b99921c0 con 0x7fe4b9e5ef00
2017-05-03 22:02:34.480147 7fe4aa7a7700 4 mgr handle_mgr_map received map epoch 5?
2017-05-03 22:02:34.480150 7fe4aa7a7700 4 mgr handle_mgr_map active in map: 1 active is 4111
2017-05-03 22:02:34.480162 7fe4aa7a7700 1 mgr handle_mgr_map Activating!
2017-05-03 22:03:27.793623 7fe4aa7a7700 1 -- 172.21.15.164:0/781609493 <== mon.1 172.21.15.166:6789/0 6 ==== mgrmap(e 6) v1 ==== 74+0+0 (455897865 0 0) 0x7fe4b9992d00 con 0x7fe4b9e5e700
2017-05-03 22:03:27.793637 7fe4aa7a7700 -1 mgr handle_mgr_map I was active but no longer am
</pre> Ceph - Bug #19813: mgr init deadlockhttps://tracker.ceph.com/issues/19813?journal_id=907382017-05-04T09:15:24ZKefu Chaitchaikov@gmail.com
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>Fix Under Review</i></li><li><strong>Assignee</strong> set to <i>Kefu Chai</i></li></ul><p><a class="external" href="https://github.com/ceph/ceph/pull/14952">https://github.com/ceph/ceph/pull/14952</a></p> Ceph - Bug #19813: mgr init deadlockhttps://tracker.ceph.com/issues/19813?journal_id=908932017-05-09T15:01:25ZSage Weilsage@newdream.net
<ul><li><strong>Status</strong> changed from <i>Fix Under Review</i> to <i>Resolved</i></li></ul> Ceph - Bug #19813: mgr init deadlockhttps://tracker.ceph.com/issues/19813?journal_id=918032017-05-30T15:31:39ZSage Weilsage@newdream.net
<ul><li><strong>Status</strong> changed from <i>Resolved</i> to <i>12</i></li></ul><p>on bigbang,<br /><pre>Thread 4 (Thread 0x7ff1062db700 (LWP 3611122)):
#0 0x00007ff110f386d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007ff1134bc528 in Wait (mutex=..., this=0x7ff1062d9cd8) at /usr/src/debug/ceph-12.0.2-1883-gb3f5819/src/common/Cond.h:56
#2 wait (this=<optimized out>) at /usr/src/debug/ceph-12.0.2-1883-gb3f5819/src/common/Cond.h:202
#3 wait (this=<optimized out>) at /usr/src/debug/ceph-12.0.2-1883-gb3f5819/src/mgr/MgrContext.h:39
#4 JSONCommand::wait (this=this@entry=0x7ff1062d9c70) at /usr/src/debug/ceph-12.0.2-1883-gb3f5819/src/mgr/MgrContext.h:53
#5 0x00007ff1134b9462 in Mgr::load_config (this=this@entry=0x7ff11ccb4000) at /usr/src/debug/ceph-12.0.2-1883-gb3f5819/src/mgr/Mgr.cc:330
#6 0x00007ff1134bb7f3 in Mgr::init (this=0x7ff11ccb4000) at /usr/src/debug/ceph-12.0.2-1883-gb3f5819/src/mgr/Mgr.cc:218
#7 0x00007ff113492ada in operator() (a0=<optimized out>, this=<optimized out>) at /usr/src/debug/ceph-12.0.2-1883-gb3f5819/build/boost/include/boost/function/function_template.hpp:771
#8 FunctionContext::finish (this=<optimized out>, r=<optimized out>) at /usr/src/debug/ceph-12.0.2-1883-gb3f5819/src/include/Context.h:489
#9 0x00007ff11348eb39 in Context::complete (this=0x7ff11d1ed000, r=<optimized out>) at /usr/src/debug/ceph-12.0.2-1883-gb3f5819/src/include/Context.h:70
#10 0x00007ff1135de378 in Finisher::finisher_thread_entry (this=0x7ff11ccb4140) at /usr/src/debug/ceph-12.0.2-1883-gb3f5819/src/common/Finisher.cc:72
#11 0x00007ff110f34dc5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007ff10fe1473d in clone () from /lib64/libc.so.6
</pre><br />vs <br /><pre>
Thread 12 (Thread 0x7ff10a2e3700 (LWP 3600515)):
#0 0x00007ff110f386d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007ff1135ddda4 in Wait (mutex=..., this=0x7ff11ccb41e0) at /usr/src/debug/ceph-12.0.2-1883-gb3f5819/src/common/Cond.h:56
#2 Finisher::wait_for_empty (this=this@entry=0x7ff11ccb4140) at /usr/src/debug/ceph-12.0.2-1883-gb3f5819/src/common/Finisher.cc:35
#3 0x00007ff1134b6e91 in Mgr::shutdown (this=0x7ff11ccb4000) at /usr/src/debug/ceph-12.0.2-1883-gb3f5819/src/mgr/Mgr.cc:376
#4 0x00007ff1134af320 in MgrStandby::handle_mgr_map (this=this@entry=0x7fffb166c920, mmap=mmap@entry=0x7ff12099e640) at /usr/src/debug/ceph-12.0.2-1883-gb3f5819/src/mgr/MgrStandby.cc:252
#5 0x00007ff1134afb6e in MgrStandby::ms_dispatch (this=0x7fffb166c920, m=0x7ff12099e640) at /usr/src/debug/ceph-12.0.2-1883-gb3f5819/src/mgr/MgrStandby.cc:272
#6 0x00007ff11380cac2 in ms_deliver_dispatch (m=0x7ff12099e640, this=0x7ff11c970000) at /usr/src/debug/ceph-12.0.2-1883-gb3f5819/src/msg/Messenger.h:617
#7 DispatchQueue::entry (this=0x7ff11c970158) at /usr/src/debug/ceph-12.0.2-1883-gb3f5819/src/msg/DispatchQueue.cc:197
#8 0x00007ff11367bd0d in DispatchQueue::DispatchThread::entry (this=<optimized out>) at /usr/src/debug/ceph-12.0.2-1883-gb3f5819/src/msg/DispatchQueue.h:102
#9 0x00007ff110f34dc5 in start_thread () from /lib64/libpthread.so.0
#10 0x00007ff10fe1473d in clone () from /lib64/libc.so.6
</pre></p> Ceph - Bug #19813: mgr init deadlockhttps://tracker.ceph.com/issues/19813?journal_id=918042017-05-30T16:22:20ZSage Weilsage@newdream.net
<ul><li><strong>Status</strong> changed from <i>12</i> to <i>Resolved</i></li></ul><p>hmm, actually, i think that was just a laggy mon.</p>