Bug #13004
Updated by Kefu Chai over 8 years ago
Below is the error log from an OSD that died unexpectedly:
<pre>
2015-09-07 11:10:35.989338 7fba19fb1700 0 osd.10 183299 prepare_to_stop starting shutdown
2015-09-07 11:10:35.991902 7fba19fb1700 -1 common/Mutex.cc: In function 'void Mutex::Lock(bool)' thread 7fba19fb1700 time 2015-09-07 11:10:35.989362
common/Mutex.cc: 95: FAILED assert(r == 0)
 ceph version 0.87.2 (87a7cec9ab11c677de2ab23a7668a77d2f5b955e)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xa4) [0xbacb54]
 2: (Mutex::Lock(bool)+0x105) [0xb5b9e5]
 3: (OSD::shutdown()+0x7f) [0x671c4f]
 4: (OSD::handle_osd_map(MOSDMap*)+0x19d2) [0x6aa4b2]
 5: (OSD::_dispatch(Message*)+0x41b) [0x6ac38b]
 6: (OSD::ms_dispatch(Message*)+0x267) [0x6ac8a7]
 7: (DispatchQueue::entry()+0x62a) [0xc53eaa]
 8: (DispatchQueue::DispatchThread::entry()+0xd) [0xb88b4d]
 9: (()+0x7df3) [0x7fba2c355df3]
 10: (clone()+0x6d) [0x7fba2ae3854d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
</pre>
The direct cause of this assert is that the dispatch thread tries to take osd_lock in OSD::shutdown while it already holds that lock from the earlier OSD::ms_dispatch call (frames 6 and 3 in the trace above):
<pre><code class="cpp">
bool OSD::ms_dispatch(Message *m)
{
  if (m->get_type() == MSG_OSD_MARK_ME_DOWN) {
    service.got_stop_ack();
    m->put();
    return true;
  }

  // lock!
  osd_lock.Lock(); // from here on, osd_lock is held by the calling thread
  ...

int OSD::shutdown()
{
  if (!service.prepare_to_stop())
    return 0; // already shutting down
  osd_lock.Lock(); // shutdown tries to take the same lock again
  ...
</code></pre>
The underlying reason the OSD fails while handling the CEPH_MSG_OSD_MAP message is that it was wrongly marked down, as this log entry shows:
<pre>
 -5592> 2015-09-07 11:09:33.006845 7fba19fb1700 0 log_channel(default) log [WRN] : map e183299 wrongly marked me down
</pre>
The cluster_messenger rebind that follows then fails (see the separately reported bug #13002, "Accepter::bind won't work correctly in some exception cases"), so handle_osd_map calls OSD::shutdown. My proposed fix for this fairly rare scenario is the following (for your information):
<pre><code class="cpp">
int OSD::shutdown()
{
  if (!service.prepare_to_stop())
    return 0; // already shutting down
  bool need_lock = !osd_lock.is_locked_by_me();
  if (need_lock)
    osd_lock.Lock();   // replaces the unconditional osd_lock.Lock()
  if (is_stopping()) {
    if (need_lock)
      osd_lock.Unlock(); // replaces the unconditional osd_lock.Unlock()
    return 0;
  }
  ...
  if (!need_lock)
    osd_lock.Lock(); // restore lock status, the caller shall take control of it.
  return r;
}
</code></pre>
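To make the failure mode and the proposed pattern easier to try outside of Ceph, here is a minimal standalone sketch. It assumes that the asserting mutex behaves like an error-checking pthread mutex (`PTHREAD_MUTEX_ERRORCHECK`), for which a second lock attempt by the owning thread returns `EDEADLK` instead of hanging. `DemoMutex`, `relock_while_held` and `shutdown_like` are hypothetical names made up for this illustration; they are not Ceph's `Mutex` or `OSD` code.

```cpp
#include <pthread.h>
#include <cassert>
#include <cerrno>

// Hypothetical stand-in for a lock wrapper with owner tracking.
// Owner bookkeeping is not atomic; this sketch is only exercised
// from a single thread.
class DemoMutex {
  pthread_mutex_t m_;
  pthread_t owner_;
  bool held_;

public:
  DemoMutex() : owner_(), held_(false) {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    // Error-checking type: relocking by the owner returns EDEADLK.
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    int r = pthread_mutex_init(&m_, &attr);
    assert(r == 0);
    (void)r;
    pthread_mutexattr_destroy(&attr);
  }
  ~DemoMutex() { pthread_mutex_destroy(&m_); }
  DemoMutex(const DemoMutex&) = delete;
  DemoMutex& operator=(const DemoMutex&) = delete;

  // Returns the raw pthread result so callers can inspect it.
  int lock() {
    int r = pthread_mutex_lock(&m_);
    if (r == 0) {
      owner_ = pthread_self();
      held_ = true;
    }
    return r;
  }

  int unlock() {
    if (!is_locked_by_me())
      return EPERM;
    held_ = false;
    return pthread_mutex_unlock(&m_);
  }

  bool is_locked_by_me() const {
    return held_ && pthread_equal(owner_, pthread_self());
  }
};

// Reproduces the reported situation: the same thread locks twice.
// Returns the result of the second lock attempt.
int relock_while_held() {
  DemoMutex m;
  m.lock();
  int r = m.lock();  // fails because this thread already owns m
  m.unlock();
  return r;
}

// Conditional-lock pattern modelled on the proposed change:
// take the lock only if the caller does not hold it, and hand the
// lock back to a caller that held it on entry.
int shutdown_like(DemoMutex& m) {
  bool need_lock = !m.is_locked_by_me();
  if (need_lock) {
    int r = m.lock();
    if (r != 0)
      return r;
  }
  // ... teardown that needs the lock would go here ...
  m.unlock();
  // ... teardown that must run without the lock would go here ...
  if (!need_lock)
    m.lock();  // restore the caller's lock state
  return 0;
}
```

Calling `shutdown_like` on an unlocked `DemoMutex` leaves it unlocked, while calling it with the lock already held returns with the lock still held, which is the invariant the proposed change relies on. A wrapper that asserts on a non-zero lock result would abort in the `relock_while_held` case instead of returning the error code.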