Bug #13004: OSD deadlocked. - Ceph - Ceph


Below is the error log from an OSD that crashed unexpectedly:

 <pre> 
 2015-09-07 11:10:35.989338 7fba19fb1700    0 osd.10 183299 prepare_to_stop starting shutdown 
 2015-09-07 11:10:35.991902 7fba19fb1700 -1 common/Mutex.cc: In function 'void Mutex::Lock(bool)' thread 7fba19fb1700 time 2015-09-07 11:10:35.989362 
 common/Mutex.cc: 95: FAILED assert(r == 0) 

  ceph version 0.87.2 (87a7cec9ab11c677de2ab23a7668a77d2f5b955e) 
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xa4) [0xbacb54] 
  2: (Mutex::Lock(bool)+0x105) [0xb5b9e5] 
  3: (OSD::shutdown()+0x7f) [0x671c4f] 
  4: (OSD::handle_osd_map(MOSDMap*)+0x19d2) [0x6aa4b2] 
  5: (OSD::_dispatch(Message*)+0x41b) [0x6ac38b] 
  6: (OSD::ms_dispatch(Message*)+0x267) [0x6ac8a7] 
  7: (DispatchQueue::entry()+0x62a) [0xc53eaa] 
  8: (DispatchQueue::DispatchThread::entry()+0xd) [0xb88b4d] 
  9: (()+0x7df3) [0x7fba2c355df3] 
  10: (clone()+0x6d) [0x7fba2ae3854d] 
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. 
 </pre> 

 The direct cause of this assert is that the dispatch thread tries to acquire osd_lock in OSD::shutdown() 
 while it already holds that lock, having taken it earlier in OSD::ms_dispatch(). See below: 

 <pre><code class="cpp"> 
 bool OSD::ms_dispatch(Message *m) 
 { 
   if (m->get_type() == MSG_OSD_MARK_ME_DOWN) { 
     service.got_stop_ack(); 
     m->put(); 
     return true; 
   } 

   // lock! 

   osd_lock.Lock(); // the dispatch thread acquires osd_lock here and holds it while dispatching 
  
   ... 
 
 int OSD::shutdown() 
 { 
   if (!service.prepare_to_stop()) 
     return 0; // already shutting down 
   osd_lock.Lock(); // shutdown() tries to acquire the same lock again 
   
   ... 
  
 </code></pre> 
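 The failure mode can be reproduced outside Ceph with a small sketch. It assumes the lock behaves like an error-checking pthread mutex (PTHREAD_MUTEX_ERRORCHECK), which is consistent with the assert firing instead of the thread hanging; whether osd_lock is configured exactly this way in 0.87.2 is an assumption here. The function name relock_result is hypothetical. A second lock attempt by the owning thread returns a non-zero error code, which is the condition that assert(r == 0) in Mutex::Lock rejects.

```cpp
#include <pthread.h>
#include <cassert>
#include <cerrno>

// Hypothetical helper: locks an error-checking mutex twice from the same
// thread and returns the result of the second attempt.
int relock_result() {
  pthread_mutexattr_t attr;
  pthread_mutexattr_init(&attr);
  // Assumption: the mutex is of the error-checking type, so a relock by
  // the owner reports an error instead of blocking forever.
  pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);

  pthread_mutex_t m;
  pthread_mutex_init(&m, &attr);
  pthread_mutexattr_destroy(&attr);

  int first = pthread_mutex_lock(&m);   // analogous to ms_dispatch() taking osd_lock
  assert(first == 0);
  int second = pthread_mutex_lock(&m);  // analogous to shutdown() locking again

  pthread_mutex_unlock(&m);
  pthread_mutex_destroy(&m);
  return second;
}
```

 With such a mutex the second call is expected to return EDEADLK, so a wrapper that asserts on a zero return value aborts the process at that point.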
 The underlying reason the OSD fails while processing the CEPH_MSG_OSD_MAP message is that it was wrongly marked down: 
 " -5592> 2015-09-07 11:09:33.006845 7fba19fb1700    0 log_channel(default) log [WRN] : map e183299 wrongly marked me down" 
 and the subsequent cluster_messenger rebind failed (see the separately reported Bug #13002: Accepter::bind won't work correctly in some exception cases). My proposed fix for this rather rare scenario is the following, for your information: 

 <pre><code class="cpp"> 
 int OSD::shutdown() 
 { 
   if (!service.prepare_to_stop()) 
     return 0; // already shutting down 

   

   bool need_lock = !osd_lock.is_locked_by_me(); 
   if (need_lock) 
     osd_lock.Lock(); 
   if (is_stopping()) { 
     
     if (need_lock) 
       osd_lock.Unlock(); 
     return 0; 
   } 
   ... 
  
   if (!need_lock) 
     osd_lock.Lock(); // restore the lock state; the caller remains responsible for releasing it 
   return r; 
 } 
 </code></pre>
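
 The conditional-locking pattern in the proposal can be sketched in isolation as follows. OwnedMutex and shutdown_sketch are hypothetical stand-ins, not Ceph code: OwnedMutex only imitates a mutex that records its owner so that is_locked_by_me() can be answered, and the sketch assumes, as the final re-lock in the proposal implies, that the elided body of shutdown() releases osd_lock partway through.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical stand-in for a mutex that remembers which thread owns it.
class OwnedMutex {
  std::mutex m_;
  std::atomic<std::thread::id> owner_{std::thread::id()};

public:
  void Lock() {
    m_.lock();
    owner_.store(std::this_thread::get_id());
  }
  void Unlock() {
    owner_.store(std::thread::id());  // clear the owner before releasing
    m_.unlock();
  }
  bool is_locked_by_me() const {
    return owner_.load() == std::this_thread::get_id();
  }
};

// Simplified shape of the proposed shutdown(): lock only if the caller
// does not already hold the lock, and return with the caller's original
// ownership restored.
int shutdown_sketch(OwnedMutex& osd_lock) {
  const bool need_lock = !osd_lock.is_locked_by_me();
  if (need_lock)
    osd_lock.Lock();

  // ... teardown that requires the lock would go here ...

  // Assumption: the real function drops the lock before its final steps.
  osd_lock.Unlock();

  // ... teardown that must run without the lock would go here ...

  if (!need_lock)
    osd_lock.Lock();  // the caller held the lock on entry, so hand it back locked
  return 0;
}
```

 Called from a thread that does not hold the lock, the sketch returns with the lock released; called from a thread that already holds it (the ms_dispatch path in this report), it returns with the lock still held, so the caller's later Unlock() stays balanced.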
