Bug #13004: OSD deadlocked. - Ceph - Ceph

Bug #13004

closed

OSD deadlocked.

Added by xie xingguo over 8 years ago. Updated over 8 years ago.

Status: Resolved
Priority: High
Assignee: xie xingguo
Source: other
Severity: 3 - minor
Affected Versions: v0.94.2

Description

Below is the error log from an OSD that terminated unexpectedly:

2015-09-07 11:10:35.989338 7fba19fb1700  0 osd.10 183299 prepare_to_stop starting shutdown
2015-09-07 11:10:35.991902 7fba19fb1700 -1 common/Mutex.cc: In function 'void Mutex::Lock(bool)' thread 7fba19fb1700 time 2015-09-07 11:10:35.989362
common/Mutex.cc: 95: FAILED assert(r == 0)

 ceph version 0.87.2 (87a7cec9ab11c677de2ab23a7668a77d2f5b955e)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xa4) [0xbacb54]
 2: (Mutex::Lock(bool)+0x105) [0xb5b9e5]
 3: (OSD::shutdown()+0x7f) [0x671c4f]
 4: (OSD::handle_osd_map(MOSDMap*)+0x19d2) [0x6aa4b2]
 5: (OSD::_dispatch(Message*)+0x41b) [0x6ac38b]
 6: (OSD::ms_dispatch(Message*)+0x267) [0x6ac8a7]
 7: (DispatchQueue::entry()+0x62a) [0xc53eaa]
 8: (DispatchQueue::DispatchThread::entry()+0xd) [0xb88b4d]
 9: (()+0x7df3) [0x7fba2c355df3]
 10: (clone()+0x6d) [0x7fba2ae3854d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

The direct cause of this assert is that the dispatch thread tries to acquire osd_lock
in OSD::shutdown while it already holds that lock from the earlier OSD::ms_dispatch call, see below:

bool OSD::ms_dispatch(Message *m)
{
  if (m->get_type() == MSG_OSD_MARK_ME_DOWN) {
    service.got_stop_ack();
    m->put();
    return true;
  }

  // lock!

  osd_lock.Lock(); // the dispatch thread acquires osd_lock here and still holds it when shutdown() is called

  ...

int OSD::shutdown()
{
  if (!service.prepare_to_stop())
    return 0; // already shutting down
  osd_lock.Lock(); // shutdown tries to acquire the same lock again

  ...
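
The double acquisition described above can be reproduced outside Ceph. The following is a minimal sketch, assuming that the mutex behind osd_lock behaves like a POSIX error-checking mutex (the report does not state this; it is inferred from the failed assert(r == 0) in Mutex::Lock). The function name relock_errorcheck_mutex is hypothetical and not part of Ceph.

```cpp
#include <cassert>
#include <cerrno>
#include <pthread.h>

// Hypothetical helper: locks an error-checking mutex twice from the same
// thread and returns the result of the second pthread_mutex_lock() call.
int relock_errorcheck_mutex() {
  pthread_mutexattr_t attr;
  pthread_mutex_t m;

  int r = pthread_mutexattr_init(&attr);
  assert(r == 0);
  // Assumption: an error-checking mutex, so a relock reports an error
  // instead of blocking forever.
  r = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
  assert(r == 0);
  r = pthread_mutex_init(&m, &attr);
  assert(r == 0);
  pthread_mutexattr_destroy(&attr);

  r = pthread_mutex_lock(&m);           // first acquisition (as in ms_dispatch)
  assert(r == 0);
  int second = pthread_mutex_lock(&m);  // second acquisition (as in shutdown)

  pthread_mutex_unlock(&m);             // release the single lock we hold
  pthread_mutex_destroy(&m);
  return second;
}
```

With an error-checking mutex, POSIX specifies that the second call returns EDEADLK rather than 0, which is the kind of non-zero result that an assert(r == 0) would trip on.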

The underlying reason the OSD fails while processing the CEPH_MSG_OSD_MAP message is that it was wrongly marked down:
" -5592> 2015-09-07 11:09:33.006845 7fba19fb1700 0 log_channel(default) log [WRN] : map e183299 wrongly marked me down"
and the subsequent cluster_messenger rebind failed (see the separately reported Bug #13002: Accepter::bind won't work correctly in some exception cases), so OSD::handle_osd_map called OSD::shutdown while the dispatch thread still held osd_lock. My proposed fix for this rather rare scenario is the following (for your information):

int OSD::shutdown()
{
  if (!service.prepare_to_stop())
    return 0; // already shutting down

  bool need_lock = !osd_lock.is_locked_by_me();
  if (need_lock)
    osd_lock.Lock();
  if (is_stopping()) {
    if (need_lock)
      osd_lock.Unlock();
    return 0;
  }
  ...

  if (!need_lock)
    osd_lock.Lock(); // restore the lock state; the caller remains responsible for releasing it
  return r;
}
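
The idea behind the proposed change, taking the lock only when the calling thread does not already hold it, can be sketched in isolation. This is a simplified illustration and not the Ceph implementation: OwnedMutex and shutdown_like are hypothetical names, the owner tracking is built on std::mutex purely for the example, and unlike the proposal above the sketch never drops a lock that the caller already holds.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical mutex wrapper that remembers which thread owns it.
class OwnedMutex {
 public:
  void lock() {
    m_.lock();
    owner_.store(std::this_thread::get_id());
  }
  void unlock() {
    assert(is_locked_by_me());        // only the owner may unlock
    owner_.store(std::thread::id());  // default id means "no owner"
    m_.unlock();
  }
  bool is_locked_by_me() const {
    return owner_.load() == std::this_thread::get_id();
  }

 private:
  std::mutex m_;
  std::atomic<std::thread::id> owner_{std::thread::id()};
};

// Hypothetical shutdown-like routine: acquires the lock only if the caller
// does not hold it, and leaves the lock state as it found it.
int shutdown_like(OwnedMutex& lock, bool& stopped) {
  const bool need_lock = !lock.is_locked_by_me();
  if (need_lock)
    lock.lock();
  stopped = true;  // stands in for the work done under the lock
  if (need_lock)
    lock.unlock();
  return 0;
}
```

Called from a thread that already holds the lock, shutdown_like returns with the lock still held by that thread; called without the lock, it acquires and releases it internally.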