Bug #13004
closed
OSD deadlocked.
% Done: 0%
Source: other
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Description
Below is the error log from an OSD that terminated unexpectedly:
2015-09-07 11:10:35.989338 7fba19fb1700 0 osd.10 183299 prepare_to_stop starting shutdown
2015-09-07 11:10:35.991902 7fba19fb1700 -1 common/Mutex.cc: In function 'void Mutex::Lock(bool)' thread 7fba19fb1700 time 2015-09-07 11:10:35.989362
common/Mutex.cc: 95: FAILED assert(r == 0)
ceph version 0.87.2 (87a7cec9ab11c677de2ab23a7668a77d2f5b955e)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xa4) [0xbacb54]
2: (Mutex::Lock(bool)+0x105) [0xb5b9e5]
3: (OSD::shutdown()+0x7f) [0x671c4f]
4: (OSD::handle_osd_map(MOSDMap*)+0x19d2) [0x6aa4b2]
5: (OSD::_dispatch(Message*)+0x41b) [0x6ac38b]
6: (OSD::ms_dispatch(Message*)+0x267) [0x6ac8a7]
7: (DispatchQueue::entry()+0x62a) [0xc53eaa]
8: (DispatchQueue::DispatchThread::entry()+0xd) [0xb88b4d]
9: (()+0x7df3) [0x7fba2c355df3]
10: (clone()+0x6d) [0x7fba2ae3854d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
The direct cause of this assert is that the dispatch thread tries to acquire osd_lock in OSD::shutdown()
while it already holds that lock, having taken it earlier in OSD::ms_dispatch(). See below:
bool OSD::ms_dispatch(Message *m)
{
  if (m->get_type() == MSG_OSD_MARK_ME_DOWN) {
    service.got_stop_ack();
    m->put();
    return true;
  }
  // lock!
  osd_lock.Lock(); // the dispatch thread acquires osd_lock here
  ...
int OSD::shutdown()
{
  if (!service.prepare_to_stop())
    return 0; // already shutting down
  osd_lock.Lock(); // shutdown tries to take the same lock again on the same thread
  ...
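The failed assert(r == 0) in the backtrace can be reproduced in isolation. The following is a minimal sketch, not Ceph code: it assumes the mutex behind osd_lock behaves like a POSIX error-checking mutex (PTHREAD_MUTEX_ERRORCHECK), for which a second lock attempt by the owning thread returns an error instead of blocking. The helper name relock_error_checking_mutex is hypothetical.

```cpp
#include <cassert>
#include <cerrno>
#include <pthread.h>

// Hypothetical helper (not part of Ceph): lock an error-checking mutex twice
// from the same thread and return the result of the second attempt.
int relock_error_checking_mutex() {
  pthread_mutexattr_t attr;
  pthread_mutexattr_init(&attr);
  pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);

  pthread_mutex_t m;
  pthread_mutex_init(&m, &attr);
  pthread_mutexattr_destroy(&attr);

  int first = pthread_mutex_lock(&m);   // first acquisition succeeds
  assert(first == 0);
  (void)first;

  // Same thread, lock already held: an error-checking mutex reports the
  // problem through the return value rather than deadlocking.
  int second = pthread_mutex_lock(&m);

  pthread_mutex_unlock(&m);
  pthread_mutex_destroy(&m);
  return second;
}
```

Under that assumption the second call returns EDEADLK, so a wrapper that asserts the return value is 0 aborts the process, which matches frames 2 and 3 of the backtrace.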
The real reason the OSD fails while processing the CEPH_MSG_OSD_MAP message is that it was wrongly marked down:
" -5592> 2015-09-07 11:09:33.006845 7fba19fb1700 0 log_channel(default) log [WRN] : map e183299 wrongly marked me down"
and the subsequent cluster_messenger rebind failed (see the separately reported Bug #13002, "Accepter::bind won't work correctly in some exception cases"). My proposed fix for this rather rare scenario is the following (for your information):
int OSD::shutdown()
{
  if (!service.prepare_to_stop())
    return 0; // already shutting down
  // only take osd_lock if the calling thread does not already hold it
  bool need_lock = !osd_lock.is_locked_by_me();
  if (need_lock)
    osd_lock.Lock();
  if (is_stopping()) {
    if (need_lock)
      osd_lock.Unlock();
    return 0;
  }
  ...
  if (!need_lock)
    osd_lock.Lock(); // restore the lock state; the caller remains responsible for unlocking
  return r;
}
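The idea behind the proposed change can be shown outside of Ceph. This is a simplified sketch under stated assumptions: OwnedMutex is a hypothetical stand-in for a mutex that records its owner (it is not Ceph's Mutex class), and shutdown_sketch is a hypothetical function that keeps only the "lock if I do not already hold it" logic, leaving the lock in the same state on exit as on entry.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical owner-tracking mutex; stands in for a lock that can answer
// "does the calling thread hold me?".
class OwnedMutex {
 public:
  void lock() {
    m_.lock();
    owner_.store(std::this_thread::get_id());
  }
  void unlock() {
    owner_.store(std::thread::id());  // reset to "no owner" before releasing
    m_.unlock();
  }
  bool is_locked_by_me() const {
    return owner_.load() == std::this_thread::get_id();
  }

 private:
  std::mutex m_;
  std::atomic<std::thread::id> owner_{std::thread::id()};
};

// Hypothetical shutdown path: takes the lock only when the caller does not
// already hold it. Returns true if this call acquired the lock itself.
bool shutdown_sketch(OwnedMutex& osd_lock) {
  bool need_lock = !osd_lock.is_locked_by_me();
  if (need_lock)
    osd_lock.lock();

  // ... teardown work that requires the lock would go here ...

  if (need_lock)
    osd_lock.unlock();  // release only what this call acquired
  return need_lock;
}
```

Called from a thread that already holds the lock (as the dispatch thread does in the backtrace), shutdown_sketch skips the second acquisition and returns with the lock still held by the caller. Called without the lock, it acquires and releases the lock itself.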
Updated by Kefu Chai over 8 years ago
- Status changed from New to Fix Under Review
- Assignee set to xie xingguo
Updated by Sage Weil over 8 years ago
- Status changed from Fix Under Review to Resolved