Bug #12437


Mutex Assert from PipeConnection::try_get_pipe

Added by Mark Nelson over 8 years ago. Updated over 8 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
David Zafman
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
firefly hammer
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This occurred during a trial run of cbt's ceph_test_rados benchmark while OSD 3 was being marked out/down and up/in in a loop. State transitions occurred once "ceph health" no longer reported degraded, peering, recovery_wait, stuck, inactive, unclean, or recovery warnings.

     0> 2015-07-22 13:36:47.217698 7fed761ba700 -1 common/Mutex.cc: In function 'void Mutex::Lock(bool)' thread 7fed761ba700 time 2015-07-22 13:36:47.213562
common/Mutex.cc: 95: FAILED assert(r == 0)

 ceph version 0.94.2-108-g45beb86 (45beb86423c3bd74dbafd36c6822e71ad9680e17)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x78) [0xbc9578]
 2: (Mutex::Lock(bool)+0x105) [0xb79ff5]
 3: (PipeConnection::try_get_pipe(Pipe**)+0x18) [0xca9828]
 4: (SimpleMessenger::submit_message(Message*, PipeConnection*, entity_addr_t const&, int, bool)+0x66) [0xba5a96]
 5: (SimpleMessenger::submit_message(Message*, PipeConnection*, entity_addr_t const&, int, bool)+0x427) [0xba5e57]
 6: (SimpleMessenger::_send_message(Message*, Connection*)+0x97) [0xba7977]
 7: (OSDService::send_message_osd_cluster(int, Message*, unsigned int)+0x1fe) [0x6aca9e]
 8: (PG::share_pg_info()+0x4d1) [0x7ed341]
 9: (ReplicatedPG::snap_trimmer()+0x603) [0x84f953]
 10: (OSD::SnapTrimWQ::_process(PG*)+0x1a) [0x6d709a]
 11: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa56) [0xbba226]
 12: (ThreadPool::WorkThread::entry()+0x10) [0xbbb2d0]
 13: (()+0x7ee5) [0x7fed95ee7ee5]
 14: (clone()+0x6d) [0x7fed949c5b8d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Files

osd.3.log.gz (544 KB) osd.3.log.gz Mark Nelson, 07/22/2015 07:23 PM
ceph_test_rados.txt.gz (86.5 KB) ceph_test_rados.txt.gz Mark Nelson, 07/22/2015 07:36 PM
recovery.log.gz (13.7 KB) recovery.log.gz Mark Nelson, 07/22/2015 07:36 PM

Related issues 3 (0 open, 3 closed)

Has duplicate: Ceph - Bug #12575: "Mutex.cc: 95: FAILED assert(r == 0)" (Duplicate, 08/03/2015)

Copied to: Ceph - Backport #12838: Mutex Assert from PipeConnection::try_get_pipe (Resolved, Samuel Just)

Copied to: Ceph - Backport #12839: Mutex Assert from PipeConnection::try_get_pipe (Resolved, Loïc Dachary, 07/22/2015)
Actions #1

Updated by Mark Nelson over 8 years ago

  • Assignee set to David Zafman
Actions #2

Updated by Samuel Just over 8 years ago

  • Priority changed from Normal to Urgent

Actions #3

Updated by Mark Nelson over 8 years ago

FWIW, this appears to have happened suspiciously close to a state transition where OSD 3 was marked down/out:

[Wed Jul 22 13:36:45 CDT 2015] Cluster appears to have healed.
[Wed Jul 22 13:36:46 CDT 2015] Cluster is healthy, but repeat is set.  Moving to markdown state.
[Wed Jul 22 13:36:47 CDT 2015] Marking OSD 3 down.
[Wed Jul 22 13:36:48 CDT 2015] Marking OSD 3 out.
[Wed Jul 22 13:36:48 CDT 2015] Waiting for the cluster to break and heal

I've included the recovery log and ceph_test_rados output as well to show which operations were in flight at the time of the assert.

Actions #4

Updated by David Zafman over 8 years ago

I used Eclipse to find the routines that reference Connection::lock. In Pipe::read_message() there is an explicit lock/unlock pair, but no code path bypasses the unlock. All other call sites use a Mutex::Locker, so a destructor performs the unlock. There is no missing unlock, and the stack trace shows no recursive code path that would cause the lock to be taken twice.

I think only 2 possibilities are left: either there was memory corruption, which will be hard to find, or the Connection was destructed and EINVAL was returned because pthread_mutex_destroy() had already been called on the Mutex.
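The Mutex::Locker idiom mentioned above is a plain RAII guard. A minimal sketch of the pattern (using std::mutex in place of Ceph's Mutex; the names here are illustrative, not the actual Ceph code):

```cpp
#include <mutex>

// RAII locker in the style of Ceph's Mutex::Locker: the lock is taken in
// the constructor and the destructor always releases it, so no early
// return or exception can leave the mutex held.
class Locker {
public:
    explicit Locker(std::mutex& m) : m_(m) { m_.lock(); }
    ~Locker() { m_.unlock(); }
    Locker(const Locker&) = delete;
    Locker& operator=(const Locker&) = delete;
private:
    std::mutex& m_;
};

std::mutex g_lock;

// Demonstrates that the guard releases the mutex on scope exit:
// after the inner block, try_lock() succeeds again.
bool run() {
    { Locker l(g_lock); }               // lock taken, then released by ~Locker
    bool free_again = g_lock.try_lock();  // succeeds because Locker unlocked it
    if (free_again) g_lock.unlock();
    return free_again;
}
```

Because every unlock happens in a destructor, a missing-unlock bug is impossible by construction, which is why David's audit could rule that class of error out.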

Actions #5

Updated by Haomai Wang over 8 years ago

I suspect the Connection was destructed. For example, when OSDService calls send_message_osd_cluster and gets the connection, it holds only a raw pointer to the Connection instead of a ConnectionRef, which would increment the reference count. After the first try_get_pipe the Connection was released, so submit_message then called a method on the freed connection and failed to take the lock.

Actions #6

Updated by David Zafman over 8 years ago

Haomai Wang wrote:

I suspect the Connection was destructed. For example, when OSDService calls send_message_osd_cluster and gets the connection, it holds only a raw pointer to the Connection instead of a ConnectionRef, which would increment the reference count. After the first try_get_pipe the Connection was released, so submit_message then called a method on the freed connection and failed to take the lock.

Yes, we needed the ConnectionRef there while calling send_message().
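The fix can be sketched with std::shared_ptr standing in for Ceph's intrusive ConnectionRef (the names and shapes below are illustrative, not the actual OSDService code):

```cpp
#include <memory>

// Hypothetical stand-in for Ceph's Connection. In the real code the
// destructor tears down the pthread mutex, so a later Mutex::Lock() on a
// freed Connection can surface as the EINVAL assert seen in the trace.
struct Connection {
    bool alive = true;
};
using ConnectionRef = std::shared_ptr<Connection>;  // Ceph uses an intrusive refcount

// Buggy shape: only a raw Connection* crosses the call, so nothing stops
// the owning map from dropping the last reference mid-send.
// Fixed shape: take a ConnectionRef for the duration of the send.
bool send_with_ref(ConnectionRef owner_map_entry) {
    ConnectionRef pinned = owner_map_entry;  // ref taken before the send
    owner_map_entry.reset();                 // e.g. OSD marked down: map drops its ref
    return pinned && pinned->alive;          // object is still valid here
}
```

With only the raw pointer, the reset() would free the Connection and the subsequent access would be the use-after-free Haomai describes; holding a ConnectionRef across send_message() keeps the refcount above zero until the call completes.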

Actions #7

Updated by David Zafman over 8 years ago

  • Status changed from New to 7
Actions #8

Updated by David Zafman over 8 years ago

  • Status changed from 7 to Pending Backport
  • Backport set to firefly hammer
Actions #10

Updated by Nathan Cutler over 8 years ago

  • Status changed from Pending Backport to Resolved