
Bug #12437

Mutex Assert from PipeConnection::try_get_pipe

Added by Mark Nelson over 3 years ago. Updated about 3 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version: -
Start date: 07/22/2015
Due date:
% Done: 0%
Source: other
Tags:
Backport: firefly hammer
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

This occurred during a trial run of cbt's ceph_test_rados benchmark while OSD 3 was being marked down/out and back up/in in a loop. State transitions occurred when "ceph health" no longer reported degraded, peering, recovery_wait, stuck, inactive, unclean, or recovery warnings.

     0> 2015-07-22 13:36:47.217698 7fed761ba700 -1 common/Mutex.cc: In function 'void Mutex::Lock(bool)' thread 7fed761ba700 time 2015-07-22 13:36:47.213562
common/Mutex.cc: 95: FAILED assert(r == 0)

 ceph version 0.94.2-108-g45beb86 (45beb86423c3bd74dbafd36c6822e71ad9680e17)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x78) [0xbc9578]
 2: (Mutex::Lock(bool)+0x105) [0xb79ff5]
 3: (PipeConnection::try_get_pipe(Pipe**)+0x18) [0xca9828]
 4: (SimpleMessenger::submit_message(Message*, PipeConnection*, entity_addr_t const&, int, bool)+0x66) [0xba5a96]
 5: (SimpleMessenger::submit_message(Message*, PipeConnection*, entity_addr_t const&, int, bool)+0x427) [0xba5e57]
 6: (SimpleMessenger::_send_message(Message*, Connection*)+0x97) [0xba7977]
 7: (OSDService::send_message_osd_cluster(int, Message*, unsigned int)+0x1fe) [0x6aca9e]
 8: (PG::share_pg_info()+0x4d1) [0x7ed341]
 9: (ReplicatedPG::snap_trimmer()+0x603) [0x84f953]
 10: (OSD::SnapTrimWQ::_process(PG*)+0x1a) [0x6d709a]
 11: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa56) [0xbba226]
 12: (ThreadPool::WorkThread::entry()+0x10) [0xbbb2d0]
 13: (()+0x7ee5) [0x7fed95ee7ee5]
 14: (clone()+0x6d) [0x7fed949c5b8d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

osd.3.log.gz (544 KB) Mark Nelson, 07/22/2015 07:23 PM

ceph_test_rados.txt.gz (86.5 KB) Mark Nelson, 07/22/2015 07:36 PM

recovery.log.gz (13.7 KB) Mark Nelson, 07/22/2015 07:36 PM


Related issues

Duplicated by Ceph - Bug #12575: "Mutex.cc: 95: FAILED assert(r == 0)" Duplicate 08/03/2015
Copied to Ceph - Backport #12838: Mutex Assert from PipeConnection::try_get_pipe Resolved
Copied to Ceph - Backport #12839: Mutex Assert from PipeConnection::try_get_pipe Resolved 07/22/2015

Associated revisions

Revision a140085f (diff)
Added by David Zafman over 3 years ago

osd: Keep a reference count on Connection while calling send_message()

Fixes: #12437

Signed-off-by: David Zafman <>

Revision c94fd926 (diff)
Added by David Zafman over 3 years ago

osd: Keep a reference count on Connection while calling send_message()

Fixes: #12437

Signed-off-by: David Zafman <>
(cherry picked from commit a140085f467889f2743294a3c150f13b62fcdf51)

Revision f39c7917 (diff)
Added by Nathan Cutler over 3 years ago

osd: Keep a reference count on Connection while calling send_message()

Fixes: #12437

Signed-off-by: David Zafman <>
(manual backport of commit a140085)

Conflicts: src/osd/OSD.cc
master has share_map_peer; firefly has osd->_share_map_outgoing

History

#1 Updated by Mark Nelson over 3 years ago

  • Assignee set to David Zafman

#2 Updated by Samuel Just over 3 years ago

  • Priority changed from Normal to Urgent

#3 Updated by Mark Nelson over 3 years ago

FWIW, this appears to have happened suspiciously close to a state transition where OSD 3 was marked down/out:

[Wed Jul 22 13:36:45 CDT 2015] Cluster appears to have healed.
[Wed Jul 22 13:36:46 CDT 2015] Cluster is healthy, but repeat is set.  Moving to markdown state.
[Wed Jul 22 13:36:47 CDT 2015] Marking OSD 3 down.
[Wed Jul 22 13:36:48 CDT 2015] Marking OSD 3 out.
[Wed Jul 22 13:36:48 CDT 2015] Waiting for the cluster to break and heal

I've included the recovery log and ceph_test_rados output as well to show which operations were in flight at the time of the assert.

#4 Updated by David Zafman over 3 years ago

I used Eclipse to find the routines that reference Connection::lock. In Pipe::read_message() there is a lock/unlock pair with no code path that skips the unlock. In all other cases Mutex::Locker is used, so a destructor performs the unlock. There is no missing unlock, and the stack trace shows no recursive code path that would attempt to take the lock twice.

I think there are only two possibilities left. Either there was memory corruption, which will be hard to find, or the Connection was destructed and EINVAL was returned because pthread_mutex_destroy() had already been called on the Mutex.
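
For illustration, here is a minimal standalone sketch (not Ceph code) of the second possibility. Locking a pthread mutex after pthread_mutex_destroy() is undefined per POSIX, but glibc typically returns EINVAL, which is exactly the non-zero r that trips FAILED assert(r == 0) in common/Mutex.cc:

    // Hypothetical repro of the destroyed-mutex case, not taken from Ceph.
    #include <cerrno>
    #include <cstdio>
    #include <pthread.h>

    int main() {
      pthread_mutex_t m;
      pthread_mutex_init(&m, nullptr);
      pthread_mutex_destroy(&m);        // e.g. the owning Connection has been destructed
      int r = pthread_mutex_lock(&m);   // a stale caller still tries to take the lock
      // Ceph's Mutex::Lock() does assert(r == 0); glibc typically hands back EINVAL (22) here.
      printf("pthread_mutex_lock returned %d (EINVAL is %d)\n", r, EINVAL);
      return 0;
    }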

#5 Updated by Haomai Wang over 3 years ago

I suspect the connection was destructed. For example, when OSDService calls send_message_osd_cluster and gets a connection, it only gets a raw pointer to the Connection instead of a ConnectionRef, which would increase the refcount. After the first try_get_pipe, the Connection is released, and submit_message then tries to call the connection's methods and fails to take the lock.

#6 Updated by David Zafman over 3 years ago

Haomai Wang wrote:

I suspect the connection was destructed. For example, when OSDService calls send_message_osd_cluster and gets a connection, it only gets a raw pointer to the Connection instead of a ConnectionRef, which would increase the refcount. After the first try_get_pipe, the Connection is released, and submit_message then tries to call the connection's methods and fails to take the lock.

Yes, we needed the ConnectionRef there while calling send_message().
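
As a rough sketch of the pattern that fix applies (illustrative only, not the actual OSD code; ConnectionRef is stood in by std::shared_ptr here, whereas Ceph uses an intrusive refcount): the callee holds its own reference-counted handle to the Connection for the whole send_message() call, instead of relying on a raw pointer that another thread can invalidate mid-call.

    // Hypothetical illustration of "keep a reference while sending"; names are made up.
    #include <iostream>
    #include <memory>

    struct Connection {
      ~Connection() { std::cout << "Connection destroyed\n"; }
      void send_message(const char *m) { std::cout << "sending: " << m << "\n"; }
    };

    // Stand-in for Ceph's ConnectionRef (an intrusive refcounted pointer in Ceph).
    using ConnectionRef = std::shared_ptr<Connection>;

    // Risky variant: only a raw pointer crosses the call, so if the last owning
    // reference is dropped concurrently, 'con' dangles and the next attempt to
    // take the Connection's internal lock can fail (as in this bug).
    void send_unsafe(Connection *con) { con->send_message("osd map"); }

    // Fixed pattern: the callee holds its own reference for the duration of the
    // call, so the Connection cannot be destructed underneath it.
    void send_safe(ConnectionRef con) { con->send_message("osd map"); }

    int main() {
      ConnectionRef con = std::make_shared<Connection>();
      send_safe(con);   // the refcount held across the call keeps the object alive
      return 0;
    }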

#7 Updated by David Zafman over 3 years ago

  • Status changed from New to Testing

#8 Updated by David Zafman over 3 years ago

  • Status changed from Testing to Pending Backport
  • Backport set to firefly hammer

#10 Updated by Nathan Cutler about 3 years ago

  • Status changed from Pending Backport to Resolved
