Bug #8519

closed

msgr: deadlock, blocked on SimpleMessenger::lock

Added by Sage Weil almost 10 years ago. Updated over 9 years ago.

Status: Resolved
Priority: High
Assignee:
Category: OSD
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

lots of threads stuck here:

#1  0x00007fd4efd2a065 in _L_lock_858 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007fd4efd29eba in __pthread_mutex_lock (mutex=0x28b00a0) at pthread_mutex_lock.c:61
#3  0x0000000000a32c23 in Mutex::Lock (this=0x28b0090, no_lockdep=<optimized out>) at common/Mutex.cc:89
#4  0x0000000000a58a86 in Locker (m=..., this=<synthetic pointer>) at ./common/Mutex.h:120
#5  SimpleMessenger::get_connection (this=0x28afd00, dest=...) at msg/SimpleMessenger.cc:370

ubuntu@teuthology:/a/teuthology-2014-06-02_02:30:05-rados-master-testing-basic-plana/285647

core file is on mira041 with matching installed binaries

Actions #1

Updated by Sage Weil almost 10 years ago

(gdb) bt
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x0000000000b328d1 in Wait (mutex=..., this=0x317a470) at ./common/Cond.h:55
#2  Pipe::stop_and_wait (this=0x317a280) at msg/Pipe.cc:1412
#3  0x0000000000b45629 in Pipe::accept (this=0x3d40000) at msg/Pipe.cc:656
#4  0x0000000000b4a25d in Pipe::reader (this=0x3d40000) at msg/Pipe.cc:1423
#5  0x0000000000b4bfcd in Pipe::Reader::entry (this=<optimized out>) at msg/Pipe.h:49
#6  0x00007fd4efd27e9a in start_thread (arg=0x7fd4cebcb700) at pthread_create.c:308
#7  0x00007fd4ee2e83fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#8  0x0000000000000000 in ?? ()


The thread above is holding the lock.
Actions #2

Updated by Sage Weil almost 10 years ago

  • Assignee set to Greg Farnum
Actions #3

Updated by Greg Farnum almost 10 years ago

Okay, the blocking thread (LWP 23351) is waiting for the "existing" pipe to finish dispatching (LWP 23231).
The existing pipe is blocked on the pg_map_lock held by LWP 20959 while trying to get_pg_or_queue_for_pg().
20959 is in PG::lock and blocked on PG::_lock while trying to process an OSD map (handle_osd_map->consume_map).
PG::_lock is held by LWP 20978, which is blocked on get_connection() while trying to process peering messages.

So this is a pretty nice cycle that needs to get broken.
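
For anyone reading along, the cycle above can be written down as a wait-for graph. This is only an illustrative sketch: the LWP ids and lock names come from the analysis above, while `WaitForGraph`, `find_cycle` and `bug_8519_graph` are hypothetical helpers invented for this note, not anything in the Ceph tree.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

// Hypothetical helper type: each blocked thread waits on exactly one other
// thread, so the wait-for graph is a map from waiter LWP to holder LWP.
using WaitForGraph = std::map<int, int>;

// Follow wait-for edges from `start`. Returns the threads on the cycle if the
// chain loops back on itself, or an empty vector if the chain ends.
std::vector<int> find_cycle(const WaitForGraph& g, int start) {
  std::vector<int> path;
  std::set<int> seen;
  int cur = start;
  while (seen.insert(cur).second) {
    path.push_back(cur);
    auto it = g.find(cur);
    if (it == g.end())
      return {};  // cur is not waiting on anyone: no deadlock on this chain
    cur = it->second;
  }
  // cur is the first repeated thread; drop any lead-in before the cycle.
  std::vector<int> cycle;
  bool in_cycle = false;
  for (int t : path) {
    if (t == cur) in_cycle = true;
    if (in_cycle) cycle.push_back(t);
  }
  return cycle;
}

// The edges described in the comment above, keyed by LWP id.
WaitForGraph bug_8519_graph() {
  return {
    {23351, 23231},  // stop_and_wait: waits for the existing pipe to finish dispatching
    {23231, 20959},  // get_pg_or_queue_for_pg: waits for pg_map_lock
    {20959, 20978},  // consume_map: waits for PG::_lock
    {20978, 23351},  // get_connection: waits for SimpleMessenger::lock
  };
}
```

Calling `find_cycle(bug_8519_graph(), 23351)` walks all four threads and returns to the start, so removing any one edge is enough to break the deadlock.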

Actions #4

Updated by Greg Farnum almost 10 years ago

Sage was hoping we could just rearrange things a bit and drop the SimpleMessenger lock while waiting for the Connection to finish fast_dispatching; unfortunately that's definitely not sufficient. We need to atomically:
1) Stop the existing pipe from processing elements
2) Remove the existing pipe from the SimpleMessenger rank_pipe registry.
3) Add the new Pipe to the rank_pipe registry.

This is made more complicated because of course there could be a racing outgoing connection, so the only Messenger solution I can come up with (tell the existing pipe to stop fast_dispatching and start saving messages without holding the msgr->lock, then grab the lock once it's done dispatching) is insufficient. I think we're going to need to solve this in the OSD.
(Ideally, of course, we would make the OSD's fast_dispatch implementation actually lock-free, but I suspect there will be an easier if less elegant fix in how we handle osd maps or something.)
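
To make the three steps above concrete, here is a rough sketch of what "atomically" would look like if stopping the old pipe never had to block. Everything here (`PipeSketch`, `RegistrySketch`, `replace`) is a hypothetical stand-in and not the SimpleMessenger code; only the `rank_pipe` name is taken from the comment above.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical stand-in for a Pipe; stop() here just sets a flag.
struct PipeSketch {
  bool stopped = false;
  void stop() { stopped = true; }
};

struct RegistrySketch {
  std::mutex lock;  // stands in for SimpleMessenger::lock
  std::map<std::string, std::shared_ptr<PipeSketch>> rank_pipe;

  // Performs steps 1-3 in one critical section and returns the pipe that was
  // replaced (or null if there was none).
  std::shared_ptr<PipeSketch> replace(const std::string& peer,
                                      std::shared_ptr<PipeSketch> fresh) {
    std::lock_guard<std::mutex> l(lock);
    std::shared_ptr<PipeSketch> old;
    auto it = rank_pipe.find(peer);
    if (it != rank_pipe.end()) {
      old = it->second;
      // Step 1. In this sketch stop() returns immediately. If it had to wait
      // for an in-flight fast dispatch, we would be blocking with the
      // registry lock held, which is the situation in this bug.
      old->stop();
    }
    rank_pipe[peer] = std::move(fresh);  // steps 2 and 3
    return old;
  }
};
```

The sketch only works because `stop()` is trivial. As soon as step 1 has to wait on a dispatch that may itself take other locks, holding the registry lock across it becomes unsafe, and dropping the lock reopens the race with an outgoing connection.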

Actions #5

Updated by Greg Farnum almost 10 years ago

Unfortunately we didn't think to save the crash data, so it's all gone now. :(

But as I look at this again alongside Sam's work on #8396, I think that makes it easier to handle this. We're really only blocked because ms_fast_dispatch is blocking on getting a lock. I think we can restructure it in such a way that if we fail to grab the lock, we put the message on the waiting_for_pg list, but I'll have to think through the locking details a little more and discuss it with Sam when he's back.
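
A minimal sketch of the try-lock-or-queue idea, assuming a plain `std::mutex` in place of PG::_lock. Apart from `waiting_for_pg`, every name here (`Message`, `PGSketch`, `fast_dispatch_or_queue`, `drain_waiting`) is hypothetical, and this is not the code on any branch, just the shape of the approach.

```cpp
#include <cassert>
#include <deque>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical stand-in for a message.
struct Message { std::string payload; };

struct PGSketch {
  std::mutex lock;                     // stands in for PG::_lock
  std::mutex waiting_lock;             // protects waiting_for_pg; only held briefly
  std::deque<Message> waiting_for_pg;  // messages deferred because the PG lock was busy
  std::deque<Message> handled;         // messages processed (for illustration only)
};

// Returns true if the message was handled inline, false if it was queued.
// The caller never blocks on pg.lock, so the dispatching thread cannot join
// a lock cycle through this path.
bool fast_dispatch_or_queue(PGSketch& pg, Message m) {
  std::unique_lock<std::mutex> l(pg.lock, std::try_to_lock);
  if (l.owns_lock()) {
    pg.handled.push_back(std::move(m));
    return true;
  }
  std::lock_guard<std::mutex> w(pg.waiting_lock);
  pg.waiting_for_pg.push_back(std::move(m));
  return false;
}

// Intended to be called by whoever holds pg.lock, before releasing it.
void drain_waiting(PGSketch& pg) {
  std::lock_guard<std::mutex> w(pg.waiting_lock);
  while (!pg.waiting_for_pg.empty()) {
    pg.handled.push_back(std::move(pg.waiting_for_pg.front()));
    pg.waiting_for_pg.pop_front();
  }
}
```

The sketch deliberately ignores the locking details that still need thought: a later message that wins the try-lock can overtake an earlier queued one, and a message queued between `drain_waiting` and the unlock would sit there until the next drain.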

Actions #6

Updated by Greg Farnum almost 10 years ago

  • Status changed from New to 7

wip-8519-osd-unblocking

I'll schedule a suite once it's going on the gitbuilders. Should add unit tests for the new NotifyingLock class before merging.

Actions #7

Updated by Greg Farnum almost 10 years ago

  • Category changed from msgr to OSD
  • Assignee changed from Greg Farnum to Samuel Just
  • Priority changed from Urgent to High

Giving this to Sam, as he didn't like my proposed solution. Downgrading from "Urgent" as we have yet to reproduce this (I think?).

Actions #8

Updated by Samuel Just almost 10 years ago

  • Status changed from 7 to 12
Actions #9

Updated by Samuel Just over 9 years ago

  • Status changed from 12 to Resolved